
Computer Vision arXiv Daily Digest [6.28]

Author: 公众号-arXiv每日学术速递
Published: 2021-07-02 17:27:06

Visit www.arxivdaily.com for the full digest with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, bookmarking, and posting features!

cs.CV: 46 papers in total today

Transformer (3 papers)

【1】 PVTv2: Improved Baselines with Pyramid Vision Transformer

Authors: Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
Affiliations: The University of Hong Kong; Nanjing University of Science and Technology; IIAI; SenseTime Research
Note: Technical Report
Link: https://arxiv.org/abs/2106.13797
Abstract: Transformer in computer vision has recently shown encouraging progress. In this work, we improve the original Pyramid Vision Transformer (PVTv1) by adding three improvement designs, which include (1) locally continuous features with convolutions, (2) position encodings with zero paddings, and (3) linear complexity attention layers with average pooling. With these simple modifications, our PVTv2 significantly improves PVTv1 on classification, detection, and segmentation. Moreover, PVTv2 achieves much better performance than recent works, including Swin Transformer, under ImageNet-1K pre-training. We hope this work will make state-of-the-art vision Transformer research more accessible. Code is available at https://github.com/whai362/PVT.
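
As a concrete illustration of design (3), here is a minimal PyTorch sketch of a spatial-reduction attention layer made linear by average-pooling keys and values to a fixed grid. The class and variable names are ours, and the released code adds refinements (e.g. a depthwise convolution after pooling) that are omitted here:

```python
import torch
import torch.nn as nn

class LinearSRAttention(nn.Module):
    """Attention whose K/V are average-pooled to pool_size^2 tokens, so the
    cost grows linearly (not quadratically) with the number of input tokens."""
    def __init__(self, dim, num_heads=8, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens of one pyramid stage, with N == h * w
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads, c // self.num_heads).transpose(1, 2)
        pooled = self.pool(x.transpose(1, 2).reshape(b, c, h, w))        # (B, C, 7, 7)
        pooled = pooled.reshape(b, c, -1).transpose(1, 2)                # (B, 49, C)
        kv = self.kv(pooled).reshape(b, -1, 2, self.num_heads, c // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)                                 # each (B, heads, 49, C/heads)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)  # (B, heads, N, 49)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

x = torch.randn(2, 56 * 56, 64)
print(LinearSRAttention(64)(x, 56, 56).shape)  # torch.Size([2, 3136, 64])
```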

【2】 Vision Transformer Architecture Search

Authors: Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, Chang Xu
Affiliations: School of Computer Science, The University of Sydney; SenseTime Research; Beijing University of Posts and Telecommunications; Department of Automation, Tsinghua University; Institute for Artificial Intelligence, Tsinghua University (THUAI)
Link: https://arxiv.org/abs/2106.13700
Abstract: Recently, transformers have shown great superiority in solving computer vision tasks by modeling images as a sequence of manually-split patches with self-attention mechanism. However, current architectures of vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks and have not been sufficiently investigated and optimized. In this paper, we make a further step by examining the intrinsic structure of transformers for vision tasks and propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets. Concretely, we design a new effective yet efficient weight sharing paradigm for ViTs, such that architectures with different token embedding, sequence size, number of heads, width, and depth can be derived from a single super-transformer. Moreover, to cater for the variance of distinct architectures, we introduce private class tokens and self-attention maps in the super-transformer. In addition, to adapt the searching for different budgets, we propose to search the sampling probability of the identity operation. Experimental results show that our ViTAS attains excellent results compared to existing pure transformer architectures. For example, with a 1.3G FLOPs budget, our searched architecture achieves 74.7% top-1 accuracy on ImageNet and is 2.5% higher than the current baseline ViT architecture. Code is available at https://github.com/xiusu/ViTAS.
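
The core of such weight sharing is that a sampled sub-architecture reuses slices of the super-network's parameters rather than owning its own. A hedged sketch of that idea (the class and dimensions below are ours, not from the ViTAS code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperLinear(nn.Module):
    """One shared weight matrix; each sub-architecture uses its leading slice."""
    def __init__(self, max_in, max_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x, in_dim, out_dim):
        # slice the super-weights down to the sampled widths
        return F.linear(x, self.weight[:out_dim, :in_dim], self.bias[:out_dim])

super_fc = SuperLinear(max_in=768, max_out=3072)
x = torch.randn(4, 384)                    # a sampled token width of 384
y = super_fc(x, in_dim=384, out_dim=1536)  # a sampled hidden width of 1536
print(y.shape)                             # torch.Size([4, 1536])
```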

【3】 Shape registration in the time of transformers

Authors: Giovanni Trappolini, Luca Cosmo, Luca Moschella, Riccardo Marin, Simone Melzi, Emanuele Rodolà
Affiliations: Department of Computer Engineering, Sapienza University of Rome; Department of Computer Science
Link: https://arxiv.org/abs/2106.13679
Abstract: In this paper, we propose a transformer-based procedure for the efficient registration of non-rigid 3D point clouds. The proposed approach is data-driven and adopts for the first time the transformer architecture in the registration task. Our method is general and applies to different settings. Given a fixed template with some desired properties (e.g. skinning weights or other animation cues), we can register raw acquired data to it, thereby transferring all the template properties to the input geometry. Alternatively, given a pair of shapes, our method can register the first onto the second (or vice-versa), obtaining a high-quality dense correspondence between the two. In both contexts, the quality of our results enables us to target real applications such as texture transfer and shape interpolation. Furthermore, we also show that including an estimation of the underlying density of the surface eases the learning process. By exploiting the potential of this architecture, we can train our model requiring only a sparse set of ground truth correspondences (10-20% of the total points). The proposed model and the analysis that we perform pave the way for future exploration of transformer-based architectures for registration and matching applications. Qualitative and quantitative evaluations demonstrate that our pipeline outperforms state-of-the-art methods for deformable and unordered 3D data registration on different datasets and scenarios.

Detection (4 papers)

【1】 Partially fake it till you make it: mixing real and fake thermal images for improved object detection

Authors: Francesco Bongini, Lorenzo Berlincioni, Marco Bertini, Alberto Del Bimbo
Affiliations: MICC, Università degli Studi di Firenze, Florence, Italy
Link: https://arxiv.org/abs/2106.13603
Abstract: In this paper we propose a novel data augmentation approach for visual content domains that have scarce training datasets, compositing synthetic 3D objects within real scenes. We show the performance of the proposed system in the context of object detection in thermal videos, a domain where 1) training datasets are very limited compared to visible spectrum datasets and 2) creating full realistic synthetic scenes is extremely cumbersome and expensive due to the difficulty in modeling the thermal properties of the materials of the scene. We compare different augmentation strategies, including state of the art approaches obtained through RL techniques, the injection of simulated data and the employment of a generative model, and study how to best combine our proposed augmentation with these other techniques. Experimental results demonstrate the effectiveness of our approach, and our single-modality detector achieves state-of-the-art results on the FLIR ADAS dataset.
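
At its core, this kind of augmentation pastes a rendered object and its mask into a real frame and emits the matching detection box. A minimal NumPy sketch with invented stand-in data (not the paper's rendering pipeline):

```python
import numpy as np

def composite(frame, obj, alpha, top, left):
    """frame: (H, W) thermal image; obj, alpha: (h, w) rendered object and its mask."""
    h, w = obj.shape
    roi = frame[top:top + h, left:left + w]
    frame[top:top + h, left:left + w] = alpha * obj + (1.0 - alpha) * roi
    return frame, (left, top, left + w, top + h)   # image + new ground-truth box

frame = np.random.rand(512, 640).astype(np.float32)  # stand-in for a real thermal frame
obj = np.full((64, 32), 0.9, dtype=np.float32)       # "hot" synthetic pedestrian
alpha = np.ones_like(obj)                            # opaque mask for simplicity
frame, bbox = composite(frame, obj, alpha, top=200, left=300)
print(bbox)                                          # (300, 200, 332, 264)
```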

【2】 SRPN: similarity-based region proposal networks for nuclei and cells detection in histology images

Authors: Yibao Sun, Xingru Huang, Huiyu Zhou, Qianni Zhang
Affiliations: Huiyu Zhou is with the School of Informatics
Link: https://arxiv.org/abs/2106.13556
Abstract: The detection of nuclei and cells in histology images is of great value in both clinical practice and pathological studies. However, multiple reasons such as morphological variations of nuclei or cells make it a challenging task where conventional object detection methods cannot obtain satisfactory performance in many cases. A detection task consists of two sub-tasks, classification and localization. Under the condition of dense object detection, classification is a key to boost the detection performance. Considering this, we propose similarity based region proposal networks (SRPN) for nuclei and cells detection in histology images. In particular, a customized convolution layer termed as embedding layer is designed for network building. The embedding layer is added into the region proposal networks, enabling the networks to learn discriminative features based on similarity learning. Features obtained by similarity learning can significantly boost the classification performance compared to conventional methods. SRPN can be easily integrated into standard convolutional neural network architectures such as the Faster R-CNN and RetinaNet. We test the proposed approach on tasks of multi-organ nuclei detection and signet ring cells detection in histological images. Experimental results show that networks applying similarity learning achieved superior performance on both tasks when compared to their counterparts. In particular, the proposed SRPN achieves state-of-the-art performance on the MoNuSeg benchmark for nuclei segmentation and detection compared to previous methods, and on the signet ring cell detection benchmark compared with baselines. The source code is publicly available at: https://github.com/sigma10010/nuclei_cells_det.
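
One way to read the embedding layer is as a projection into a space where classification becomes similarity to learned class prototypes. The head below is our hedged interpretation of that idea, not the released architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityHead(nn.Module):
    """Embed proposal features, then score classes by cosine similarity
    to learned per-class prototypes (a stand-in for similarity learning)."""
    def __init__(self, in_dim, embed_dim, num_classes, scale=10.0):
        super().__init__()
        self.embed = nn.Conv2d(in_dim, embed_dim, kernel_size=1)  # "embedding layer"
        self.prototypes = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.scale = scale

    def forward(self, feats):
        # feats: (B, C, H, W) proposal features from the RPN backbone
        z = F.normalize(self.embed(feats).mean(dim=(2, 3)), dim=1)  # (B, D)
        p = F.normalize(self.prototypes, dim=1)                     # (K, D)
        return self.scale * z @ p.t()                               # cosine logits

head = SimilarityHead(in_dim=256, embed_dim=64, num_classes=2)
print(head(torch.randn(8, 256, 7, 7)).shape)  # torch.Size([8, 2])
```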

【3】 To the Point: Efficient 3D Object Detection in the Range Image with Graph Convolution Kernels

Authors: Yuning Chai, Pei Sun, Jiquan Ngiam, Weiyue Wang, Benjamin Caine, Vijay Vasudevan, Xiao Zhang, Dragomir Anguelov
Affiliations: Waymo LLC; Google Brain
Link: https://arxiv.org/abs/2106.13381
Abstract: 3D object detection is vital for many robotics applications. For tasks where a 2D perspective range image exists, we propose to learn a 3D representation directly from this range image view. To this end, we designed a 2D convolutional network architecture that carries the 3D spherical coordinates of each pixel throughout the network. Its layers can consume any arbitrary convolution kernel in place of the default inner product kernel and exploit the underlying local geometry around each pixel. We outline four such kernels: a dense kernel according to the bag-of-words paradigm, and three graph kernels inspired by recent graph neural network advances: the Transformer, the PointNet, and the Edge Convolution. We also explore cross-modality fusion with the camera image, facilitated by operating in the perspective range image view. Our method performs competitively on the Waymo Open Dataset and improves the state-of-the-art AP for pedestrian detection from 69.7% to 75.5%. It is also efficient in that our smallest model, which still outperforms the popular PointPillars in quality, requires 180 times fewer FLOPS and model parameters.
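
To make the kernel substitution concrete, here is a rough edge-convolution-style kernel over 3x3 range-image neighborhoods using relative 3D offsets. It is a simplification of the paper's kernels, with invented sizes:

```python
import torch
import torch.nn as nn

class EdgeConvKernel(nn.Module):
    """Each range-image pixel aggregates its 3x3 neighbors via an MLP on
    (neighbor feature, relative 3D offset), then max-pools over neighbors."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim + 3, out_dim), nn.ReLU())

    def forward(self, feats, xyz):
        # feats: (B, H, W, C) pixel features; xyz: (B, H, W, 3) per-pixel coordinates
        out = []
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                nf = torch.roll(feats, shifts=(dy, dx), dims=(1, 2))
                rel = torch.roll(xyz, shifts=(dy, dx), dims=(1, 2)) - xyz  # local geometry
                out.append(self.mlp(torch.cat([nf, rel], dim=-1)))
        return torch.stack(out).max(dim=0).values

k = EdgeConvKernel(16, 32)
print(k(torch.randn(1, 64, 512, 16), torch.rand(1, 64, 512, 3)).shape)
# torch.Size([1, 64, 512, 32]); torch.roll wraps at borders -- a real kernel would mask them
```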

【4】 RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection

Authors: Pei Sun, Weiyue Wang, Yuning Chai, Gamaleldin Elsayed, Alex Bewley, Xiao Zhang, Cristian Sminchisescu, Dragomir Anguelov
Affiliations: Waymo LLC; Google
Link: https://arxiv.org/abs/2106.13365
Abstract: The detection of 3D objects from LiDAR data is a critical component in most autonomous driving systems. Safe, high speed driving needs larger detection ranges, which are enabled by new LiDARs. These larger detection ranges require more efficient and accurate detection models. Towards this goal, we propose Range Sparse Net (RSN), a simple, efficient, and accurate 3D object detector in order to tackle real time 3D object detection in this extended detection regime. RSN predicts foreground points from range images and applies sparse convolutions on the selected foreground points to detect objects. The lightweight 2D convolutions on dense range images result in significantly fewer selected foreground points, thus enabling the later sparse convolutions in RSN to operate efficiently. Combining features from the range image further enhances detection accuracy. RSN runs at more than 60 frames per second on a 150m x 150m detection region on Waymo Open Dataset (WOD) while being more accurate than previously published detectors. As of 11/2020, RSN is ranked first in the WOD leaderboard based on the APH/LEVEL 1 metrics for LiDAR-based pedestrian and vehicle detection, while being several times faster than alternatives.
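
The foreground gating step reduces to a segmentation-then-gather operation. A simplified sketch that stops before the sparse-convolution stage (network size and threshold are placeholders):

```python
import torch
import torch.nn as nn

seg = nn.Sequential(                      # lightweight 2D foreground scorer
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),
)

range_img = torch.rand(1, 1, 64, 2650)    # (B, 1, rows, cols) range image
xyz = torch.rand(1, 64, 2650, 3)          # per-pixel 3D point coordinates

prob = torch.sigmoid(seg(range_img))[0, 0]   # (rows, cols) foreground probability
keep = prob > 0.5
fg_points = xyz[0][keep]                     # (M, 3) points handed to the sparse convs
print(fg_points.shape, f"{keep.float().mean().item():.1%} of pixels kept")
```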

Classification | Recognition (3 papers)

【1】 Efficient Document Image Classification Using Region-Based Graph Neural Network

Authors: Jaya Krishna Mandivarapu, Eric Bunch, Qian You, Glenn Fung
Affiliations: American Family Insurance, Machine Learning Research Group
Link: https://arxiv.org/abs/2106.13802
Abstract: Document image classification remains a popular research area because it can be commercialized in many enterprise applications across different industries. Recent advancements in large pre-trained computer vision and language models and graph neural networks have lent document image classification many tools. However, using large pre-trained models usually requires substantial computing resources, which could defeat the cost-saving advantages of automatic document image classification. In the paper we propose an efficient document image classification framework that uses graph convolution neural networks and incorporates textual, visual and layout information of the document. We have rigorously benchmarked our proposed algorithm against several state-of-the-art vision and language models on both a publicly available dataset and a real-life insurance document classification dataset. Empirical results on both publicly available and real-world data show that our methods achieve near-SOTA performance yet require much less computing resources and time for model training and inference. This results in solutions that offer better cost advantages, especially in scalable deployment for enterprise applications. We also provide comprehensive comparisons of computing resources, model sizes, and train and inference time between our proposed methods and baselines. In addition, we delineate the cost per image using our method and other baselines.

【2】 HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based Gesture Recognition

Authors: Jianbo Liu, Ying Wang, Shiming Xiang, Chunhong Pan
Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences
Note: Under peer review for TCSVT
Link: https://arxiv.org/abs/2106.13391
Abstract: Previous methods for skeleton-based gesture recognition mostly arrange the skeleton sequence into a pseudo picture or spatial-temporal graph and apply a deep Convolutional Neural Network (CNN) or Graph Convolutional Network (GCN) for feature extraction. Although achieving superior results, these methods have inherent limitations in dynamically capturing local features of interactive hand parts, and the computing efficiency still remains a serious issue. In this work, the self-attention mechanism is introduced to alleviate this problem. Considering the hierarchical structure of hand joints, we propose an efficient hierarchical self-attention network (HAN) for skeleton-based gesture recognition, which is based on pure self-attention without any CNN, RNN or GCN operators. Specifically, the joint self-attention module is used to capture spatial features of fingers, and the finger self-attention module is designed to aggregate features of the whole hand. In terms of temporal features, the temporal self-attention module is utilized to capture the temporal dynamics of the fingers and the entire hand. Finally, these features are fused by the fusion self-attention module for gesture classification. Experiments show that our method achieves competitive results on three gesture recognition datasets with much lower computational complexity.
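
The joint-then-finger hierarchy can be sketched with stock attention layers. The temporal and fusion self-attention modules are omitted, and the tensor layout below is an assumption:

```python
import torch
import torch.nn as nn

dim, joints_per_finger, fingers = 32, 4, 5
joint_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
finger_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

x = torch.randn(8, fingers, joints_per_finger, dim)        # (B, fingers, joints, C)
b = x.shape[0]
j = x.reshape(b * fingers, joints_per_finger, dim)
j, _ = joint_attn(j, j, j)                                 # joint-level self-attention
finger_feat = j.mean(dim=1).reshape(b, fingers, dim)       # summarize each finger
hand_feat, _ = finger_attn(finger_feat, finger_feat, finger_feat)  # finger-level
print(hand_feat.mean(dim=1).shape)                         # (8, 32) hand descriptor
```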

【3】 Generalized One-Class Learning Using Pairs of Complementary Classifiers

Authors: Anoop Cherian, Jue Wang
Affiliations: Jue Wang is with the Research School of Engineering, The Australian National University
Note: Accepted at Trans. PAMI. arXiv admin note: text overlap with arXiv:1908.05884
Link: https://arxiv.org/abs/2106.13272
Abstract: One-class learning is the classic problem of fitting a model to the data for which annotations are available only for a single class. In this paper, we explore novel objectives for one-class learning, which we collectively refer to as Generalized One-class Discriminative Subspaces (GODS). Our key idea is to learn a pair of complementary classifiers to flexibly bound the one-class data distribution, where the data belongs to the positive half-space of one of the classifiers in the complementary pair and to the negative half-space of the other. To avoid redundancy while allowing non-linearity in the classifier decision surfaces, we propose to design each classifier as an orthonormal frame and seek to learn these frames via jointly optimizing for two conflicting objectives, namely: i) to minimize the distance between the two frames, and ii) to maximize the margin between the frames and the data. The learned orthonormal frames will thus characterize a piecewise linear decision surface that allows for efficient inference, while our objectives seek to bound the data within a minimal volume that maximizes the decision margin, thereby robustly capturing the data distribution. We explore several variants of our formulation under different constraints on the constituent classifiers, including kernelized feature maps. We demonstrate the empirical benefits of our approach via experiments on data from several applications in computer vision, such as anomaly detection in video sequences, human poses, and human activities. We also explore the generality and effectiveness of GODS for non-vision tasks via experiments on several UCI datasets, demonstrating state-of-the-art results.
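
The two conflicting objectives lend themselves to a compact optimization sketch. The toy version below is our reading of the abstract, not the paper's algorithm: it replaces the orthonormal-frame (Stiefel manifold) optimization with a soft penalty and omits bias terms and kernelized variants:

```python
import torch

d, k, eta = 16, 3, 1.0                       # feature dim, frame size, desired margin
W1 = torch.randn(d, k, requires_grad=True)   # frame 1: data on its positive side
W2 = torch.randn(d, k, requires_grad=True)   # frame 2: data on its negative side
X = torch.randn(200, d)                      # one-class training features
opt = torch.optim.Adam([W1, W2], lr=1e-2)

for _ in range(300):
    opt.zero_grad()
    s1, s2 = X @ W1, X @ W2
    # (ii) margin: push data into the positive half-spaces of W1, negative of W2
    margin = torch.relu(eta - s1).mean() + torch.relu(eta + s2).mean()
    # (i) keep the two frames close -- deliberately in tension with (ii)
    closeness = (W1 - W2).pow(2).sum()
    # soft orthonormality penalty in place of manifold optimization
    ortho = sum(((W.t() @ W) - torch.eye(k)).pow(2).sum() for W in (W1, W2))
    (margin + 0.1 * closeness + 0.1 * ortho).backward()
    opt.step()

# crude inlier score: how far a point sits inside both bounding half-spaces
score = torch.minimum((X @ W1).min(dim=1).values, (-X @ W2).min(dim=1).values)
print(float(score.mean()))
```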

Segmentation | Semantics (3 papers)

【1】 Diversifying Semantic Image Synthesis and Editing via Class- and Layer-wise VAEs

Authors: Yuki Endo, Yoshihiro Kanamori
Affiliations: University of Tsukuba
Note: Accepted to Pacific Graphics 2020; code available at this https URL
Link: https://arxiv.org/abs/2106.13416
Abstract: Semantic image synthesis is a process for generating photorealistic images from a single semantic mask. To enrich the diversity of multimodal image synthesis, previous methods have controlled the global appearance of an output image by learning a single latent space. However, a single latent code is often insufficient for capturing various object styles because object appearance depends on multiple factors. To handle individual factors that determine object styles, we propose a class- and layer-wise extension to the variational autoencoder (VAE) framework that allows flexible control over each object class at the local to global levels by learning multiple latent spaces. Furthermore, we demonstrate that our method generates images that are both plausible and more diverse compared to state-of-the-art methods via extensive experiments with real and synthetic datasets in three different domains. We also show that our method enables a wide range of applications in image synthesis and editing tasks.

【2】 Semi-supervised Meta-learning with Disentanglement for Domain-generalised Medical Image Segmentation

Authors: Xiao Liu, Spyridon Thermos, Alison O'Neil, Sotirios A. Tsaftaris
Affiliations: School of Engineering, University of Edinburgh, Edinburgh, UK; The Alan Turing Institute, London, UK; Canon Medical Research Europe Ltd., Edinburgh, UK
Note: Accepted by MICCAI 2021
Link: https://arxiv.org/abs/2106.13292
Abstract: Generalising deep models to new data from new centres (termed here domains) remains a challenge. This is largely attributed to shifts in data statistics (domain shifts) between source and unseen domains. Recently, gradient-based meta-learning approaches where the training data are split into meta-train and meta-test sets to simulate and handle the domain shifts during training have shown improved generalisation performance. However, the current fully supervised meta-learning approaches are not scalable for medical image segmentation, where large effort is required to create pixel-wise annotations. Meanwhile, in a low data regime, the simulated domain shifts may not approximate the true domain shifts well across source and unseen domains. To address this problem, we propose a novel semi-supervised meta-learning framework with disentanglement. We explicitly model the representations related to domain shifts. Disentangling the representations and combining them to reconstruct the input image allows unlabeled data to be used to better approximate the true domain shifts for meta-learning. Hence, the model can achieve better generalisation performance, especially when there is a limited amount of labeled data. Experiments show that the proposed method is robust on different segmentation tasks and achieves state-of-the-art generalisation performance on two public benchmarks.

【3】 Semantic annotation for computational pathology: Multidisciplinary experience and best practice recommendations

Authors: Noorul Wahab, Islam M Miligy, Katherine Dodd, Harvir Sahota, Michael Toss, Wenqi Lu, Mostafa Jahanifar, Mohsin Bilal, Simon Graham, Young Park, Giorgos Hadjigeorghiou, Abhir Bhalerao, Ayat Lashen, Asmaa Ibrahim, Ayaka Katayama, Henry O Ebili, Matthew Parkin, Tom Sorell, Shan E Ahmed Raza, Emily Hero, Hesham Eldaly, Yee Wah Tsang, Kishore Gopalakrishnan, David Snead, Emad Rakha, Nasir Rajpoot, Fayyaz Minhas
Affiliations: Tissue Image Analytics Centre, University of Warwick, Coventry, UK; University of Nottingham, Nottingham, UK; Department of Pathology, Menoufia University, Egypt; University Hospital Coventry and Warwickshire, Coventry, UK
Link: https://arxiv.org/abs/2106.13689
Abstract: Recent advances in whole slide imaging (WSI) technology have led to the development of a myriad of computer vision and artificial intelligence (AI) based diagnostic, prognostic, and predictive algorithms. Computational Pathology (CPath) offers an integrated solution to utilize information embedded in pathology WSIs beyond what we obtain through visual assessment. For automated analysis of WSIs and validation of machine learning (ML) models, annotations at the slide, tissue and cellular levels are required. The annotation of important visual constructs in pathology images is an important component of CPath projects. Improper annotations can result in algorithms which are hard to interpret and can potentially produce inaccurate and inconsistent results. Despite the crucial role of annotations in CPath projects, there are no well-defined guidelines or best practices on how annotations should be carried out. In this paper, we address this shortcoming by presenting the experience and best practices acquired during the execution of a large-scale annotation exercise involving a multidisciplinary team of pathologists, ML experts and researchers as part of the Pathology image data Lake for Analytics, Knowledge and Education (PathLAKE) consortium. We present a real-world case study along with examples of different types of annotations, diagnostic algorithm, annotation data dictionary and annotation constructs. The analyses reported in this work highlight best practice recommendations that can be used as annotation guidelines over the lifecycle of a CPath project.

Zero/Few Shot | Transfer | Domain Adaptation | Adaptation (1 paper)

【1】 Domain-guided Machine Learning for Remotely Sensed In-Season Crop Growth Estimation

Authors: George Worrall, Anand Rangarajan, Jasmeet Judge
Affiliations: University of Florida; A. Rangarajan is with the Department of Computer & Information Science & Engineering, University of Florida
Note: 7 pages, 7 tables, 11 figures
Link: https://arxiv.org/abs/2106.13323
Abstract: Advanced machine learning techniques have been used in remote sensing (RS) applications such as crop mapping and yield prediction, but remain under-utilized for tracking crop progress. In this study, we demonstrate the use of agronomic knowledge of crop growth drivers in a Long Short-Term Memory-based, Domain-guided neural network (DgNN) for in-season crop progress estimation. The DgNN uses a branched structure and attention to separate independent crop growth drivers and capture their varying importance throughout the growing season. The DgNN is implemented for corn, using RS data in Iowa for the period 2003-2019, with USDA crop progress reports used as ground truth. State-wide DgNN performance shows significant improvement over sequential and dense-only NN structures, and a widely-used Hidden Markov Model method. The DgNN had a 3.5% higher Nash-Sutcliffe efficiency over all growth stages and 33% more weeks with highest cosine similarity than the other NNs during test years. The DgNN and Sequential NN were more robust during periods of abnormal crop progress, though estimating the Silking-Grainfill transition was difficult for all methods. Finally, Uniform Manifold Approximation and Projection visualizations of layer activations showed how LSTM-based NNs separate crop growth time-series differently from a dense-only structure. Results from this study exhibit both the viability of NNs in crop growth stage estimation (CGSE) and the benefits of using domain knowledge. The DgNN methodology presented here can be extended to provide near-real time CGSE of other crops.
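
A schematic of the branched-plus-attention layout described above; the number of branches, input features, and the seven-stage output head are invented for illustration:

```python
import torch
import torch.nn as nn

class DgNNSketch(nn.Module):
    """One LSTM branch per crop-growth driver; attention mixes branch summaries
    so driver importance can shift across the season."""
    def __init__(self, n_branches=3, in_dim=4, hidden=32):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.LSTM(in_dim, hidden, batch_first=True) for _ in range(n_branches))
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, 7)            # e.g. seven crop progress stages

    def forward(self, xs):
        # xs: list of (B, T, in_dim) series, one per driver (thermal, moisture, ...)
        h = torch.stack([lstm(x)[0][:, -1] for lstm, x in zip(self.branches, xs)], dim=1)
        w = torch.softmax(self.attn(h), dim=1)      # (B, n_branches, 1) importances
        return self.head((w * h).sum(dim=1))        # stage logits

model = DgNNSketch()
xs = [torch.randn(8, 20, 4) for _ in range(3)]
print(model(xs).shape)                              # torch.Size([8, 7])
```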

Semi-/Weakly-/Unsupervised | Active Learning | Uncertainty (3 papers)

【1】 On the Robustness of Pretraining and Self-Supervision for a Deep Learning-based Analysis of Diabetic Retinopathy

Authors: Vignesh Srinivasan, Nils Strodthoff, Jackie Ma, Alexander Binder, Klaus-Robert Müller, Wojciech Samek
Affiliations: NS and WS are also with BIFOLD – Berlin Institute for the Foundations of Learning and Data; AB is with the Department of Informatics, Oslo University; Technische Universität Berlin
Link: https://arxiv.org/abs/2106.13497
Abstract: There is an increasing number of medical use-cases where classification algorithms based on deep neural networks reach performance levels that are competitive with human medical experts. To alleviate the challenges of small dataset sizes, these systems often rely on pretraining. In this work, we aim to assess the broader implications of these approaches. For diabetic retinopathy grading as exemplary use case, we compare the impact of different training procedures including recently established self-supervised pretraining methods based on contrastive learning. To this end, we investigate different aspects such as quantitative performance, statistics of the learned feature representations, interpretability and robustness to image distortions. Our results indicate that models initialized from ImageNet pretraining report a significant increase in performance, generalization and robustness to image distortions. In particular, self-supervised models show further benefits to supervised models. Self-supervised models with initialization from ImageNet pretraining not only report higher performance, they also reduce overfitting to large lesions along with improvements in taking into account minute lesions indicative of the progression of the disease. Understanding the effects of pretraining in a broader sense that goes beyond simple performance comparisons is of crucial importance for the broader medical imaging community beyond the use-case considered in this work.

【2】 A Novel Self-Learning Framework for Bladder Cancer Grading Using Histopathological Images

Authors: Gabriel García, Anna Esteve, Adrián Colomer, David Ramos, Valery Naranjo
Affiliations: Instituto de Investigación e Innovación en Bioingeniería, Universitat Politècnica de València, Valencia, Spain; Hospital Universitario y Politécnico La Fe, Avinguda de Fernando Abril Martorell, Valencia, Spain
Link: https://arxiv.org/abs/2106.13559
Abstract: Recently, bladder cancer has been significantly increased in terms of incidence and mortality. Currently, two subtypes are known based on tumour growth: non-muscle invasive (NMIBC) and muscle-invasive bladder cancer (MIBC). In this work, we focus on the MIBC subtype because it is of the worst prognosis and can spread to adjacent organs. We present a self-learning framework to grade bladder cancer from histological images stained via immunohistochemical techniques. Specifically, we propose a novel Deep Convolutional Embedded Attention Clustering (DCEAC) which allows classifying histological patches into different severity levels of the disease, according to the patterns established in the literature. The proposed DCEAC model follows a two-step fully unsupervised learning methodology to discern between non-tumour, mild and infiltrative patterns from high-resolution samples of 512x512 pixels. Our system outperforms previous clustering-based methods by including a convolutional attention module, which allows refining the features of the latent space before the classification stage. The proposed network exceeds state-of-the-art approaches by 2-3% across different metrics, achieving a final average accuracy of 0.9034 in a multi-class scenario. Furthermore, the reported class activation maps evidence that our model is able to learn by itself the same patterns that clinicians consider relevant, without incurring prior annotation steps. This fact supposes a breakthrough in muscle-invasive bladder cancer grading which bridges the gap with respect to training the model on labelled data.

【3】 Generalized Unsupervised Clustering of Hyperspectral Images of Geological Targets in the Near Infrared

Authors: Angela F. Gao, Brandon Rasmussen, Peter Kulits, Eva L. Scheller, Rebecca Greenberger, Bethany L. Ehlmann
Affiliations: Caltech
Note: 10 pages, 4 figures. Accepted, CVPR PBVS Workshop 2021
Link: https://arxiv.org/abs/2106.13315
Abstract: The application of infrared hyperspectral imagery to geological problems is becoming more popular as data become more accessible and cost-effective. Clustering and classifying spectrally similar materials is often a first step in applications ranging from economic mineral exploration on Earth to planetary exploration on Mars. Semi-manual classification guided by expertly developed spectral parameters can be time consuming and biased, while supervised methods require abundant labeled data and can be difficult to generalize. Here we develop a fully unsupervised workflow for feature extraction and clustering informed by both expert spectral geologist input and quantitative metrics. Our pipeline uses a lightweight autoencoder followed by Gaussian mixture modeling to map the spectral diversity within any image. We validate the performance of our pipeline at submillimeter-scale with expert-labelled data from the Oman ophiolite drill core and evaluate performance at meters-scale with partially classified orbital data of Jezero Crater on Mars (the landing site for the Perseverance rover). We additionally examine the effects of various preprocessing techniques used in traditional analysis of hyperspectral imagery. This pipeline provides a fast and accurate clustering map of similar geological materials and consistently identifies and separates major mineral classes in both laboratory imagery and remote sensing imagery. We refer to our pipeline as "Generalized Pipeline for Spectroscopic Unsupervised clustering of Minerals (GyPSUM)."
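
The two-stage pipeline maps directly onto a few lines of PyTorch and scikit-learn. Layer widths, training length, and the six mixture components below are illustrative, not the paper's settings:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture

bands, latent = 60, 8
enc = nn.Sequential(nn.Linear(bands, 32), nn.ReLU(), nn.Linear(32, latent))
dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, bands))
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

spectra = torch.rand(4096, bands)            # flattened hyperspectral cube (toy data)
for _ in range(100):                         # train the lightweight autoencoder
    opt.zero_grad()
    loss = ((dec(enc(spectra)) - spectra) ** 2).mean()
    loss.backward()
    opt.step()

z = enc(spectra).detach().numpy()
labels = GaussianMixture(n_components=6).fit_predict(z)  # cluster map of materials
print(np.bincount(labels))
```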

Temporal | Action Recognition | Pose | Video | Motion Estimation (7 papers)

【1】 Animatable Neural Radiance Fields from Monocular RGB Video

Authors: Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, Huchuan Lu
Affiliations: Dalian University of Technology; Tencent AI Lab
Note: 9 pages, 9 figures
Link: https://arxiv.org/abs/2106.13629
Abstract: We present animatable neural radiance fields for detailed human avatar creation from monocular videos. Our approach extends neural radiance fields (NeRF) to dynamic scenes with human movements via introducing explicit pose-guided deformation while learning the scene representation network. In particular, we estimate the human pose for each frame and learn a constant canonical space for the detailed human template, which enables natural shape deformation from the observation space to the canonical space under the explicit control of the pose parameters. To compensate for inaccurate pose estimation, we introduce a pose refinement strategy that updates the initial pose during the learning process, which not only helps to learn more accurate human reconstruction but also accelerates the convergence. In experiments we show that the proposed approach achieves 1) implicit human geometry and appearance reconstruction with high-quality details, 2) photo-realistic rendering of the human from arbitrary views, and 3) animation of the human with arbitrary poses.

【2】 Multiview Video Compression Using Advanced HEVC Screen Content Coding

Authors: Jarosław Samelak, Marek Domański
Affiliations: Poznań University of Technology, Institute of Multimedia Telecommunications, Poland
Link: https://arxiv.org/abs/2106.13574
Abstract: The paper presents a new approach to multiview video coding using Screen Content Coding. It is assumed that for a time instant the frames corresponding to all views are packed into a single frame, i.e., the frame-compatible approach to multiview coding is applied. For such a coding scenario, the paper demonstrates that Screen Content Coding can be efficiently used for multiview video coding. Two approaches are considered: the first using standard HEVC Screen Content Coding, and the second using Advanced Screen Content Coding. The latter is the original proposal of the authors that exploits quarter-pel motion vectors and other nonstandard extensions of HEVC Screen Content Coding. The experimental results demonstrate that multiview video coding even using standard HEVC Screen Content Coding is much more efficient than simulcast HEVC coding. The proposed Advanced Screen Content Coding provides virtually the same coding efficiency as MV-HEVC, which is the state-of-the-art multiview video compression technique. The authors suggest that Advanced Screen Content Coding can be efficiently used within the new Versatile Video Coding (VVC) technology. Nevertheless, a reference multiview extension of VVC does not exist yet; therefore, for VVC-based coding, the experimental comparisons are left for future work.
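
Frame-compatible packing itself is essentially a data-layout step. This sketch only illustrates the arrangement handed to the SCC encoder (side-by-side packing assumed; the encoder itself is out of scope here):

```python
import numpy as np

def pack_views(views):
    """views: list of (H, W) luma arrays for one time instant -> one wide frame."""
    return np.concatenate(views, axis=1)   # side-by-side packing

views = [np.random.randint(0, 256, (1080, 1920), dtype=np.uint8) for _ in range(3)]
packed = pack_views(views)
print(packed.shape)   # (1080, 5760): inter-view redundancy becomes intra-frame
```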

【3】 Video Moment Retrieval with Text Query Considering Many-to-Many Correspondence Using Potentially Relevant Pair

Authors: Sho Maeoki, Yusuke Mukuta, Tatsuya Harada
Affiliations: The University of Tokyo, Tokyo, Japan; RIKEN
Note: 15 pages, 10 figures
Link: https://arxiv.org/abs/2106.13566
Abstract: In this paper we undertake the task of text-based video moment retrieval from a corpus of videos. To train the model, text-moment paired datasets were used to learn the correct correspondences. In typical training methods, ground-truth text-moment pairs are used as positive pairs, whereas other pairs are regarded as negative pairs. However, aside from the ground-truth pairs, some text-moment pairs should be regarded as positive. In this case, one text annotation can be positive for many video moments. Conversely, one video moment can correspond to many text annotations. Thus, there are many-to-many correspondences between the text annotations and video moments. Based on these correspondences, we can form potentially relevant pairs, which are not given as ground truth yet are not negative; effectively incorporating such relevant pairs into training can improve the retrieval performance. The text query should describe what is happening in a video moment. Hence, different video moments annotated with similar texts are likely to contain a similar action, so these pairs can be considered as potentially relevant pairs. In this paper, we propose a novel training method that takes advantage of potentially relevant pairs, which are detected based on linguistic analysis of the text annotations. Experiments on two benchmark datasets revealed that our method improves the retrieval performance both quantitatively and qualitatively.
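
The mining of potentially relevant pairs can be illustrated with a simple textual-similarity rule. The Jaccard stand-in and the 0.5 threshold below are ours, not the paper's linguistic analysis:

```python
def mine_potentially_relevant(annotations, similarity, threshold=0.8):
    """Pairs similar enough are promoted from 'negative' to 'potentially relevant'."""
    relevant = set()
    for i, a in enumerate(annotations):
        for j, b in enumerate(annotations):
            if i != j and similarity(a, b) >= threshold:
                relevant.add((i, j))   # moment j is a potential positive for query i
    return relevant

def jaccard(a, b):                     # stand-in for the paper's linguistic analysis
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

texts = ["a man opens the door", "the man opens a door", "a dog runs outside"]
print(mine_potentially_relevant(texts, jaccard, threshold=0.5))  # {(0, 1), (1, 0)}
```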

【4】 Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Authors: Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran
Affiliations: Applied Artificial Intelligence Institute, Deakin University, Australia
Note: Accepted by IJCAI 2021
Link: https://arxiv.org/abs/2106.13432
Abstract: Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities. This task necessitates learning to reason about objects, relations, and events across visual and linguistic domains in space-time. High-level reasoning demands lifting from associative visual pattern recognition to symbol-like manipulation over objects, their behavior and interactions. Toward reaching this goal we propose an object-oriented reasoning approach in that video is abstracted as a dynamic stream of interacting objects. At each stage of the video event flow, these objects interact with each other, and their interactions are reasoned about with respect to the query and under the overall context of a video. This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) networks. This neural model maintains the objects' consistent lifelines in the form of a hierarchically nested spatio-temporal graph. Within this graph, the dynamic interactive object-oriented representations are built up along the video sequence, hierarchically abstracted in a bottom-up manner, and converge toward the key information for the correct answer. The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-art results in these tasks. Analysis into the model's behavior indicates that object-oriented reasoning is a reliable, interpretable and efficient approach to Video QA.

【5】 Interpreting Depression From Question-wise Long-term Video Recording of SDS Evaluation

Authors: Wanqing Xie, Lizhong Liang, Yao Lu, Chen Wang, Jihong Shen, Hui Luo, Xiaofeng Liu
Affiliations: Lu is with the School of Computer Science and Engineering, Sun Yat-sen University; Harbin Engineering University
Note: Published in IEEE Journal of Biomedical and Health Informatics
Link: https://arxiv.org/abs/2106.13393
Abstract: The Self-Rating Depression Scale (SDS) questionnaire has frequently been used for efficient depression preliminary screening. However, the uncontrollable self-administered measure can be easily affected by insouciantly or deceptively answering, and producing different results from the clinician-administered Hamilton Depression Rating Scale (HDRS) and the final diagnosis. Clinically, facial expression (FE) and actions play a vital role in clinician-administered evaluation, while FE and action are underexplored for self-administered evaluations. In this work, we collect a novel dataset of 200 subjects to evidence the validity of self-rating questionnaires with their corresponding question-wise video recording. To automatically interpret depression from the SDS evaluation and the paired video, we propose an end-to-end hierarchical framework for the long-term variable-length video, which is also conditioned on the questionnaire results and the answering time. Specifically, we resort to a hierarchical model which utilizes a 3D CNN for local temporal pattern exploration and a redundancy-aware self-attention (RAS) scheme for question-wise global feature aggregation. Targeting the redundant long-term FE video processing, our RAS is able to effectively exploit the correlations of each video clip within a question set to emphasize the discriminative information and eliminate the redundancy based on feature pair-wise affinity. Then, the question-wise video feature is concatenated with the questionnaire scores for final depression detection. Our thorough evaluations also show the validity of fusing SDS evaluation and its video recording, and the superiority of our framework to the conventional state-of-the-art temporal modeling methods.

【6】 DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval

Authors: Giorgos Kordopatis-Zilos, Christos Tzelepis, Symeon Papadopoulos, Ioannis Kompatsiaris, Ioannis Patras
Link: https://arxiv.org/abs/2106.13266
Abstract: In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, which we call Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: a) Student Networks at different retrieval performance and computational efficiency trade-offs and b) a Selection Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets -- this leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases and b) that our DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, our method achieves similar mAP with the teacher but is 20 times faster and requires 240 times less storage space. Our collected dataset and implementation are publicly available: https://github.com/mever-team/distill-and-select.
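
At test time the selection network acts as a router. A high-level sketch with a hypothetical confidence gate and stand-in students (not the released DnS models):

```python
import torch
import torch.nn as nn

class Selector(nn.Module):
    """Toy selection network: predicts whether a query needs the fine student."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, query_feat):
        return torch.sigmoid(self.gate(query_feat)).squeeze(-1)  # P(needs fine student)

def retrieve(query_feat, coarse_student, fine_student, selector, tau=0.5):
    if selector(query_feat).item() > tau:
        return fine_student(query_feat)    # slow, accurate spatio-temporal path
    return coarse_student(query_feat)      # fast, global-vector path

selector = Selector(dim=512)
coarse = lambda q: "coarse ranking"        # stand-ins for the student networks
fine = lambda q: "fine ranking"
print(retrieve(torch.randn(1, 512), coarse, fine, selector))
```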

【7】 FOVQA: Blind Foveated Video Quality Assessment

Authors: Yize Jin, Anjul Patney, Richard Webb, Alan Bovik
Affiliations: Department of Electrical and Computer Engineering, The University of Texas at Austin
Link: https://arxiv.org/abs/2106.13328
Abstract: Previous blind or No Reference (NR) video quality assessment (VQA) models largely rely on features drawn from natural scene statistics (NSS), but under the assumption that the image statistics are stationary in the spatial domain. Several of these models are quite successful on standard pictures. However, in Virtual Reality (VR) applications, foveated video compression is regaining attention, and the concept of space-variant quality assessment is of interest, given the availability of increasingly high spatial and temporal resolution contents and practical ways of measuring gaze direction. Distortions from foveated video compression increase with increased eccentricity, implying that the natural scene statistics are space-variant. Towards advancing the development of foveated compression / streaming algorithms, we have devised a no-reference (NR) foveated video quality assessment model, called FOVQA, which is based on new models of space-variant natural scene statistics (NSS) and natural video statistics (NVS). Specifically, we deploy a space-variant generalized Gaussian distribution (SV-GGD) model and a space-variant asynchronous generalized Gaussian distribution (SV-AGGD) model of mean subtracted contrast normalized (MSCN) coefficients and products of neighboring MSCN coefficients, respectively. We devise a foveated video quality predictor that extracts radial basis features, and other features that capture perceptually annoying rapid quality fall-offs. We find that FOVQA achieves state-of-the-art (SOTA) performance on the new 2D LIVE-FBT-FCVR database, as compared with other leading FIQA / VQA models. We have made our implementation of FOVQA available at: http://live.ece.utexas.edu/research/Quality/FOVQA.zip.
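
MSCN coefficients are a standard NSS building block, so the sketch below shows only that shared first step; the space-variant GGD/AGGD fitting per eccentricity is the paper's contribution and is omitted:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(img, sigma=7 / 6, c=1.0):
    """Mean-subtract and contrast-normalize each pixel with a local Gaussian window."""
    mu = gaussian_filter(img, sigma)
    var = gaussian_filter(img * img, sigma) - mu * mu
    return (img - mu) / (np.sqrt(np.maximum(var, 0)) + c)

frame = np.random.rand(1080, 1920) * 255
coeffs = mscn(frame)
print(coeffs.mean(), coeffs.std())   # near zero-mean, roughly unit-spread field
```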

GAN | Adversarial | Attack | Generation (5 papers)

【1】 NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation

Authors: Xiaohui Zeng, Raquel Urtasun, Richard Zemel, Sanja Fidler, Renjie Liao
Affiliations: University of Toronto; Vector Institute; NVIDIA; Canadian Institute for Advanced Research
Note: UAI 2021, code at this https URL
Link: https://arxiv.org/abs/2106.13435
Abstract: In this paper, we present a non-parametric structured latent variable model for image generation, called NP-DRAW, which sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas. Our key contributions are as follows. 1) We propose a non-parametric prior distribution over the appearance of image parts so that the latent variable "what-to-draw" per step becomes a categorical random variable. This improves the expressiveness and greatly eases the learning compared to Gaussians used in the literature. 2) We model the sequential dependency structure of parts via a Transformer, which is more powerful and easier to train compared to RNNs used in the literature. 3) We propose an effective heuristic parsing algorithm to pre-train the prior. Experiments on MNIST, Omniglot, CIFAR-10, and CelebA show that our method significantly outperforms previous structured image models like DRAW and AIR and is competitive to other generic generative models. Moreover, we show that our model's inherent compositionality and interpretability bring significant benefits in the low-data learning regime and latent space editing. Code is available at https://github.com/ZENGXH/NPDRAW.

【2】 Generative Modeling for Multi-task Visual Learning

Authors: Zhipeng Bao, Martial Hebert, Yu-Xiong Wang
Affiliations: Carnegie Mellon University; University of Illinois at Urbana-Champaign
Link: https://arxiv.org/abs/2106.13409
Abstract: Generative modeling has recently shown great promise in computer vision, but it has mostly focused on synthesizing visually realistic images. In this paper, motivated by multi-task learning of shareable feature representations, we consider a novel problem of learning a shared generative model that is useful across various visual perception tasks. Correspondingly, we propose a general multi-task oriented generative modeling (MGM) framework, by coupling a discriminative multi-task network with a generative network. While it is challenging to synthesize both RGB images and pixel-level annotations in multi-task scenarios, our framework enables us to use synthesized images paired with only weak annotations (i.e., image-level scene labels) to facilitate multiple visual tasks. Experimental evaluation on challenging multi-task benchmarks, including NYUv2 and Taskonomy, demonstrates that our MGM framework improves the performance of all the tasks by large margins, consistently outperforming state-of-the-art multi-task approaches.

【3】 Countering Adversarial Examples: Combining Input Transformation and Noisy Training

Authors: Cheng Zhang, Pan Gao
Affiliations: Nanjing University of Aeronautics and Astronautics
Link: https://arxiv.org/abs/2106.13394
Abstract: Recent studies have shown that neural network (NN) based image classifiers are highly vulnerable to adversarial examples, which poses a threat to security-sensitive image recognition tasks. Prior work has shown that JPEG compression can combat the drop in classification accuracy on adversarial examples to some extent. But, as the compression ratio increases, traditional JPEG compression is insufficient to defend those attacks and instead causes an abrupt accuracy decline on benign images. In this paper, with the aim of fully filtering the adversarial perturbations, we first modify the traditional JPEG compression algorithm to make it more favorable for NNs. Specifically, based on an analysis of the frequency coefficients, we design an NN-favored quantization table for compression. Considering compression as a data augmentation strategy, we then combine our model-agnostic preprocessing with noisy training. We fine-tune the pre-trained model by training with images encoded at different compression levels, thus generating multiple classifiers. Finally, since a lower (higher) compression ratio removes both perturbations and original features slightly (aggressively), we use these trained multiple models for a model ensemble. The majority vote of the ensemble of models is adopted as the final prediction. Experimental results show our method can improve defense efficiency while maintaining the original accuracy.
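The ensemble step at the end is straightforward to sketch. Assuming a dict mapping JPEG quality levels to classifiers already fine-tuned at those levels (hypothetical names; the paper's NN-favored quantization table is not reproduced here), inference re-compresses the input per level and takes a majority vote:

```python
import io
from collections import Counter
from PIL import Image
import torch
import torchvision.transforms.functional as TF

def jpeg_round_trip(img: Image.Image, quality: int) -> Image.Image:
    """Encode/decode an image through JPEG at the given quality level."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

@torch.no_grad()
def ensemble_predict(img: Image.Image, models_by_quality: dict) -> int:
    """Majority vote over classifiers fine-tuned at different compression
    levels, e.g. models_by_quality = {90: model_q90, 50: model_q50, ...}."""
    votes = []
    for quality, model in models_by_quality.items():
        x = TF.to_tensor(jpeg_round_trip(img, quality)).unsqueeze(0)
        votes.append(model(x).argmax(dim=1).item())
    return Counter(votes).most_common(1)[0][0]
```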

【4】 Energy-Based Generative Cooperative Saliency Prediction

Authors: Jing Zhang, Jianwen Xie, Zilong Zheng, Nick Barnes
Affiliations: Australian National University; Baidu Research; University of California, Los Angeles
Link: https://arxiv.org/abs/2106.13389
Abstract: Conventional saliency prediction models typically learn a deterministic mapping from images to the corresponding ground truth saliency maps. In this paper, we study the saliency prediction problem from the perspective of generative models by learning a conditional probability distribution over saliency maps given an image, and treating the prediction as a sampling process. Specifically, we propose a generative cooperative saliency prediction framework based on generative cooperative networks, where a conditional latent variable model and a conditional energy-based model are jointly trained to predict saliency in a cooperative manner. We call our model SalCoopNets. The latent variable model serves as a fast but coarse predictor to efficiently produce an initial prediction, which is then refined by the iterative Langevin revision of the energy-based model that serves as a fine predictor. Such a coarse-to-fine cooperative saliency prediction strategy offers the best of both worlds. Moreover, we generalize our framework to the scenario of weakly supervised saliency prediction, where saliency annotation of training images is partially observed, by proposing a cooperative learning while recovering strategy. Lastly, we show that the learned energy function can serve as a refinement module that can refine the results of other pre-trained saliency prediction models. Experimental results show that our generative model can achieve state-of-the-art performance. Our code is publicly available at: \url{https://github.com/JingZhang617/SalCoopNets}.
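The fine predictor is plain Langevin dynamics on the saliency map under the learned conditional energy. A minimal sketch, where the energy network, step size, and number of steps are all assumptions:

```python
import torch

def langevin_refine(energy_net, image, s_init, steps=15, step_size=0.01):
    """Iterative Langevin revision of a coarse saliency map (sketch).
    energy_net(image, s) -> per-sample scalar energy; lower = better."""
    s = s_init.detach().clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_net(image, s).sum()
        grad, = torch.autograd.grad(energy, s)
        with torch.no_grad():
            # s <- s - (delta^2 / 2) * grad E + delta * noise
            s = s - 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(s)
            s.clamp_(0.0, 1.0)                  # keep saliency values in [0, 1]
        s.requires_grad_(True)
    return s.detach()
```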

【5】 Prior Image-Constrained Reconstruction using Style-Based Generative Models

Authors: Varun A. Kelkar, Mark A. Anastasio
Affiliations: University of Illinois at Urbana-Champaign
Note: Accepted for publication at the International Conference on Machine Learning (ICML) 2021
Link: https://arxiv.org/abs/2102.12525
Abstract: Obtaining a useful estimate of an object from highly incomplete imaging measurements remains a holy grail of imaging science. Deep learning methods have shown promise in learning object priors or constraints to improve the conditioning of an ill-posed imaging inverse problem. In this study, a framework for estimating an object of interest that is semantically related to a known prior image, is proposed. An optimization problem is formulated in the disentangled latent space of a style-based generative model, and semantically meaningful constraints are imposed using the disentangled latent representation of the prior image. Stable recovery from incomplete measurements with the help of a prior image is theoretically analyzed. Numerical experiments demonstrating the superior performance of our approach as compared to related methods are presented.
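In spirit, the reconstruction is an optimization over the generator's latent space anchored to the prior image's latent code. The sketch below is a simplified stand-in (a quadratic anchor instead of the paper's constraints on the disentangled style-space variables; G, forward_op, and all hyperparameters are assumptions):

```python
import torch

def reconstruct(G, forward_op, y, w_prior, steps=500, lam=0.1, lr=0.05):
    """Estimate an object from incomplete measurements y by optimizing a
    latent code w of a pre-trained generator G, anchored to the latent code
    w_prior of a semantically related prior image. forward_op is the known
    imaging/measurement operator."""
    w = w_prior.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        data_fit = (forward_op(G(w)) - y).pow(2).mean()          # measurement fit
        anchor = lam * (w - w_prior.detach()).pow(2).mean()      # stay near prior image
        (data_fit + anchor).backward()
        opt.step()
    with torch.no_grad():
        return G(w), w
```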

Attention (2 papers)

【1】 Graph Pattern Loss based Diversified Attention Network for Cross-Modal Retrieval

Authors: Xueying Chen, Rong Zhang, Yibing Zhan
Affiliations: Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China; Hangzhou Dianzi University
Link: https://arxiv.org/abs/2106.13552
Abstract: Cross-modal retrieval aims to enable a flexible retrieval experience by combining multimedia data such as image, video, text, and audio. A core challenge for unsupervised approaches is to mine the correlations among different object representations so as to achieve satisfactory retrieval performance without requiring expensive labels. In this paper, we propose a Graph Pattern Loss based Diversified Attention Network (GPLDAN) for unsupervised cross-modal retrieval to deeply analyze correlations among representations. First, we propose a diversified attention feature projector that considers the interaction between different representations to generate multiple representations of an instance. Then, we design a novel graph pattern loss to explore the correlations among different representations; in this graph, all possible distances between different representations are considered. In addition, a modality classifier is added to explicitly declare the corresponding modalities of features before fusion and to guide the network to enhance its discrimination ability. We test GPLDAN on four public datasets. Compared with state-of-the-art cross-modal retrieval methods, the experimental results demonstrate the performance and competitiveness of GPLDAN.
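The abstract leaves the exact loss to the paper, so the following is only an illustrative stand-in: for the K diversified representations of one instance, it forms all pairwise distances (the "graph" over a sample's representations) and balances an alignment term against a margin-based diversity term; the 0.5 margin and the term weighting are assumptions:

```python
import torch

def graph_pattern_loss(reps, margin=0.5):
    """reps: (B, K, D) -- K diversified representations per instance.
    Illustrative loss over all pairwise distances among an instance's
    representations: pull them toward agreement while keeping them
    from collapsing onto each other."""
    dists = torch.cdist(reps, reps)                    # (B, K, K) pairwise L2
    K = reps.shape[1]
    off_diag = dists[:, ~torch.eye(K, dtype=torch.bool)]  # drop self-distances
    pull = off_diag.mean()                             # align representations
    push = torch.relu(margin - off_diag).mean()        # ...but keep them diverse
    return pull + push
```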

【2】 Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

Authors: Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo
Affiliations: University of Science and Technology of China, Hefei, China; Sun Yat-sen University, Guangzhou, China; Microsoft Research, Beijing, China; University of Rochester, Rochester, NY
Link: https://arxiv.org/abs/2106.13488
Abstract: Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves for downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN, and then aligns images and text with a Transformer. Visual relationships between visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs have limitations in visual relation learning due to the local receptive field's weakness in modeling long-range dependencies. Thus the two objectives of learning visual relation and inter-modal alignment are encapsulated in the same Transformer network. Such design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristic of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relation and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in Transformer to further promote the inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Visual Question Answering (VQA), Visual Entailment and Visual Reasoning. Our approach not only outperforms the state-of-the-art VLP performance, but also shows benefits on the IMF metric.

Faces | Crowd Counting (1 paper)

【1】 Projection-wise Disentangling for Fair and Interpretable Representation Learning: Application to 3D Facial Shape Analysis

Authors: Xianjing Liu, Bo Li, Esther Bron, Wiro Niessen, Eppo Wolvius, Gennady Roshchupkin
Affiliations: Erasmus MC, Rotterdam, The Netherlands; Delft University of Technology, Delft, The Netherlands
Note: Accepted at MICCAI 2021
Link: https://arxiv.org/abs/2106.13734
Abstract: Confounding bias is a crucial problem when applying machine learning in practice, especially in clinical practice. We consider the problem of learning representations independent of multiple biases. In the literature, this is mostly solved by purging the bias information from the learned representations. We however expect this strategy to harm the diversity of information in the representation, and thus to limit its prospective usage (e.g., interpretation). Therefore, we propose to mitigate the bias while keeping almost all information in the latent representations, which enables us to observe and interpret them as well. To achieve this, we project latent features onto a learned vector direction, and enforce the independence between biases and projected features rather than all learned features. To interpret the mapping between projected features and input data, we propose projection-wise disentangling: a sampling and reconstruction along the learned vector direction. The proposed method was evaluated on the analysis of 3D facial shape and patient characteristics (N=5011). Experiments showed that this conceptually simple method achieved state-of-the-art fair prediction performance and interpretability, showing its great potential for clinical applications.
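The central mechanism, projecting latent features onto a learned direction and enforcing independence only between that projection and the biases, fits in a few lines. In this sketch a squared Pearson correlation stands in for the paper's independence penalty (an assumption), and the bias is a single scalar per sample:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Project latent features onto a single learned direction (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.v = nn.Parameter(torch.randn(dim))

    def forward(self, z):                      # z: (B, D) latent features
        v = F.normalize(self.v, dim=0)         # unit-norm direction
        return z @ v                           # (B,) projected feature

def correlation_penalty(proj, bias, eps=1e-8):
    """proj, bias: (B,). Squared Pearson correlation as an independence
    proxy -- only the projection, not the full latent, is decorrelated."""
    p = proj - proj.mean()
    b = bias - bias.mean()
    corr = (p * b).mean() / (p.std() * b.std() + eps)
    return corr ** 2
```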

Tracking (1 paper)

【1】 Bayesian Eye Tracking

Authors: Qiang Ji, Kang Wang
Affiliations: Rensselaer Polytechnic Institute
Link: https://arxiv.org/abs/2106.13387
Abstract: Model-based eye tracking has been a dominant approach for eye gaze tracking because of its ability to generalize to different subjects, without the need for any training data and eye gaze annotations. Model-based eye tracking, however, is susceptible to eye feature detection errors, in particular for eye tracking in the wild. To address this issue, we propose a Bayesian framework for model-based eye tracking. The proposed system consists of a cascade-Bayesian Convolutional Neural Network (c-BCNN) to capture the probabilistic relationships between eye appearance and its landmarks, and a geometric eye model to estimate eye gaze from the eye landmarks. Given a testing eye image, the Bayesian framework can generate, through Bayesian inference, the eye gaze distribution without explicit landmark detection and model training, based on which it not only estimates the most likely eye gaze but also its uncertainty. Furthermore, with Bayesian inference instead of point-based inference, our model can not only generalize better to different subjects, head poses, and environments, but is also robust to image noise and landmark detection errors. Finally, with the estimated gaze uncertainty, we can construct a cascade architecture that allows us to progressively improve gaze estimation accuracy. Compared to state-of-the-art model-based and learning-based methods, the proposed Bayesian framework demonstrates significant improvement in generalization capability across several benchmark datasets and in accuracy and robustness under challenging real-world conditions.

Visual Explanation | Video Understanding | VQA | Captioning (1 paper)

【1】 A Picture May Be Worth a Hundred Words for Visual Question Answering

Authors: Yusuke Hirota, Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima, Ittetsu Taniguchi, Takao Onoye
Affiliations: Osaka University, Osaka, Japan; CyberAgent, Inc., Tokyo, Japan; Kyoto University, Kyoto, Japan
Link: https://arxiv.org/abs/2106.13445
Abstract: How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations. Deep visual features extracted by vision models, such as Faster R-CNN, are widely used in multiple tasks, especially in visual question answering (VQA). However, conventional deep visual features may struggle to convey all the details in an image as we humans do. Meanwhile, given recent progress in language models, descriptive text may be an alternative. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA. We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model, simplifying the process and the computational cost. We also experiment with data augmentation techniques to increase the diversity in the training set and avoid learning statistical bias. Extensive evaluations have shown that textual representations require only about a hundred words to compete with deep visual features on both VQA 2.0 and VQA-CP v2.
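The pipeline itself is minimal: encode the description and the question as a sentence pair and classify over a fixed answer vocabulary with a language-only Transformer. A sketch using Hugging Face transformers; the BERT checkpoint and the 3,129-answer vocabulary are common VQA conventions assumed here, not necessarily the paper's exact setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Treat VQA as answer classification over description-question pairs.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3129)   # answer-vocabulary size (assumed)

description = "A brown dog is catching a red frisbee on a grassy field."
question = "What color is the frisbee?"

# The tokenizer joins the pair with [SEP]; truncation keeps it within limits.
inputs = tok(description, question, truncation=True, return_tensors="pt")
with torch.no_grad():
    answer_id = model(**inputs).logits.argmax(dim=-1).item()  # answer index
```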

Point Clouds | SLAM | Radar | LiDAR | Depth/RGB-D (1 paper)

【1】 "Zero Shot" Point Cloud Upsampling

Authors: Kaiyue Zhou, Ming Dong, Suzan Arslanturk
Affiliations: Department of Computer Science, Wayne State University
Link: https://arxiv.org/abs/2106.13765
Abstract: Point cloud upsampling using deep learning has received considerable attention in the past few years. Recent supervised deep learning methods are restricted by the size of the training data and are limited in terms of covering all shapes of point clouds. Besides, acquiring such an amount of data is unrealistic, and the networks generally perform worse than expected on unseen records. In this paper, we present an unsupervised approach to upsample point clouds internally, referred to as "Zero Shot" Point Cloud Upsampling (ZSPU), at the holistic level. Our approach is solely based on the internal information provided by a particular point cloud, without patching, in both the self-training and testing phases. This single-stream design significantly reduces the training time of the upsampling task by learning the relation between low-resolution (LR) point clouds and their high (original) resolution (HR) counterparts. This association provides super-resolution (SR) outputs when original point clouds are loaded as input. We demonstrate competitive performance on benchmark point cloud datasets when compared to other upsampling methods. Furthermore, ZSPU achieves superior qualitative results on shapes with complex local details or high curvatures.
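The "zero shot" idea, self-supervising on the single input cloud by pairing random downsamplings with the original, can be sketched as follows. The upsampler network, the downsampling scheme, and the hyperparameters are assumptions; only a plain Chamfer loss is spelled out:

```python
import torch

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a: (M, 3), b: (N, 3)."""
    d = torch.cdist(a, b)                                # (M, N) pairwise L2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def random_downsample(points, ratio=4):
    """Toy LR surrogate: keep a random 1/ratio subset of the cloud."""
    keep = torch.randperm(points.shape[0])[: points.shape[0] // ratio]
    return points[keep]

def zero_shot_train(upsampler, points, steps=2000, lr=1e-3):
    """Self-train on one cloud: map its downsampled versions back toward the
    original resolution, then super-resolve the original cloud itself."""
    opt = torch.optim.Adam(upsampler.parameters(), lr=lr)
    for _ in range(steps):
        pred_hr = upsampler(random_downsample(points))   # upsampled guess
        loss = chamfer_distance(pred_hr, points)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return upsampler(points)                         # SR output at test time
```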

Other Neural Networks | Deep Learning | Models | Modeling (2 papers)

【1】 Federated Noisy Client Learning

Authors: Li Li, Huazhu Fu, Bo Han, Cheng-Zhong Xu, Ling Shao
Affiliations: Shenzhen Institutes of Advanced Technology, CAS; Inception Institute of Artificial Intelligence, UAE; Department of Computer Science, Hong Kong Baptist University; University of Macau
Link: https://arxiv.org/abs/2106.13239
Abstract: Federated learning (FL) collaboratively aggregates a shared global model depending on multiple local clients, while keeping the training data decentralized in order to preserve data privacy. However, standard FL methods ignore the noisy client issue, which may harm the overall performance of the aggregated model. In this paper, we first analyze the noisy client setting, and then model noisy clients with different noise distributions (e.g., Bernoulli and truncated Gaussian distributions). To learn with noisy clients, we propose a simple yet effective FL framework, named Federated Noisy Client Learning (Fed-NCL), which is a plug-and-play algorithm and contains two main components: a data quality measurement (DQM) to dynamically quantify the data quality of each participating client, and a noise robust aggregation (NRA) to adaptively aggregate the local models of each client by jointly considering the amount of local training data and the data quality of each client. Our Fed-NCL can be easily applied in any standard FL workflow to handle the noisy client issue. Experimental results on various datasets demonstrate that our algorithm boosts the performance of different state-of-the-art systems with noisy clients.
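The NRA step reduces to a weighted federated average whose weights combine local data size with the DQM quality score. A sketch under our own assumptions (float-valued parameters, and the quality scores are taken as given rather than computed by the paper's DQM):

```python
import copy
import torch

def noise_robust_aggregate(local_states, n_samples, quality):
    """Aggregate client models with weights proportional to both local data
    size and estimated data quality (sketch of the NRA idea).
    local_states: list of model state_dicts from the clients;
    n_samples, quality: per-client sample counts and quality scores."""
    w = torch.tensor([n * q for n, q in zip(n_samples, quality)],
                     dtype=torch.float)
    w = w / w.sum()                                       # normalized weights
    global_state = copy.deepcopy(local_states[0])
    for key in global_state:
        global_state[key] = sum(wi * sd[key] for wi, sd in zip(w, local_states))
    return global_state
```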

【2】 Circumpapillary OCT-Focused Hybrid Learning for Glaucoma Grading Using Tailored Prototypical Neural Networks

Authors: Gabriel García, Rocío del Amor, Adrián Colomer, Rafael Verdú-Monedero, Juan Morales-Sánchez, Valery Naranjo
Affiliations: Universitat Politècnica de València; Universidad Politécnica de Cartagena
Link: https://arxiv.org/abs/2106.13551
Abstract: Glaucoma is one of the leading causes of blindness worldwide and Optical Coherence Tomography (OCT) is the quintessential imaging technique for its detection. Unlike most of the state-of-the-art studies focused on glaucoma detection, in this paper, we propose, for the first time, a novel framework for glaucoma grading using raw circumpapillary B-scans. In particular, we set out a new OCT-based hybrid network which combines hand-driven and deep learning algorithms. An OCT-specific descriptor is proposed to extract hand-crafted features related to the retinal nerve fibre layer (RNFL). In parallel, an innovative CNN is developed using skip-connections to include tailored residual and attention modules to refine the automatic features of the latent space. The proposed architecture is used as a backbone to conduct a novel few-shot learning based on static and dynamic prototypical networks. The k-shot paradigm is redefined, giving rise to a supervised end-to-end system which provides substantial improvements discriminating between healthy, early, and advanced glaucoma samples. The training and evaluation processes of the dynamic prototypical network are addressed from two fused databases acquired via the Heidelberg Spectralis system. Validation and testing results reach a categorical accuracy of 0.9459 and 0.8788 for glaucoma grading, respectively. In addition, the high performance reported by the proposed model for glaucoma detection deserves special mention. The findings from the class activation maps are directly in line with the clinicians' opinion, since the heatmaps point out the RNFL as the most relevant structure for glaucoma diagnosis.
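The prototypical part of the pipeline is easy to make concrete. A static-prototype sketch, where the embedding backbone (the paper's hybrid network) is treated as a given embed_net and the three classes follow the healthy/early/advanced setting:

```python
import torch

@torch.no_grad()
def prototypical_predict(embed_net, support_x, support_y, query_x, n_classes=3):
    """k-shot prototypical classification (sketch): class prototypes are the
    mean embeddings of the labeled support B-scans, and each query is
    assigned to its nearest prototype."""
    s = embed_net(support_x)                               # (Ns, D) embeddings
    q = embed_net(query_x)                                 # (Nq, D)
    protos = torch.stack([s[support_y == c].mean(dim=0)    # one prototype
                          for c in range(n_classes)])      # per class: (C, D)
    logits = -torch.cdist(q, protos)                       # nearer -> higher score
    return logits.argmax(dim=1)                            # predicted grade
```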

Other (9 papers)

【1】 Single Image Texture Translation for Data Augmentation

Authors: Boyi Li, Yin Cui, Tsung-Yi Lin, Serge Belongie
Affiliations: Cornell University, Cornell Tech; Google Research, Brain Team
Link: https://arxiv.org/abs/2106.13804
Abstract: Recent advances in image synthesis enable one to translate images by learning the mapping between a source domain and a target domain. Existing methods tend to learn the distributions by training a model on a variety of datasets, with results evaluated largely in a subjective manner. Relatively few works in this area, however, study the potential use of semantic image translation methods for image recognition tasks. In this paper, we explore the use of Single Image Texture Translation (SITT) for data augmentation. We first propose a lightweight model for translating texture to images based on a single input of source texture, allowing for fast training and testing. Based on SITT, we then explore the use of augmented data in long-tailed and few-shot image classification tasks. We find the proposed method is capable of translating input data into a target domain, leading to consistently improved image recognition performance. Finally, we examine how SITT and related image translation methods can provide a basis for a data-efficient, augmentation engineering approach to model training.

【2】 Interactive Multi-level Stroke Control for Neural Style Transfer

Authors: Max Reimann, Benito Buchheim, Amir Semmo, Jürgen Döllner, Matthias Trapp
Affiliations: Hasso Plattner Institute, University of Potsdam, Germany; Digital Masterpieces GmbH
Note: International Conference on CyberWorlds (CW 2021)
Link: https://arxiv.org/abs/2106.13787
Abstract: We present StyleTune, a mobile app for interactive multi-level control of neural style transfers that facilitates creative adjustments of style elements and enables high output fidelity. In contrast to current mobile neural style transfer apps, StyleTune supports users to adjust both the size and orientation of style elements, such as brushstrokes and texture patches, on a global as well as local level. To this end, we propose a novel stroke-adaptive feed-forward style transfer network that enables control over stroke size and intensity and allows a larger range of edits than current approaches. For an additional level of control, we propose a network-agnostic method for stroke-orientation adjustment by utilizing the rotation-variance of CNNs. To achieve high output fidelity, we further add a patch-based style transfer method that enables users to obtain output resolutions of more than 20 Megapixels. Our approach empowers users to create many novel results that are not possible with current mobile neural style transfer apps.

【3】 Re-parameterizing VAEs for Stability

Authors: David Dehaene, Rémy Brossard
Link: https://arxiv.org/abs/2106.13739
Abstract: We propose a theoretical approach to the numerical stability of Variational AutoEncoder (VAE) training. Our work is motivated by recent studies empowering VAEs to reach state-of-the-art generative results on complex image datasets. These very deep VAE architectures, as well as VAEs using more complex output distributions, highlight a tendency to haphazardly produce high training gradients as well as NaN losses. The empirical fixes proposed to train them despite their limitations are neither fully theoretically grounded nor generally sufficient in practice. Building on this, we localize the source of the problem at the interface between the model's neural networks and their output probabilistic distributions. We explain a common source of instability stemming from an incautious formulation of the encoded Normal distribution's variance, and apply the same approach to other, less obvious sources. We show that by implementing small changes to the way we parameterize the Normal distributions on which they rely, VAEs can be trained securely.
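One instance of the kind of change the paper advocates, sketched under our own assumptions (the softplus choice and the specific bounds are illustrative, not the paper's exact formulation): bound the encoded scale away from zero and from above instead of exponentiating an unconstrained log-variance, whose exp can overflow and blow up the KL term:

```python
import torch
import torch.nn.functional as F

def stable_scale(raw, min_std=1e-3, max_std=5.0):
    """Map an unconstrained network output to a bounded standard deviation.
    Replaces std = exp(0.5 * logvar), which explodes for large raw outputs
    and collapses to 0 for very negative ones (NaN/huge-gradient territory)."""
    return (F.softplus(raw) + min_std).clamp(max=max_std)

def gaussian_kl(mu, raw_scale):
    """KL(q(z|x) || N(0, I)) with the bounded scale parameterization:
    0.5 * sum(mu^2 + std^2 - 2*log(std) - 1)."""
    std = stable_scale(raw_scale)
    return 0.5 * (mu.pow(2) + std.pow(2) - 2 * std.log() - 1).sum(dim=-1)
```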

【4】 Image-to-image Transformation with Auxiliary Condition

Authors: Robert Leer, Hessi Roma, James Amelia
Link: https://arxiv.org/abs/2106.13696
Abstract: The performance of image recognition systems, such as human pose detection, trained with simulated images usually degrades due to the divergence between real and simulated data. To make the distribution of a simulated image close to that of a real one, several works apply GAN-based image-to-image transformation methods, e.g., SimGAN and CycleGAN. However, these methods are not sensitive enough to the various changes in pose and shape of subjects, especially when the training data are imbalanced, e.g., when some particular poses and shapes are a minority in the training data. To overcome this problem, we propose to introduce the label information of subjects, e.g., pose and type of objects, into the training of CycleGAN, leading it to obtain label-wise transformation models. We evaluate our proposed method, called Label-CycleGAN, through experiments on digit image transformation from SVHN to MNIST and on surveillance camera image transformation from simulated to real images.

【5】 Connecting Sphere Manifolds Hierarchically for Regularization

Authors: Damien Scieur, Youngsung Kim
Affiliations: Samsung SAIT AI Lab, Montreal; Samsung Advanced Institute of Technology (SAIT)
Link: https://arxiv.org/abs/2106.13549
Abstract: This paper considers classification problems with hierarchically organized classes. We force the classifier (hyperplane) of each class to belong to a sphere manifold, whose center is the classifier of its super-class. Then, individual sphere manifolds are connected based on their hierarchical relations. Our technique replaces the last layer of a neural network by combining a spherical fully-connected layer with a hierarchical layer. This regularization is shown to improve the performance of widely used deep neural network architectures (ResNet and DenseNet) on publicly available datasets (CIFAR100, CUB200, Stanford Dogs, Stanford Cars, and Tiny-ImageNet).
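A sketch of the last-layer construction under toy assumptions (the radius, dimensions, and parent map are illustrative): each super-class classifier is normalized onto the unit sphere, and each class classifier is placed on a small sphere around its super-class center:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSphereLayer(nn.Module):
    """Last layer (sketch): class classifiers live on small spheres centered
    at their super-class classifiers, tying siblings together."""
    def __init__(self, dim, n_super, n_classes, parent_of, radius=0.3):
        super().__init__()
        self.super_w = nn.Parameter(torch.randn(n_super, dim))
        self.offset = nn.Parameter(torch.randn(n_classes, dim))
        self.register_buffer("parent", torch.tensor(parent_of))
        self.radius = radius

    def forward(self, feats):                                  # feats: (B, D)
        centers = F.normalize(self.super_w, dim=1)[self.parent]  # (C, D)
        w = centers + self.radius * F.normalize(self.offset, dim=1)
        return feats @ w.t()                                   # (B, C) logits

# e.g. 10 classes grouped under 4 super-classes
layer = HierarchicalSphereLayer(dim=64, n_super=4, n_classes=10,
                                parent_of=[0, 0, 1, 1, 1, 2, 2, 3, 3, 3])
logits = layer(torch.randn(8, 64))
```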

【6】 Building Intelligent Autonomous Navigation Agents

Authors: Devendra Singh Chaplot
Affiliations: Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. Thesis committee: Ruslan Salakhutdinov (Chair), Abhinav Gupta, Deva Ramanan, Jitendra Malik
Note: CMU Ph.D. Thesis, March 2021. For more details see this http URL
Link: https://arxiv.org/abs/2106.13415
Abstract: Breakthroughs in machine learning in the last decade have led to `digital intelligence', i.e. machine learning models capable of learning from vast amounts of labeled data to perform several digital tasks such as speech recognition, face recognition, machine translation and so on. The goal of this thesis is to make progress towards designing algorithms capable of `physical intelligence', i.e. building intelligent autonomous navigation agents capable of learning to perform complex navigation tasks in the physical world involving visual perception, natural language understanding, reasoning, planning, and sequential decision making. Despite several advances in classical navigation methods in the last few decades, current navigation agents struggle at long-term semantic navigation tasks. In the first part of the thesis, we discuss our work on short-term navigation using end-to-end reinforcement learning to tackle challenges such as obstacle avoidance, semantic perception, language grounding, and reasoning. In the second part, we present a new class of navigation methods based on modular learning and structured explicit map representations, which leverage the strengths of both classical and end-to-end learning methods, to tackle long-term navigation tasks. We show that these methods are able to effectively tackle challenges such as localization, mapping, long-term planning, exploration and learning semantic priors. These modular learning methods are capable of long-term spatial and semantic understanding and achieve state-of-the-art results on various navigation tasks.

【7】 CausalCity: Complex Simulations with Agency for Causal Discovery and Reasoning

Authors: Daniel McDuff, Yale Song, Jiyoung Lee, Vibhav Vineet, Sai Vemprala, Nicholas Gyde, Hadi Salman, Shuang Ma, Kwanghoon Sohn, Ashish Kapoor
Affiliations: Microsoft, Redmond, USA; Yonsei University, South Korea; MIT, Cambridge, USA
Link: https://arxiv.org/abs/2106.13364
Abstract: The ability to perform causal and counterfactual reasoning is a central property of human intelligence. Decision-making systems that can perform these types of reasoning have the potential to be more generalizable and interpretable. Simulations have helped advance the state-of-the-art in this domain by providing the ability to systematically vary parameters (e.g., confounders) and generate examples of the outcomes in the case of counterfactual scenarios. However, simulating complex temporal causal events in multi-agent scenarios, such as those that exist in driving and vehicle navigation, is challenging. To help address this, we present a high-fidelity simulation environment that is designed for developing algorithms for causal discovery and counterfactual reasoning in the safety-critical context. A core component of our work is to introduce \textit{agency}, such that it is simple to define and create complex scenarios using high-level definitions. The vehicles then operate with agency to complete these objectives, meaning low-level behaviors need only be controlled if necessary. We perform experiments with three state-of-the-art methods to create baselines and highlight the affordances of this environment. Finally, we highlight challenges and opportunities for future work.

【8】 Physics Perception in Sloshing Scenes with Guaranteed Thermodynamic Consistency

Authors: Beatriz Moya, Alberto Badias, David Gonzalez, Francisco Chinesta, Elias Cueto
Affiliations: Aragon Institute of Engineering Research, University of Zaragoza, Zaragoza, Spain; ESI Group chair, PIMM Lab, ENSAM Institute of Technology, Paris, France
Note: 20 pages, 11 figures
Link: https://arxiv.org/abs/2106.13301
Abstract: Physics perception very often faces the problem that only limited data or partial measurements of the scene are available. In this work, we propose a strategy to learn the full state of sloshing liquids from measurements of the free surface. Our approach is based on recurrent neural networks (RNN) that project the limited information available to a reduced-order manifold so as to not only reconstruct the unknown information, but also to be capable of performing fluid reasoning about future scenarios in real time. To obtain physically consistent predictions, we train deep neural networks on the reduced-order manifold that, through the use of inductive biases, ensure the fulfillment of the principles of thermodynamics. RNNs learn from history the required hidden information to correlate the limited information with the latent space where the simulation occurs. Finally, a decoder returns data back to the high-dimensional manifold, so as to provide the user with insightful information in the form of augmented reality. This algorithm is connected to a computer vision system to test the performance of the proposed methodology with real information, resulting in a system capable of understanding and predicting future states of the observed fluid in real time.

【9】 Free-viewpoint Indoor Neural Relighting from Multi-view Stereo

Authors: Julien Philip, Sébastien Morgenthaler, Michaël Gharbi, George Drettakis
Affiliations: Université Côte d'Azur; Adobe Research
Link: https://arxiv.org/abs/2106.13299
Abstract: We introduce a neural relighting algorithm for captured indoor scenes that allows interactive free-viewpoint navigation. Our method allows illumination to be changed synthetically, while coherently rendering cast shadows and complex glossy materials. We start with multiple images of the scene and a 3D mesh obtained by multi-view stereo (MVS) reconstruction. We assume that lighting is well-explained as the sum of a view-independent diffuse component and a view-dependent glossy term concentrated around the mirror reflection direction. We design a convolutional network around input feature maps that facilitate learning of an implicit representation of scene materials and illumination, enabling both relighting and free-viewpoint navigation. We generate these input maps by exploiting the best elements of both image-based and physically-based rendering. We sample the input views to estimate diffuse scene irradiance, and compute the new illumination caused by user-specified light sources using path tracing. To facilitate the network's understanding of materials and synthesize plausible glossy reflections, we reproject the views and compute mirror images. We train the network on a synthetic dataset where each scene is also reconstructed with MVS. We show results of our algorithm relighting real indoor scenes and performing free-viewpoint navigation with complex and realistic glossy reflections, which so far remained out of reach for view-synthesis techniques.

Shared from the WeChat public account arXiv每日学术速递 via the Tencent Cloud self-media sharing program. Originally published 2021-06-28. For takedown requests, contact cloudcommunity@tencent.com.
