Computer Vision Academic Digest [6.21]

WeChat official account: arXiv Daily Academic Digest

Published 2021-07-02 17:43:45

This article is included in the column: arXiv Daily Academic Digest

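Visit www.arxivdaily.com for the digest with abstracts, covering CS | Physics | Mathematics | Economics | Statistics | Finance | Biology | Electrical Engineering, with search, bookmarking, and posting features.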

cs.CV: 62 papers today

Transformer (4 papers)

【1】 End-to-end Temporal Action Detection with Transformer

Authors: Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Song Bai, Xiang Bai
Affiliations: Huazhong University of Science and Technology, Alibaba Group
Link: https://arxiv.org/abs/2106.10271
Abstract: Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video. It is a fundamental task in video understanding and significant progress has been made in TAD. Previous methods involve multiple stages or networks and hand-designed rules or operations, which fall short in efficiency and flexibility. Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR, which simultaneously predicts all action instances as a set of labels and temporal locations in parallel. TadTR is able to adaptively extract the temporal context information needed for making action predictions, by selectively attending to a number of snippets in a video. It greatly simplifies the pipeline of TAD and runs much faster than previous detectors. Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3. Our code will be made available at https://github.com/xlliu7/TadTR.

【2】 How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Authors: Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, Lucas Beyer
Affiliations: Google Research, Brain Team; †independent researcher
Notes: Andreas, Alex, Xiaohua and Lucas contributed equally. We release more than 50'000 ViT models trained under diverse settings on various datasets. We believe this to be a treasure trove for model analysis. Available at this https URL and this https URL
Link: https://arxiv.org/abs/2106.10270
Abstract: Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available, JFT-300M dataset.

【3】 All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

Authors: Carmelo Scribano, Davide Sapienza, Giorgia Franchini, Micaela Verucchi, Marko Bertogna
Affiliations: University of Modena and Reggio Emilia; University of Ferrara; University of Parma
Notes: CVPR 2021 AI CITY CHALLENGE Natural Language-Based Vehicle Retrieval
Link: https://arxiv.org/abs/2106.10153
Abstract: Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, and (ii) a convolutional backbone along with a Transformer model to embed the visual information. For the training of the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.
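
The AYCE abstract above mentions a variation of the Triplet Margin Loss without giving its form. As a point of reference only, the sketch below shows the standard triplet margin loss with Euclidean distance in plain Python. It is not the authors' variation; the function names, the margin value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (not the paper's variation).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy example: a language embedding as anchor, with a matching and a
# non-matching visual embedding (values are made up).
text_embedding = [0.0, 0.0]
matching_track = [0.0, 1.0]
far_track = [3.0, 4.0]
near_track = [1.0, 0.0]

loss_easy = triplet_margin_loss(text_embedding, matching_track, far_track)
loss_hard = triplet_margin_loss(text_embedding, matching_track, near_track)
```

With the far negative the margin is already satisfied and the loss vanishes; with the near negative the loss stays positive and would push the two embeddings apart during training.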

【4】 Efficient Self-supervised Vision Transformers for Representation Learning

Authors: Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao
Affiliations: Microsoft Research at Redmond, Microsoft Cloud + AI
Notes: 24 pages, 12 figures, file size 13.6MB
Link: https://arxiv.org/abs/2106.09785
Abstract: This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies and as a result significantly improves the quality of the learned vision representations. Our results show that, combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models will be publicly available.

Detection (4 papers)

【1】 Bridging the Gap Between Object Detection and User Intent via Query-Modulation

Authors: Marco Fornoni, Chaochao Yan, Liangchen Luo, Kimberly Wilber, Alex Stark, Yin Cui, Boqing Gong, Andrew Howard
Affiliations: Google Research, University of Texas at Arlington
Link: https://arxiv.org/abs/2106.10258
Abstract: When interacting with objects through cameras or pictures, users often have a specific intent. For example, they may want to perform a visual search. However, most object detection models ignore the user intent, relying on image pixels as their only input. This often leads to incorrect results, such as the lack of a high-confidence detection on the object of interest, or detection with a wrong class label. In this paper we investigate techniques to modulate standard object detectors to explicitly account for the user intent, expressed as an embedding of a simple query. Compared to standard object detectors, query-modulated detectors show superior performance at detecting objects for a given label of interest. Thanks to large-scale training data synthesized from standard object detection annotations, query-modulated detectors can also outperform specialized referring expression recognition systems. Furthermore, they can be simultaneously trained to solve for both query-modulated detection and standard object detection.

【2】 Toward Fault Detection in Industrial Welding Processes with Deep Learning and Data Augmentation

Authors: Jibinraj Antony, Dr. Florian Schlather, Georgij Safronov, Markus Schmitz, Prof. Dr. Kristof Van Laerhoven
Affiliations: University of Siegen
Link: https://arxiv.org/abs/2106.10160
Abstract: With the rise of deep learning models in the field of computer vision, new possibilities for their application in industrial processes prove to return great benefits. Nevertheless, the actual fit of machine learning for highly standardised industrial processes is still under debate. This paper addresses the challenges in the industrial realization of AI tools, considering the use case of Laser Beam Welding quality control as an example. We use object detection algorithms from the TensorFlow object detection API and adapt them to our use case using transfer learning. The baseline models we develop are used as benchmarks and evaluated and compared to models that undergo dataset scaling and hyperparameter tuning. We find that moderate scaling of the dataset via image augmentation leads to improvements in intersection over union (IoU) and recall, whereas high levels of augmentation and scaling may lead to deterioration of results. Finally, we put our results into the perspective of the underlying use case and evaluate their fit.
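
The welding abstract above reports its results as intersection over union (IoU) and recall. The sketch below shows how these two metrics are commonly computed for axis-aligned boxes. The box layout `(x_min, y_min, x_max, y_max)`, the 0.5 threshold, and the greedy one-to-one matching are assumptions for illustration, not details taken from the paper or from the TensorFlow object detection API.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(ground_truth, predictions, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by some prediction.

    Each prediction can match at most one ground-truth box (greedy).
    """
    if not ground_truth:
        return 0.0
    unused = list(predictions)
    matched = 0
    for gt in ground_truth:
        for pred in unused:
            if box_iou(gt, pred) >= iou_threshold:
                unused.remove(pred)
                matched += 1
                break
    return matched / len(ground_truth)


# Two hypothetical weld defects, only one of which is detected.
defects = [(0, 0, 2, 2), (10, 10, 12, 12)]
detections = [(0, 0, 2, 2)]
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
detected_fraction = recall(defects, detections)
```

Here the two partially overlapping boxes share one unit of area out of seven, and half of the defects are recovered.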

【3】 Shape Prior Non-Uniform Sampling Guided Real-time Stereo 3D Object Detection

Authors: A. Gao, J. Cao, Y. Pang
Affiliations: of Brain-Inspired Intelligence Technology, School of Electrical and Information Engineering, Tianjin University, Tianjin, China
Notes: 9 pages, 7 figures
Link: https://arxiv.org/abs/2106.10013
Abstract: Pseudo-LiDAR based 3D object detectors have gained popularity due to their high accuracy. However, these methods need dense depth supervision and suffer from inferior speed. To solve these two issues, the recently introduced RTS3D builds an efficient 4D Feature-Consistency Embedding (FCE) space for the intermediate representation of objects without depth supervision. The FCE space splits the entire object region into a 3D uniform grid latent space for feature sampling point generation, which ignores the importance of different object regions. However, we argue that, compared with the inner region, the outer region plays a more important role for accurate 3D detection. To encode more information from the outer region, we propose a shape prior non-uniform sampling strategy that performs dense sampling in the outer region and sparse sampling in the inner region. As a result, more points are sampled from the outer region and more useful features are extracted for 3D detection. Further, to enhance the feature discrimination of each sampling point, we propose a high-level semantic enhanced FCE module to exploit more contextual information and suppress noise better. Experiments on the KITTI dataset are performed to show the effectiveness of the proposed method. Compared with the baseline RTS3D, our proposed method has a 2.57% improvement on AP3d with almost no extra network parameters. Moreover, our proposed method outperforms the state-of-the-art methods without extra supervision at a real-time speed.

【4】 Novelty Detection via Contrastive Learning with Negative Data Augmentation

Authors: Chengwei Chen, Yuan Xie, Shaohui Lin, Ruizhi Qiao, Jian Zhou, Xin Tan, Yi Zhang, Lizhuang Ma
Affiliations: East China Normal University, Tencent Youtu Lab, Shanghai Jiao Tong University, Zhejiang Lab
Link: https://arxiv.org/abs/2106.09958
Abstract: Novelty detection is the process of determining whether a query example differs from the learned training distribution. Previous methods attempt to learn the representation of the normal samples via generative adversarial networks (GANs). However, they suffer from unstable training, mode dropping, and low discriminative ability. Recently, various pretext tasks (e.g. rotation prediction and clustering) have been proposed for self-supervised learning in novelty detection. However, the learned latent features are still weakly discriminative. We overcome such problems by introducing a novel decoder-encoder framework. Firstly, a generative network (a.k.a. decoder) learns the representation by mapping the initialized latent vector to an image. In particular, this vector is initialized by considering the entire distribution of training data to avoid the problem of mode dropping. Secondly, a contrastive network (a.k.a. encoder) aims to "learn to compare" through mutual information estimation, which directly helps the generative network to obtain a more discriminative representation by using a negative data augmentation strategy. Extensive experiments show that our model has significant superiority over cutting-edge novelty detectors and achieves new state-of-the-art results on some novelty detection benchmarks, e.g. CIFAR10 and DCASE. Moreover, our model is more stable for training in a non-adversarial manner, compared to other adversarial based novelty detection methods.

Classification | Recognition (3 papers)

【1】 hSMAL: Detailed Horse Shape and Pose Reconstruction for Motion Pattern Recognition

Authors: Ci Li, Nima Ghorbani, Sofia Broomé, Maheen Rashid, Michael J. Black, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi
Affiliations: Silo AI, Sweden
Notes: CV4Animals Workshop in CVPR 2021
Link: https://arxiv.org/abs/2106.10102
Abstract: In this paper we present our preliminary work on model-based behavioral analysis of horse motion. Our approach is based on the SMAL model, a 3D articulated statistical model of animal shape. We define a novel SMAL model for horses based on a new template, skeleton and shape space learned from 37 horse toys. We test the accuracy of our hSMAL model in reconstructing a horse from 3D mocap data and images. We apply the hSMAL model to the problem of lameness detection from video, where we fit the model to images to recover 3D pose and train an ST-GCN network on pose data. A comparison with the same network trained on mocap points illustrates the benefit of our approach.

【2】 Combined Person Classification with Airborne Optical Sectioning

Authors: Indrajit Kurmi, David C. Schedl, Oliver Bimber
Notes: 9 Pages, 7 Figures, 1 Table. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Link: https://arxiv.org/abs/2106.10077
Abstract: Fully autonomous drones have been demonstrated to find lost or injured persons under strongly occluding forest canopy. Airborne Optical Sectioning (AOS), a novel synthetic aperture imaging technique, together with deep-learning-based classification enables high detection rates under realistic search-and-rescue conditions. We demonstrate that false detections can be significantly suppressed and true detections boosted by combining classifications from multiple AOS integral images rather than a single one. This improves classification rates, especially in the presence of occlusion. To make this possible, we modified the AOS imaging process to support large overlaps between subsequent integrals, enabling real-time and on-board scanning and processing at ground speeds of up to 10 m/s.

【3】 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2021: Team M3EM Technical Report

Authors: Lijin Yang, Yifei Huang, Yusuke Sugano, Yoichi Sato
Affiliations: Institute of Industrial Science, the University of Tokyo, Tokyo, Japan
Link: https://arxiv.org/abs/2106.10026
Abstract: In this report, we describe the technical details of our submission to the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition. Leveraging multiple modalities has been proven to benefit the Unsupervised Domain Adaptation (UDA) task. In this work, we present the Multi-Modal Mutual Enhancement Module (M3EM), a deep module for jointly considering information from multiple modalities to find the most transferable representations across domains. We achieve this by implementing two sub-modules for enhancing each modality using the context of other modalities. The first sub-module exchanges information across modalities through the semantic space, while the second sub-module finds the most transferable spatial region based on the consensus of all modalities.

Segmentation | Semantics (8 papers)

【1】 A Coarse-to-Fine Instance Segmentation Network with Learning Boundary Representation

Authors: Feng Luo, Bin-Bin Gao, Jiangpeng Yan, Xiu Li
Affiliations: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Youtu Lab, Tencent, Shenzhen, China; Department of Automation, Tsinghua University, Beijing, China
Notes: 8 pages, Accepted by IJCNN 2021
Link: https://arxiv.org/abs/2106.10213
Abstract: Boundary-based instance segmentation has drawn much attention because of its attractive efficiency. However, existing methods suffer from the difficulty of long-distance regression. In this paper, we propose a coarse-to-fine module to address the problem. Approximate boundary points are generated at the coarse stage, and then features of these points are sampled and fed to a refined regressor for fine prediction. The module is end-to-end trainable since the differentiable sampling operation is well supported in it. Furthermore, we design a holistic boundary-aware branch and introduce instance-agnostic supervision to assist regression. Equipped with ResNet-101, our approach achieves 31.7% mask AP on the COCO dataset with single-scale training and testing, outperforming the baseline by 1.3% mask AP with less than 1% additional parameters and GFLOPs. Experiments also show that our proposed method achieves competitive performance compared to existing boundary-based methods with a lightweight design and a simple pipeline.

【2】 Virtual Temporal Samples for Recurrent Neural Networks: applied to semantic segmentation in agriculture

Authors: Alireza Ahmadi, Michael Halstead, Chris McCool
Affiliations: University of Bonn, Nussallee, Bonn, Germany
Link: https://arxiv.org/abs/2106.10118
Abstract: This paper explores the potential for performing temporal semantic segmentation in the context of agricultural robotics without temporally labelled data. We achieve this by proposing to generate virtual temporal samples from labelled still images. This allows us, with no extra annotation effort, to generate virtually labelled temporal sequences. Normally, to train a recurrent neural network (RNN), labelled samples from a video (temporal) sequence are required, which is laborious and has stymied work in this direction. By generating virtual temporal samples, we demonstrate that it is possible to train a lightweight RNN to perform semantic segmentation on two challenging agricultural datasets. Our results show that by training a temporal semantic segmenter using virtual samples we can increase the performance by an absolute amount of 4.6 and 4.9 on sweet pepper and sugar beet datasets, respectively. This indicates that our virtual data augmentation technique is able to accurately classify agricultural images temporally without the use of complicated synthetic data generation techniques or the overhead of labelling large amounts of temporal sequences.

【3】 HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping

Authors: Yuhan Wang, Xu Chen, Junwei Zhu, Wenqing Chu, Ying Tai, Chengjie Wang, Jilin Li, Yongjian Wu, Feiyue Huang, Rongrong Ji
Affiliations: Youtu Lab, Tencent; Zhejiang University; Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Institute of Artificial Intelligence, Xiamen University
Notes: Accepted to IJCAI 2021, project website: this https URL
Link: https://arxiv.org/abs/2106.09965
Abstract: In this work, we propose a high fidelity face swapping method, called HifiFace, which can well preserve the face shape of the source face and generate photo-realistic results. Unlike other existing face swapping works that only use a face recognition model to keep the identity similarity, we propose 3D shape-aware identity to control the face shape with the geometric supervision from a 3DMM and a 3D face reconstruction method. Meanwhile, we introduce the Semantic Facial Fusion module to optimize the combination of encoder and decoder features and make adaptive blending, which makes the results more photo-realistic. Extensive experiments on faces in the wild demonstrate that our method can preserve identity better, especially the face shape, and can generate more photo-realistic results than previous state-of-the-art methods.

【4】 Medical Matting: A New Perspective on Medical Segmentation with Uncertainty

Authors: Lin Wang, Lie Ju, Donghao Zhang, Xin Wang, Wanji He, Yelin Huang, Zhiwen Yang, Xuan Yao, Xin Zhao, Xiufen Ye, Zongyuan Ge
Affiliations: Harbin Engineering University, Harbin, Heilongjiang, China; Monash Medical AI Group, Monash University, Clayton, VIC, Australia; Airdoc Co., Ltd, Beijing, China
Link: https://arxiv.org/abs/2106.09887
Abstract: In medical image segmentation, it is difficult to mark ambiguous areas accurately with binary masks, especially when dealing with small lesions. Therefore, it is a challenge for radiologists to reach a consensus by using binary masks under the condition of multiple annotations. However, these areas may contain anatomical structures that are conducive to diagnosis. Uncertainty is introduced to study these situations. Nevertheless, the uncertainty is usually measured by the variance between predictions across multiple trials. It is not intuitive, and there is no exact correspondence in the image. Inspired by image matting, we introduce matting into medical scenes as a soft segmentation method and a new perspective for dealing with and representing uncertain regions, namely medical matting. More specifically, because there is no available medical matting dataset, we first labeled two medical datasets with alpha mattes. Secondly, the matting methods applied to natural images are not suitable for the medical scene, so we propose a new architecture to generate binary masks and alpha mattes in sequence. Thirdly, the uncertainty map is introduced to highlight the ambiguous regions from the binary results and improve the matting performance. Evaluated on these datasets, the proposed model outperformed state-of-the-art matting algorithms by a large margin, and the alpha matte is proven to be a more efficient labeling form than a binary mask.

【5】 Analyzing Adversarial Robustness of Deep Neural Networks in Pixel Space: a Semantic Perspective

Authors: Lina Wang, Xingshu Chen, Yulong Wang, Yawei Yue, Yi Zhu, Xuemei Zeng, Wei Wang
Notes: 13 pages, 6 figures
Link: https://arxiv.org/abs/2106.09872
Abstract: The vulnerability of deep neural networks to adversarial examples, which are crafted maliciously by modifying the inputs with imperceptible perturbations to mislead the network into producing incorrect outputs, reveals the lack of robustness and poses security concerns. Previous works study the adversarial robustness of image classifiers at the image level and use all the pixel information in an image indiscriminately, lacking exploration of regions with different semantic meanings in the pixel space of an image. In this work, we fill this gap and explore the pixel space of the adversarial image by proposing an algorithm that looks for possible perturbations pixel by pixel in different regions of the segmented image. The extensive experimental results on CIFAR-10 and ImageNet verify that searching for the modified pixel in only some pixels of an image can successfully launch one-pixel adversarial attacks without requiring all the pixels of the entire image, and that there exist multiple vulnerable points scattered in different regions of an image. We also demonstrate that the adversarial robustness of different regions of the image varies with the amount of semantic information contained.

【6】 CT Image Synthesis Using Weakly Supervised Segmentation and Geometric Inter-Label Relations For COVID Image Analysis

Authors: Dwarikanath Mahapatra, Ankur Singh
Affiliations: Inception Institute of Artificial Intelligence, Abu Dhabi, UAE; Indian Institute of Technology, Kanpur, India
Notes: arXiv admin note: substantial text overlap with arXiv:2003.14119; text overlap with arXiv:1908.10555, arXiv:2004.14133 by other authors
Link: https://arxiv.org/abs/2106.10230
Abstract: While medical image segmentation is an important task for computer aided diagnosis, the high expertise requirement for pixelwise manual annotations makes it a challenging and time consuming task. Since conventional data augmentations do not fully represent the underlying distribution of the training set, the trained models have varying performance when tested on images captured from different sources. Most prior work on image synthesis for data augmentation ignores the interleaved geometric relationship between different anatomical labels. We propose improvements over previous GAN-based medical image synthesis methods by learning the relationship between different anatomical labels. We use a weakly supervised segmentation method to obtain a pixel level semantic label map of images, which is used to learn the intrinsic relationship of geometry and shape across semantic labels. Latent space variable sampling results in diverse generated images from a base image and improves robustness. We use the synthetic images from our method to train networks for segmenting COVID-19 infected areas from lung CT images. The proposed method outperforms state-of-the-art segmentation methods on a public dataset. Ablation studies also demonstrate benefits of integrating geometry and diversity.

【7】 Hybrid graph convolutional neural networks for landmark-based anatomical segmentation

Authors: Nicolás Gaggion, Lucas Mansilla, Diego Milone, Enzo Ferrante
Affiliations: Research Institute for Signals, Systems and Computational Intelligence, sinc(i), CONICET, Universidad Nacional del Litoral, Santa Fe, Argentina
Notes: Accepted for publication at MICCAI 2021
Link: https://arxiv.org/abs/2106.09832
Abstract: In this work we address the problem of landmark-based segmentation for anatomical structures. We propose HybridGNet, an encoder-decoder neural architecture which combines standard convolutions for image feature encoding with graph convolutional neural networks to decode plausible representations of anatomical structures. We benchmark the proposed architecture against other standard landmark and pixel-based models for anatomical segmentation in chest x-ray images, and find that HybridGNet is more robust to image occlusions. We also show that it can be used to construct landmark-based segmentations from pixel level annotations. Our experimental results suggest that HybridGNet produces accurate and anatomically plausible landmark-based segmentations, by naturally incorporating shape constraints within the decoding process via spectral convolutions.

【8】 AtrialGeneral: Domain Generalization for Left Atrial Segmentation of Multi-Center LGE MRIs

Authors: Lei Li, Veronika A. Zimmer, Julia A. Schnabel, Xiahai Zhuang
Affiliations: School of Data Science, Fudan University, Shanghai, China; School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK
Notes: 10 pages, 4 figures, MICCAI2021
Link: https://arxiv.org/abs/2106.08727
Abstract: Left atrial (LA) segmentation from late gadolinium enhanced magnetic resonance imaging (LGE MRI) is a crucial step needed for planning the treatment of atrial fibrillation. However, automatic LA segmentation from LGE MRI is still challenging, due to the poor image quality, high variability in LA shapes, and unclear LA boundary. Though deep learning-based methods can provide promising LA segmentation results, they often generalize poorly to unseen domains, such as data from different scanners and/or sites. In this work, we collect 210 LGE MRIs from different centers with different levels of image quality. To evaluate the domain generalization ability of models on the LA segmentation task, we employ four commonly used semantic segmentation networks for the LA segmentation from multi-center LGE MRIs. Besides, we investigate three domain generalization strategies, i.e., histogram matching, mutual information based disentangled representation, and random style transfer, where simple histogram matching proves to be the most effective.
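
The AtrialGeneral abstract above names histogram matching as the most effective of its three domain generalization strategies but does not describe an implementation. The sketch below shows one common quantile-mapping formulation in plain Python, operating on flat lists of intensities. The function name and toy values are assumptions for illustration, and the authors' actual preprocessing may differ.

```python
from bisect import bisect_right


def match_histogram(source, reference):
    """Map each source intensity to the reference intensity that sits
    at the same empirical quantile, so the output follows the
    reference intensity distribution while keeping the source ordering.
    """
    src_sorted = sorted(source)
    ref_sorted = sorted(reference)
    n_src, n_ref = len(src_sorted), len(ref_sorted)
    matched = []
    for value in source:
        # Number of source intensities <= value (empirical CDF numerator).
        rank = bisect_right(src_sorted, value)
        # Index of the reference intensity at the same quantile,
        # computed with integer arithmetic: ceil(rank * n_ref / n_src) - 1.
        idx = (rank * n_ref + n_src - 1) // n_src - 1
        matched.append(ref_sorted[idx])
    return matched


# Toy "scans": a dark image from one scanner, a brighter reference
# from another (values are made up).
dark_scan = [0, 1, 2, 3]
bright_reference = [10, 20, 30, 40]
harmonized = match_histogram(dark_scan, bright_reference)
```

Because the mapping depends only on rank, it preserves which voxels are brighter than which, and changes only the intensity scale, which is one plausible reason such a simple step can help across scanners.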

Zero/Few Shot|Transfer|Domain Adaptation|Adaptive (1 paper)

【1】 Guided Integrated Gradients: An Adaptive Path Method for Removing Noise

Authors: Andrei Kapishnikov, Subhashini Venugopalan, Besim Avci, Ben Wedin, Michael Terry, Tolga Bolukbasi Affiliations: Google Research Link: https://arxiv.org/abs/2106.09788 Abstract: Integrated Gradients (IG) is a commonly used feature attribution method for deep neural networks. While IG has many desirable properties, the method often produces spurious/noisy pixel attributions in regions that are not related to the predicted class when applied to visual models. While this has been previously noted, most existing solutions are aimed at addressing the symptoms by explicitly reducing the noise in the resulting attributions. In this work, we show that one of the causes of the problem is the accumulation of noise along the IG path. To minimize the effect of this source of noise, we propose adapting the attribution path itself -- conditioning the path not just on the image but also on the model being explained. We introduce Adaptive Path Methods (APMs) as a generalization of path methods, and Guided IG as a specific instance of an APM. Empirically, Guided IG creates saliency maps better aligned with the model's prediction and the input image that is being explained. We show through qualitative and quantitative experiments that Guided IG outperforms other, related methods in nearly every experiment.
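
For readers unfamiliar with the baseline method, the following is a minimal sketch of plain Integrated Gradients on a straight-line path, the method whose noise accumulation the abstract above discusses. It is not Guided IG. The toy model, its analytic gradient `toy_grad`, and the midpoint-rule approximation are assumptions chosen so the example runs without any deep learning library.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral_0^1 dF/dx_i(b + a*(x - b)) da.

    `grad_fn` returns the gradient of the model output at a point.
    The integral is approximated with the midpoint rule over `steps` points.
    """
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i, g in enumerate(grad):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def toy_model(p):
    # Hypothetical scalar "model": F(x) = x0^2 + 3*x1.
    return p[0] ** 2 + 3.0 * p[1]


def toy_grad(p):
    # Analytic gradient of toy_model.
    return [2.0 * p[0], 3.0]


if __name__ == "__main__":
    attributions = integrated_gradients(toy_grad, [2.0, 1.0], [0.0, 0.0])
    # Completeness check: attributions should sum to F(x) - F(baseline).
    print(attributions, sum(attributions), toy_model([2.0, 1.0]) - toy_model([0.0, 0.0]))
```

An adaptive path method replaces the fixed straight line with a path chosen using the model itself; the attribution formula along the path keeps the same structure.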

Semi-/Weakly-/Unsupervised|Active Learning|Uncertainty (1 paper)

【1】 Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

Authors: Martine Toering, Ioannis Gatopoulos, Maarten Stol, Vincent Tao Hu Affiliations: University of Amsterdam, BrainCreators B.V. Link: https://arxiv.org/abs/2106.10137 Abstract: Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. They are not suitable for exploiting the rich dynamical structure of video however, as operations are done on many augmented instances. In this paper we propose "Video Cross-Stream Prototypical Contrasting", a novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples. Specifically, we alternate the optimization process; while optimizing one of the streams, all views are mapped to one set of stream prototype vectors. Each of the assignments is predicted with all views except the one matching the prediction, pushing representations closer to their assigned prototypes. As a result, more efficient video embeddings with ingrained motion information are learned, without the explicit need for optical flow computation during inference. We obtain state-of-the-art results on nearest neighbour video retrieval and action recognition, outperforming previous best by +3.2% on UCF101 using the S3D backbone (90.5% Top-1 acc), and by +7.2% on UCF101 and +15.1% on HMDB51 using the R(2+1)D backbone.

Temporal|Action Recognition|Pose|Video|Motion Estimation (1 paper)

【1】 Discerning Generic Event Boundaries in Long-Form Wild Videos

Authors: Ayush K Rai, Tarun Krishna, Julia Dietlmeier, Kevin McGuinness, Alan F Smeaton, Noel E O'Connor Affiliations: Insight Centre for Data Analytics, Dublin City University, Ireland Comments: Technical Report for Generic Event Boundary Challenge - LOVEU Challenge (CVPR 2021) Link: https://arxiv.org/abs/2106.10090 Abstract: Detecting generic, taxonomy-free event boundaries in videos represents a major stride forward towards holistic video understanding. In this paper we present a technique for generic event boundary detection based on a two stream inflated 3D convolutions architecture, which can learn spatio-temporal features from videos. Our work is inspired from the Generic Event Boundary Detection Challenge (part of CVPR 2021 Long Form Video Understanding - LOVEU Workshop). Throughout the paper we provide an in-depth analysis of the experiments performed along with an interpretation of the results obtained.

Medical (4 papers)

【1】 Development of a conversing and body temperature scanning autonomously navigating robot to help screen for COVID-19

Authors: Ryan Kim Link: https://arxiv.org/abs/2106.09894 Abstract: Throughout the COVID-19 pandemic, the most common symptom displayed by patients has been a fever, leading to the use of temperature scanning as a preemptive measure to detect potential carriers of the virus. Human employees with handheld thermometers have been used to fulfill this task, however this puts them at risk as they cannot be physically distanced and the sequential nature of this method leads to great inconveniences and inefficiency. The proposed solution is an autonomously navigating robot capable of conversing and scanning people's temperature to detect fevers and help screen for COVID-19. To satisfy this objective, the robot must be able to (1) navigate autonomously, (2) detect and track people, and (3) get individuals' temperature reading and converse with them if it exceeds 38°C. An autonomously navigating mobile robot is used with a manipulator controlled using a face tracking algorithm, and an end effector consisting of a thermal camera, smartphone, and chatbot. The goal is to develop a functioning solution that performs the above tasks. In addition, technical challenges encountered and their engineering solutions will be presented, and recommendations will be made for enhancements that could be incorporated when approaching commercialization.

【2】 Medical Image Analysis on Left Atrial LGE MRI for Atrial Fibrillation Studies: A Review

Authors: Lei Li, Veronika A. Zimmer, Julia A. Schnabel, Xiahai Zhuang Affiliations: School of Data Science, Fudan University, Shanghai, China, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China, School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK Comments: 23 pages Link: https://arxiv.org/abs/2106.09862 Abstract: Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is commonly used to visualize and quantify left atrial (LA) scars. The position and extent of scars provide important information of the pathophysiology and progression of atrial fibrillation (AF). Hence, LA scar segmentation and quantification from LGE MRI can be useful in computer-assisted diagnosis and treatment stratification of AF patients. Since manual delineation can be time-consuming and subject to intra- and inter-expert variability, automating this computing is highly desired, which nevertheless is still challenging and under-researched. This paper aims to provide a systematic review on computing methods for LA cavity, wall, scar and ablation gap segmentation and quantification from LGE MRI, and the related literature for AF studies. Specifically, we first summarize AF-related imaging techniques, particularly LGE MRI. Then, we review the methodologies of the four computing tasks in detail, and summarize the validation strategies applied in each task. Finally, the possible future developments are outlined, with a brief survey on the potential clinical applications of the aforementioned methods. The review shows that the research into this topic is still in early stages. Although several methods have been proposed, especially for LA segmentation, there is still large scope for further algorithmic developments due to performance issues related to the high variability of enhancement appearance and differences in image acquisition.

【3】 Deep reinforcement learning with automated label extraction from clinical reports accurately classifies 3D MRI brain volumes

Authors: Joseph Stember, Hrithwik Shalu Affiliations: Memorial Sloan Kettering Cancer Center, New York, NY, US, Indian Institute of Technology, Madras, Chennai, India Link: https://arxiv.org/abs/2106.09812 Abstract: Purpose: Image classification is perhaps the most fundamental task in imaging AI. However, labeling images is time-consuming and tedious. We have recently demonstrated that reinforcement learning (RL) can classify 2D slices of MRI brain images with high accuracy. Here we make two important steps toward speeding image classification: Firstly, we automatically extract class labels from the clinical reports. Secondly, we extend our prior 2D classification work to fully 3D image volumes from our institution. Hence, we proceed as follows: in Part 1, we extract labels from reports automatically using the SBERT natural language processing approach. Then, in Part 2, we use these labels with RL to train a classification Deep-Q Network (DQN) for 3D image volumes. Methods: For Part 1, we trained SBERT with 90 radiology report impressions. We then used the trained SBERT to predict class labels for use in Part 2. In Part 2, we applied multi-step image classification to allow for combined Deep-Q learning using 3D convolutions and TD(0) Q learning. We trained on a set of 90 images. We tested on a separate set of 61 images, again using the classes predicted from patient reports by the trained SBERT in Part 1. For comparison, we also trained and tested a supervised deep learning classification network on the same set of training and testing images using the same labels. Results: Part 1: Upon training with the corpus of radiology reports, the SBERT model had 100% accuracy for both normal and metastasis-containing scans. Part 2: Then, using these labels, whereas the supervised approach quickly overfit the training data and as expected performed poorly on the testing set (66% accuracy, just over random guessing), the reinforcement learning approach achieved an accuracy of 92%. The results were found to be statistically significant, with a p-value of 3.1 x 10^-5.
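
The TD(0) Q-learning rule mentioned in the abstract above can be written in a few lines. The sketch below is a tabular stand-in, not the paper's Deep-Q Network with 3D convolutions: the dictionary `q_table`, the state names, and the two actions ("normal", "metastasis") are hypothetical placeholders chosen to show the one-step update Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

```python
def td0_q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one TD(0) Q-learning update in place and return the TD error.

    `q_table` maps state -> {action: value}. A missing `next_state` is
    treated as terminal, so its bootstrap value is 0.
    """
    next_values = q_table.get(next_state)
    best_next = max(next_values.values()) if next_values else 0.0
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return td_error


if __name__ == "__main__":
    # Hypothetical two-state example with two classification actions.
    q = {
        "s0": {"normal": 0.0, "metastasis": 0.0},
        "s1": {"normal": 1.0, "metastasis": 2.0},
    }
    error = td0_q_update(q, "s0", "normal", reward=1.0, next_state="s1", alpha=0.1, gamma=0.5)
    print(error, q["s0"]["normal"])
```

In a DQN the table lookup is replaced by a network forward pass and the in-place update by a gradient step on the squared TD error.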

【4】 Synthetic COVID-19 Chest X-ray Dataset for Computer-Aided Diagnosis

Authors: Hasib Zunair, A. Ben Hamza Affiliations: Concordia University Link: https://arxiv.org/abs/2106.09759 Abstract: We introduce a new dataset called Synthetic COVID-19 Chest X-ray Dataset for training machine learning models. The dataset consists of 21,295 synthetic COVID-19 chest X-ray images to be used for computer-aided diagnosis. These images, generated via an unsupervised domain adaptation approach, are of high quality. We find that the synthetic images not only improve performance of various deep learning architectures when used as additional training data under heavy imbalance conditions, but also detect the target class with high confidence. We also find that comparable performance can also be achieved when trained only on synthetic images. Further, salient features of the synthetic COVID-19 images indicate that the distribution is significantly different from Non-COVID-19 classes, enabling a proper decision boundary. We hope the availability of such high fidelity chest X-ray images of COVID-19 will encourage advances in the development of diagnostic and/or management tools.

GAN|Adversarial|Attack|Generation (7 papers)

【1】 Residual Error: a New Performance Measure for Adversarial Robustness

Authors: Hossein Aboutalebi, Mohammad Javad Shafiee, Michelle Karg, Christian Scharfenberger, Alexander Wong Affiliations: Waterloo AI Institute, University of Waterloo, Waterloo, Ontario, Canada, ADC Automotive Distance Control Systems GmbH, Continental, Germany, DarwinAI Corp., Canada Link: https://arxiv.org/abs/2106.10212 Abstract: Despite the significant advances in deep learning over the past decade, a major challenge that limits the wide-spread adoption of deep learning has been their fragility to adversarial attacks. This sensitivity to making erroneous predictions in the presence of adversarially perturbed data makes deep neural networks difficult to adopt for certain real-world, mission-critical applications. While much of the research focus has revolved around adversarial example creation and adversarial hardening, the area of performance measures for assessing adversarial robustness is not well explored. Motivated by this, this study presents the concept of residual error, a new performance measure for not only assessing the adversarial robustness of a deep neural network at the individual sample level, but also can be used to differentiate between adversarial and non-adversarial examples to facilitate for adversarial example detection. Furthermore, we introduce a hybrid model for approximating the residual error in a tractable manner. Experimental results using the case of image classification demonstrates the effectiveness and efficacy of the proposed residual error metric for assessing several well-known deep neural network architectures. These results thus illustrate that the proposed measure could be a useful tool for not only assessing the robustness of deep neural networks used in mission-critical scenarios, but also in the design of adversarially robust models.

【2】 World-GAN: a Generative Model for Minecraft Worlds

Authors: Maren Awiszus, Frederik Schubert, Bodo Rosenhahn Affiliations: Institut für Informationsverarbeitung, Leibniz University Hannover, Hannover, Germany Comments: 8 pages, 8 figures, IEEE Conference on Games (CoG) 2021 Link: https://arxiv.org/abs/2106.10155 Abstract: This work introduces World-GAN, the first method to perform data-driven Procedural Content Generation via Machine Learning in Minecraft from a single example. Based on a 3D Generative Adversarial Network (GAN) architecture, we are able to create arbitrarily sized world snippets from a given sample. We evaluate our approach on creations from the community as well as structures generated with the Minecraft World Generator. Our method is motivated by the dense representations used in Natural Language Processing (NLP) introduced with word2vec [1]. The proposed block2vec representations make World-GAN independent from the number of different blocks, which can vary a lot in Minecraft, and enable the generation of larger levels. Finally, we demonstrate that changing this new representation space allows us to change the generated style of an already trained generator. World-GAN enables its users to generate Minecraft worlds based on parts of their creations.

【3】 Indicators of Attack Failure: Debugging and Improving Optimization of Adversarial Examples

Authors: Maura Pintor, Luca Demetrio, Angelo Sotgiu, Giovanni Manca, Ambra Demontis, Nicholas Carlini, Battista Biggio, Fabio Roli Affiliations: University of Cagliari, Italy, Pluribus One, Google Link: https://arxiv.org/abs/2106.09947 Abstract: Evaluating robustness of machine-learning models to adversarial examples is a challenging problem. Many defenses have been shown to provide a false sense of security by causing gradient-based attacks to fail, and they have been broken under more rigorous evaluations. Although guidelines and best practices have been suggested to improve current adversarial robustness evaluations, the lack of automatic testing and debugging tools makes it difficult to apply these recommendations in a systematic manner. In this work, we overcome these limitations by (i) defining a set of quantitative indicators which unveil common failures in the optimization of gradient-based attacks, and (ii) proposing specific mitigation strategies within a systematic evaluation protocol. Our extensive experimental analysis shows that the proposed indicators of failure can be used to visualize, debug and improve current adversarial robustness evaluations, providing a first concrete step towards automatizing and systematizing current adversarial robustness evaluations. Our open-source code is available at: https://github.com/pralab/IndicatorsOfAttackFailure.

【4】 Evolving GANs: When Contradictions Turn into Compliance

Authors: Sauptik Dhar, Javad Heydari, Samarth Tripathi, Unmesh Kurup, Mohak Shah Affiliations: America Research Lab, LG Electronics, Great America Pkwy, Santa Clara, CA, USA Comments: Generative Adversarial Networks, Universum Learning, Semi-Supervised Learning Link: https://arxiv.org/abs/2106.09946 Abstract: Limited availability of labeled-data makes any supervised learning problem challenging. Alternative learning settings like semi-supervised and universum learning alleviate the dependency on labeled data, but still require a large amount of unlabeled data, which may be unavailable or expensive to acquire. GAN-based synthetic data generation methods have recently shown promise by generating synthetic samples to improve task at hand. However, these samples cannot be used for other purposes. In this paper, we propose a GAN game which provides improved discriminator accuracy under limited data settings, while generating realistic synthetic data. This provides the added advantage that now the generated data can be used for other similar tasks. We provide the theoretical guarantees and empirical results in support of our approach.

【5】 A Unified Generative Adversarial Network Training via Self-Labeling and Self-Attention

Authors: Tomoki Watanabe, Paolo Favaro Affiliations: University of Bern (this work was done while the first author was visiting the University of Bern) Link: https://arxiv.org/abs/2106.09914 Abstract: We propose a novel GAN training scheme that can handle any level of labeling in a unified manner. Our scheme introduces a form of artificial labeling that can incorporate manually defined labels, when available, and induce an alignment between them. To define the artificial labels, we exploit the assumption that neural network generators can be trained more easily to map nearby latent vectors to data with semantic similarities, than across separate categories. We use generated data samples and their corresponding artificial conditioning labels to train a classifier. The classifier is then used to self-label real data. To boost the accuracy of the self-labeling, we also use the exponential moving average of the classifier. However, because the classifier might still make mistakes, especially at the beginning of the training, we also refine the labels through self-attention, by using the labeling of real data samples only when the classifier outputs a high classification probability score. We evaluate our approach on CIFAR-10, STL-10 and SVHN, and show that both self-labeling and self-attention consistently improve the quality of generated data. More surprisingly, we find that the proposed scheme can even outperform class-conditional GANs.
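
Two mechanisms named in the abstract above, the exponential moving average of the classifier and keeping a self-assigned label only when the predicted probability is high, can be sketched as follows. This is an illustrative sketch and not the authors' implementation: the function names, the flat weight lists, the decay of 0.999, and the 0.9 threshold are all assumed values.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Return the exponential moving average of classifier weights.

    Both arguments are flat lists of floats standing in for model parameters.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def confident_labels(class_probs, threshold=0.9):
    """Keep (sample_index, label) pairs whose top class probability
    reaches the threshold; low-confidence samples stay unlabeled."""
    kept = []
    for index, probs in enumerate(class_probs):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            kept.append((index, label))
    return kept


if __name__ == "__main__":
    # Hypothetical classifier outputs for three real images and two classes.
    probs = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]]
    print(confident_labels(probs))
    print(ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9))
```

Filtering by confidence trades label coverage for label accuracy, which matters most early in training when the classifier is still unreliable.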

【6】 Light Lies: Optical Adversarial Attack

Authors: Kyu-Lim Kim, Jeong-Soo Kim, Seung-Ri Song, Jun-Ho Choi, Chul-Min Joo, Jong-Seok Lee Affiliations: Department of Artificial Intelligence, Yonsei University, South Korea, Department of Mechanical Engineering, South Korea, School of Integrated Technology Comments: 11 pages, 4 figures Link: https://arxiv.org/abs/2106.09908 Abstract: A significant amount of work has been done on adversarial attacks that inject imperceptible noise to images to deteriorate the image classification performance of deep models. However, most of the existing studies consider attacks in the digital (pixel) domain where an image acquired by an image sensor with sampling and quantization has been recorded. This paper, for the first time, introduces an optical adversarial attack, which physically alters the light field information arriving at the image sensor so that the classification model yields misclassification. More specifically, we modulate the phase of the light in the Fourier domain using a spatial light modulator placed in the photographic system. The operative parameters of the modulator are obtained by gradient-based optimization to maximize cross-entropy and minimize distortions. We present experiments based on both simulation and a real hardware optical system, from which the feasibility of the proposed optical attack is demonstrated. It is also verified that the proposed attack is completely different from common optical-domain distortions such as spherical aberration, defocus, and astigmatism in terms of both perturbation patterns and classification results.

【7】 Dual-Teacher Class-Incremental Learning With Data-Free Generative Replay

Authors: Yoojin Choi, Mostafa El-Khamy, Jungwon Lee Affiliations: SoC R&D, Samsung Semiconductor Inc., System LSI, Samsung Electronics, San Diego, CA, USA, South Korea Comments: CVPR 2021 Workshop on Continual Learning in Computer Vision (CLVision) Link: https://arxiv.org/abs/2106.09835 Abstract: This paper proposes two novel knowledge transfer techniques for class-incremental learning (CIL). First, we propose data-free generative replay (DF-GR) to mitigate catastrophic forgetting in CIL by using synthetic samples from a generative model. In the conventional generative replay, the generative model is pre-trained for old data and shared in extra memory for later incremental learning. In our proposed DF-GR, we train a generative model from scratch without using any training data, based on the pre-trained classification model from the past, so we curtail the cost of sharing pre-trained generative models. Second, we introduce dual-teacher information distillation (DT-ID) for knowledge distillation from two teachers to one student. In CIL, we use DT-ID to learn new classes incrementally based on the pre-trained model for old classes and another model (pre-)trained on the new data for new classes. We implemented the proposed schemes on top of one of the state-of-the-art CIL methods and showed the performance improvement on CIFAR-100 and ImageNet datasets.

Autonomous Driving|Vehicles|Lane Detection, etc. (2 papers)

【1】 A Dynamic Spatial-temporal Attention Network for Early Anticipation of Traffic Accidents

Authors: Muhammad Monjurul Karim, Yu Li, Ruwen Qin, Zhaozheng Yin Affiliations: Stony Brook University (Zhaozheng Yin is with the Department of Biomedical Informatics and AI Institute) Comments: 10 pages, 4 figures, submitted to a journal Link: https://arxiv.org/abs/2106.10197 Abstract: Recently, autonomous vehicles and those equipped with an Advanced Driver Assistance System (ADAS) are emerging. They share the road with regular ones operated by human drivers entirely. To ensure guaranteed safety for passengers and other road users, it becomes essential for autonomous vehicles and ADAS to anticipate traffic accidents from natural driving scenes. The dynamic spatial-temporal interaction of the traffic agents is complex, and visual cues for predicting a future accident are embedded deeply in dashcam video data. Therefore, early anticipation of traffic accidents remains a challenge. To this end, the paper presents a dynamic spatial-temporal attention (DSTA) network for early anticipation of traffic accidents from dashcam videos. The proposed DSTA-network learns to select discriminative temporal segments of a video sequence with a module named Dynamic Temporal Attention (DTA). It also learns to focus on the informative spatial regions of frames with another module named Dynamic Spatial Attention (DSA). The spatial-temporal relational features of accidents, along with scene appearance features, are learned jointly with a Gated Recurrent Unit (GRU) network. The experimental evaluation of the DSTA-network on two benchmark datasets confirms that it has exceeded the state-of-the-art performance. A thorough ablation study evaluates the contributions of individual components of the DSTA-network, revealing how the network achieves such performance. Furthermore, this paper proposes a new strategy that fuses the prediction scores from two complementary models and verifies its effectiveness in further boosting the performance of early accident anticipation.

【2】 A Framework for Real-time Traffic Trajectory Tracking, Speed Estimation, and Driver Behavior Calibration at Urban Intersections Using Virtual Traffic Lanes

Authors: Awad Abdelhalim, Montasir Abbas, Bhavi Bharat Kotha, Alfred Wicks Affiliations: Department of Civil and Environmental Engineering, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA, Department of Mechanical Engineering Link: https://arxiv.org/abs/2106.09932 Abstract: In a previous study, we presented VT-Lane, a three-step framework for real-time vehicle detection, tracking, and turn movement classification at urban intersections. In this study, we present a case study incorporating the highly accurate trajectories and movement classification obtained via VT-Lane for the purpose of speed estimation and driver behavior calibration for traffic at urban intersections. First, we use a highly instrumented vehicle to verify the estimated speeds obtained from video inference. The results of the speed validation show that our method can estimate the average travel speed of detected vehicles in real-time with an error of 0.19 m/sec, which is equivalent to 2% of the average observed travel speeds in the intersection of the study. Instantaneous speeds (at the resolution of 30 Hz) were found to be estimated with an average error of 0.21 m/sec and 0.86 m/sec respectively for free-flowing and congested traffic conditions. We then use the estimated speeds to calibrate the parameters of a driver behavior model for the vehicles in the area of study. The results show that the calibrated model replicates the driving behavior with an average error of 0.45 m/sec, indicating the high potential for using this framework for automated, large-scale calibration of car-following models from roadside traffic video data, which can lead to substantial improvements in traffic modeling via microscopic simulation.

Attention (1 paper)

【1】 Multi-Granularity Network with Modal Attention for Dense Affective Understanding

Authors: Baoming Yan, Lin Wang, Ke Gao, Bo Gao, Xiao Liu, Chao Ban, Jiang Yang, Xiaobo Li Affiliations: Alibaba Group Comments: Oral presentation at AUVi Workshop - CVPR 2021 Link: https://arxiv.org/abs/2106.09964 Abstract: Video affective understanding, which aims to predict the evoked expressions by the video content, is desired for video creation and recommendation. In the recent EEV challenge, a dense affective understanding task is proposed and requires frame-level affective prediction. In this paper, we propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features for better description of the target frame. Specifically, the multi-granularity features could be divided into frame-level, clips-level and video-level features, which corresponds to visual-salient content, semantic-context and video theme information. Then the modal attention fusion module is designed to fuse the multi-granularity features and emphasize more affection-relevant modals. Finally, the fused feature is fed into a Mixtures Of Experts (MOE) classifier to predict the expressions. Further employing model-ensemble post-processing, the proposed method achieves the correlation score of 0.02292 in the EEV challenge.
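
One common way to realize an attention-based fusion of several feature streams is a softmax-weighted sum. The sketch below assumes that form purely for illustration; the abstract above does not specify how the MGN-MA fusion module is built, and the function names, the toy 2-D features, and the hand-set relevance scores (which would be learned in practice) are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, relevance_scores):
    """Weighted sum of equally sized feature vectors.

    `features` holds one vector per granularity (frame, clip, video);
    `relevance_scores` holds one scalar per vector.
    """
    weights = softmax(relevance_scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


if __name__ == "__main__":
    frame, clip, video = [3.0, 0.0], [0.0, 3.0], [3.0, 3.0]
    # Equal scores reduce the fusion to a plain average of the three vectors.
    print(fuse_features([frame, clip, video], [0.0, 0.0, 0.0]))
```

Raising one score relative to the others shifts the fused vector toward that granularity, which is the sense in which the more relevant features are emphasized.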

Tracking (1 paper)

【1】 Towards Distraction-Robust Active Visual Tracking

Authors: Fangwei Zhong, Peng Sun, Wenhan Luo, Tingyun Yan, Yizhou Wang Comments: To appear in ICML 2021 Link: https://arxiv.org/abs/2106.10110 Abstract: Active visual tracking is notoriously difficult when distracting objects appear, as distractors often mislead the tracker by occluding the target or presenting a confusing appearance. To address this issue, we propose a mixed cooperative-competitive multi-agent game, where a target and multiple distractors form a collaborative team to play against a tracker and make it fail to follow. Through learning in our game, diverse distracting behaviors of the distractors naturally emerge, thereby exposing the tracker's weaknesses, which helps enhance the distraction-robustness of the tracker. For effective learning, we then present a set of practical methods, including a reward function for distractors, a cross-modal teacher-student learning strategy, and a recurrent attention mechanism for the tracker. The experimental results show that our tracker achieves the desired distraction-robust active visual tracking and generalizes well to unseen environments. We also show that the multi-agent game can be used to adversarially test the robustness of trackers.

Image/Video Retrieval | Re-ID (1 paper)

【1】 Non-Iterative Phase Retrieval With Cascaded Neural Networks

Authors: Tobias Uelwer, Tobias Hoffmann, Stefan Harmeling Affiliations: Department of Computer Science, Heinrich Heine University Düsseldorf, Germany Comments: Accepted at the 30th International Conference on Artificial Neural Networks (ICANN 2021) Link: https://arxiv.org/abs/2106.10195 Abstract: Fourier phase retrieval is the problem of reconstructing a signal given only the magnitude of its Fourier transform. Optimization-based approaches, like the well-established Gerchberg-Saxton or the hybrid input-output algorithm, struggle to reconstruct images from magnitudes that are not oversampled. This motivates the application of learned methods, which allow reconstruction from non-oversampled magnitude measurements after a learning phase. In this paper, we want to push the limits of these learned methods by means of a deep neural network cascade that reconstructs the image successively at different resolutions from its non-oversampled Fourier magnitude. We evaluate our method on four different datasets (MNIST, EMNIST, Fashion-MNIST, and KMNIST) and demonstrate that it yields improved performance over other non-iterative methods and optimization-based methods.
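To see why recovering a signal from its Fourier magnitude alone is hard, it helps to check that different signals can share the same magnitude. The sketch below is only an illustration of that ambiguity, not the cascaded network from the paper; it uses a naive DFT on a hypothetical 8-sample signal and compares the magnitudes of the original, a circularly shifted copy, and a reversed copy.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform, O(n^2), fine for tiny inputs."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def fourier_magnitude(signal):
    """The only measurement available in Fourier phase retrieval."""
    return [abs(c) for c in dft(signal)]


signal = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0]
shifted = signal[-2:] + signal[:-2]  # circular shift by two samples
flipped = signal[::-1]               # reversal of a real-valued signal

mag = fourier_magnitude(signal)
mag_shifted = fourier_magnitude(shifted)
mag_flipped = fourier_magnitude(flipped)
```

All three magnitude lists agree up to rounding, so any reconstruction method can at best recover the signal up to such shifts and flips.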

Pruning | Quantization | Acceleration | Compression (1 paper)

【1】 Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration

Authors: Qigong Sun, Xiufang Li, Fanhua Shang, Hongying Liu, Kang Yang, Licheng Jiao, Zhouchen Lin Comments: arXiv admin note: substantial text overlap with arXiv:1905.13389 Link: https://arxiv.org/abs/2106.09886 Abstract: The training of deep neural networks (DNNs) always requires intensive resources for both computation and data storage. Thus, DNNs cannot be efficiently applied to mobile phones and embedded devices, which severely limits their applicability in industrial applications. To address this issue, we propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks, which can be efficiently implemented by bitwise operations (i.e., xnor and bitcount) to achieve model compression, computational acceleration, and resource saving. By using our method, users can achieve different encoding precisions arbitrarily according to their requirements and hardware resources. The proposed mechanism is highly suitable for the use of FPGA and ASIC in terms of data storage and computation, which provides a feasible idea for smart chips. We validate the effectiveness of our method on large-scale image classification (e.g., ImageNet), object detection, and semantic segmentation tasks. In particular, our method with low-bit encoding can still achieve almost the same performance as its high-bit counterparts.
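The xnor-and-bitcount idea can be sketched in a few lines. For two {-1, +1} vectors of length n packed into machine words, the dot product equals 2 * popcount(xnor(a, b)) - n, and an odd integer in [-(2^M - 1), 2^M - 1] can be written as a weighted sum of M values from {-1, +1}. The code below illustrates these two standard identities; the function names are hypothetical and the paper's exact encoding may differ.

```python
def pack_bits(signs):
    """Pack -1/+1 values into an int: +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed {-1,+1} vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask  # xnor: 1 where the signs agree
    matches = bin(agree).count("1")    # bitcount
    return 2 * matches - n


def decompose(q, bits):
    """Write odd q in [-(2**bits - 1), 2**bits - 1] as sum(2**i * s_i), s_i in {-1,+1}."""
    u = (q + (1 << bits) - 1) // 2     # unsigned code in [0, 2**bits - 1]
    return [1 if (u >> i) & 1 else -1 for i in range(bits)]


def quantized_dot(x, w, bits):
    """Dot product of two quantized vectors via binary branches only."""
    n = len(x)
    x_planes = list(zip(*(decompose(v, bits) for v in x)))  # one plane per bit
    w_planes = list(zip(*(decompose(v, bits) for v in w)))
    total = 0
    for i, xp in enumerate(x_planes):
        for k, wp in enumerate(w_planes):
            total += (1 << (i + k)) * binary_dot(pack_bits(xp), pack_bits(wp), n)
    return total


x = [3, -1, 1, -3]
w = [1, 3, -3, -1]
result = quantized_dot(x, w, bits=2)
```

Each pair of bit planes is one "branch"; using more planes raises the encoding precision at the cost of more binary dot products.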

Representation Learning (1 paper)

【1】 Equivariance-bridged SO(2)-Invariant Representation Learning using Graph Convolutional Network

Authors: Sungwon Hwang, Hyungtae Lim, Hyun Myung Affiliations: Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea Link: https://arxiv.org/abs/2106.09996 Abstract: Training a Convolutional Neural Network (CNN) to be robust against rotation has mostly been done with data augmentation. In this paper, we highlight another research direction that encourages less dependence on data augmentation by achieving structural rotational invariance of a network. The deep equivariance-bridged SO(2) invariant network is proposed in line with this vision. First, the Self-Weighted Nearest Neighbors Graph Convolutional Network (SWN-GCN) is proposed to implement a Graph Convolutional Network (GCN) on the graph representation of an image to acquire a rotationally equivariant representation, as GCN is more suitable for constructing deeper networks than spectral graph convolution-based approaches. Then, an invariant representation is obtained with Global Average Pooling (GAP), a permutation-invariant operation suitable for aggregating high-dimensional representations, over the equivariant set of vertices retrieved from SWN-GCN. Our method achieves state-of-the-art image classification performance on rotated MNIST and CIFAR-10 images, where the models are trained with a non-augmented dataset only. Quantitative validation of the invariance of the representations also demonstrates strong invariance of the deep representations of SWN-GCN under rotations.
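The last step of this pipeline relies on Global Average Pooling being permutation-invariant: if a rotation only reorders the equivariant vertex features, their mean does not change. A minimal sketch of that property follows, with made-up feature values and a random reordering standing in for the effect of a rotation.

```python
import random


def global_average_pool(vertex_features):
    """Mean of each feature channel over all vertices."""
    n = len(vertex_features)
    dim = len(vertex_features[0])
    return [sum(v[d] for v in vertex_features) / n for d in range(dim)]


# Hypothetical per-vertex features (4 vertices, 2 channels).
features = [[1.0, 0.5], [2.0, -1.0], [4.0, 3.5], [0.0, 2.0]]

# Assumption: a rotation acts on the equivariant features as a vertex permutation.
shuffled = features[:]
random.Random(0).shuffle(shuffled)

pooled = global_average_pool(features)
pooled_shuffled = global_average_pool(shuffled)
```

Both pooled vectors are the same, which is the invariance the abstract attributes to GAP over the SWN-GCN outputs.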

Super-Resolution | Denoising | Deblurring | Dehazing (1 paper)

【1】 Residual Contrastive Learning for Joint Demosaicking and Denoising

Authors: Nanqing Dong, Matteo Maggioni, Yongxin Yang, Eduardo Pérez-Pellitero, Ales Leonardis, Steven McDonagh Affiliations: Department of Computer Science, University of Oxford; Huawei Noah's Ark Lab Link: https://arxiv.org/abs/2106.10070 Abstract: The breakthrough of contrastive learning (CL) has fueled the recent success of self-supervised learning (SSL) in high-level vision tasks on RGB images. However, CL is still ill-defined for low-level vision tasks, such as joint demosaicking and denoising (JDD), in the RAW domain. To bridge this methodological gap, we present a novel CL approach on RAW images, residual contrastive learning (RCL), which aims to learn meaningful representations for JDD. Our work is built on the assumption that the noise contained in each RAW image is signal-dependent, thus two crops from the same RAW image should have more similar noise distributions than two crops from different RAW images. We use residuals as a discriminative feature and the earth mover's distance to measure the distribution divergence for the contrastive loss. To evaluate the proposed CL strategy, we simulate a series of unsupervised JDD experiments with large-scale data corrupted by synthetic signal-dependent noise, where we set a new benchmark for unsupervised JDD tasks with unknown (random) noise variance. Our empirical study not only validates that CL can be applied to distributions (cf. features), but also exposes the lack of robustness of previous non-ML and SSL JDD methods when the statistics of the noise are unknown, thus providing some further insight into signal-dependent noise problems.
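The contrastive signal here is a distance between noise distributions rather than between feature vectors. For one-dimensional samples of equal size, the earth mover's distance reduces to the mean absolute difference between sorted values. The sketch below shows only that reduction, using made-up residual values; it is not the paper's loss.

```python
def emd_1d(sample_a, sample_b):
    """Earth mover's (Wasserstein-1) distance between two equal-size 1-D samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    a, b = sorted(sample_a), sorted(sample_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical residuals (noisy input minus network output) for three crops.
crop_1 = [0.1, -0.2, 0.0, 0.3]     # from RAW image A
crop_2 = [0.0, 0.2, -0.1, 0.1]     # also from RAW image A
crop_3 = [1.0, -0.9, 0.8, -1.1]    # from RAW image B, stronger noise

same_image = emd_1d(crop_1, crop_2)
different_image = emd_1d(crop_1, crop_3)
```

Under the abstract's assumption, `same_image` should be small (a positive pair) and `different_image` large (a negative pair), which is what a contrastive loss would encourage.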

Point Cloud | SLAM | Radar | Lidar | Depth/RGB-D (1 paper)

【1】 Improved Radar Localization on Lidar Maps Using Shared Embedding

Authors: Huan Yin, Yue Wang, Rong Xiong Affiliations: State Key Laboratory of Industrial Control Technology and Institute of Cyber-Systems and Control, Zhejiang University Comments: Extended abstract. Spotlight Talk at Radar Perception for All-Weather Autonomy Workshop of ICRA 2021 Link: https://arxiv.org/abs/2106.10000 Abstract: We present a heterogeneous localization framework for solving radar global localization and pose tracking on pre-built lidar maps. To bridge the gap between sensing modalities, deep neural networks are constructed to create a shared embedding space for radar scans and lidar maps. The learned feature embeddings support similarity measurement, thus improving both map retrieval and data matching. On the RobotCar and MulRan datasets, we demonstrate the effectiveness of the proposed framework in comparison with Scan Context and RaLL. In addition, the proposed pose tracking pipeline uses fewer neural networks than the original RaLL.

Multimodal (1 paper)

【1】 GEM: A General Evaluation Benchmark for Multimodal Tasks

Authors: Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti Affiliations: Bing Multimedia Team, Microsoft, China; Natural Language Computing, Microsoft Research Asia, China; Southwest Jiaotong University, China; ShanghaiTech University, China; Bing Multimedia Team, Microsoft, United States Comments: Accepted by Findings of ACL 2021 Link: https://arxiv.org/abs/2106.09889 Abstract: In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks. Different from existing datasets such as GLUE, SuperGLUE, XGLUE and XTREME that mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark, which consists of GEM-I for image-language tasks and GEM-V for video-language tasks. Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks, and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language tasks and video-language tasks at the same time, but is also labeled in multiple languages. We also provide two baseline models for this benchmark. We will release the dataset, code and baseline models, aiming to advance the development of multilingual multimodal research.

Other Neural Networks | Deep Learning | Models | Modeling (7 papers)

【1】 Steerable Partial Differential Operators for Equivariant Neural Networks

Authors: Erik Jenner, Maurice Weiler Affiliations: University of Amsterdam Comments: 43 pages, 4 figures, code available at this https URL Link: https://arxiv.org/abs/2106.10163 Abstract: Recent work in equivariant deep learning bears strong similarities to physics. Fields over a base space are fundamental entities in both subjects, as are equivariant maps between these fields. In deep learning, however, these maps are usually defined by convolutions with a kernel, whereas they are partial differential operators (PDOs) in physics. Developing the theory of equivariant PDOs in the context of deep learning could bring these subjects even closer together and lead to a stronger flow of ideas. In this work, we derive a $G$-steerability constraint that completely characterizes when a PDO between feature vector fields is equivariant, for arbitrary symmetry groups $G$. We then fully solve this constraint for several important groups. We use our solutions as equivariant drop-in replacements for convolutional layers and benchmark them in that role. Finally, we develop a framework for equivariant maps based on Schwartz distributions that unifies classical convolutions and differential operators and gives insight into the relation between the two.

【2】 Contrastive Learning of Generalized Game Representations

Authors: Chintan Trivedi, Antonios Liapis, Georgios N. Yannakakis Affiliations: Institute of Digital Games, University of Malta, Msida, Malta Comments: 8 pages, 7 figures, CoG Link: https://arxiv.org/abs/2106.10060 Abstract: Representing games through their pixels offers a promising approach for building general-purpose and versatile game models. While games are not merely images, neural network models trained on game pixels often capture differences in the visual style of the image rather than the content of the game. As a result, such models cannot generalize well even within similar games of the same genre. In this paper we build on recent advances in contrastive learning and showcase its benefits for representation learning in games. Learning to contrast images of games not only classifies games in a more efficient manner; it also yields models that separate games in a more meaningful fashion by ignoring the visual style and focusing, instead, on their content. Our results on a large dataset of sports video games containing 100k images across 175 games and 10 game genres suggest that contrastive learning is better suited for learning generalized game representations compared to conventional supervised learning. The findings of this study bring us closer to universal visual encoders for games that can be reused across previously unseen games without requiring retraining or fine-tuning.

【3】 Training or Architecture? How to Incorporate Invariance in Neural Networks

Authors: Kanchana Vaishnavi Gandikota, Jonas Geiping, Zorah Lähner, Adam Czapliński, Michael Moeller Affiliations: University of Siegen Link: https://arxiv.org/abs/2106.10044 Abstract: Many applications require the robustness, or ideally the invariance, of a neural network to certain transformations of input data. Most commonly, this requirement is addressed by either augmenting the training data, using adversarial training, or defining network architectures that include the desired invariance automatically. Unfortunately, the latter often relies on the ability to enumerate all possible transformations, which makes such approaches largely infeasible for infinite sets of transformations, such as arbitrary rotations or scaling. In this work, we propose a method for provably invariant network architectures with respect to group actions by choosing one element from a (possibly continuous) orbit based on a fixed criterion. In a nutshell, we intend to 'undo' any possible transformation before feeding the data into the actual network. We analyze properties of such approaches, extend them to equivariant networks, and demonstrate their advantages in terms of robustness as well as computational efficiency in several numerical examples. In particular, we investigate the robustness with respect to rotations of images (which can possibly hold up to discretization artifacts only) as well as the provable rotational and scaling invariance of 3D point cloud classification.
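The core idea, picking one fixed representative of each orbit before the network sees the input, can be shown on a toy group. The sketch below uses cyclic shifts of a sequence instead of rotations or scalings, and chooses the lexicographically smallest shift as the fixed criterion; the group, the criterion, and `toy_model` are illustrative assumptions rather than the paper's setup.

```python
def canonicalize(seq):
    """Return one fixed representative of the orbit of seq under cyclic shifts."""
    seq = list(seq)
    return min(tuple(seq[i:] + seq[:i]) for i in range(len(seq)))


def toy_model(x):
    """Position-weighted sum: deliberately NOT shift-invariant on its own."""
    return sum((i + 1) * v for i, v in enumerate(x))


def invariant_model(seq):
    """'Undo' the transformation first, then apply the ordinary model."""
    return toy_model(canonicalize(seq))


seq = [3, 1, 4, 1, 5]
shifts = [seq[i:] + seq[:i] for i in range(len(seq))]

raw_outputs = {toy_model(s) for s in shifts}
invariant_outputs = {invariant_model(s) for s in shifts}
```

`raw_outputs` contains several different values, while `invariant_outputs` contains exactly one, because every element of the orbit is mapped to the same canonical input.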

【4】 Learning and Meshing from Deep Implicit Surface Networks Using an Efficient Implementation of Analytic Marching

Authors: Jiabao Lei, Kui Jia, Yi Ma Comments: arXiv admin note: text overlap with arXiv:2002.06597 Link: https://arxiv.org/abs/2106.10031 Abstract: Reconstruction of object or scene surfaces has tremendous applications in computer vision, computer graphics, and robotics. In this paper, we study a fundamental problem in this context: recovering a surface mesh from an implicit field function whose zero-level set captures the underlying surface. To achieve this goal, existing methods rely on traditional meshing algorithms; while promising, they suffer from a loss of the precision learned in the implicit surface networks, due to the use of discrete space sampling in marching cubes. Given that an MLP with Rectified Linear Unit (ReLU) activations partitions its input space into a number of linear regions, we are motivated to connect this local linearity with the same property of the desired polygon mesh. More specifically, we identify from the linear regions, partitioned by an MLP-based implicit function, the analytic cells and analytic faces that are associated with the function's zero-level isosurface. We prove that under mild conditions, the identified analytic faces are guaranteed to connect and form a closed, piecewise planar surface. Based on this theorem, we propose an algorithm of analytic marching, which marches among analytic cells to exactly recover the mesh captured by an implicit surface network. We also show that our theory and algorithm are equally applicable to advanced MLPs with shortcut connections and max pooling. Given the parallel nature of analytic marching, we contribute AnalyticMesh, a software package that supports efficient meshing of implicit surface networks via CUDA parallel computing, and mesh simplification for efficient downstream processing. We apply our method to different settings of generative shape modeling using implicit surface networks. Extensive experiments demonstrate our advantages over existing methods in terms of both meshing accuracy and efficiency.

【5】 RSG: A Simple but Effective Module for Learning Imbalanced Datasets

Authors: Jianfeng Wang, Thomas Lukasiewicz, Xiaolin Hu, Jianfei Cai, Zhenghua Xu Affiliations: University of Oxford, Tsinghua University, Monash University, Hebei University of Technology Comments: To appear at CVPR 2021. We propose a flexible data generation/data augmentation module for long-tailed classification. Codes are available at: this https URL Link: https://arxiv.org/abs/2106.09859 Abstract: Imbalanced datasets widely exist in practice and are a great challenge for training deep neural models with a good generalization on infrequent classes. In this work, we propose a new rare-class sample generator (RSG) to solve this problem. RSG aims to generate some new samples for rare classes during training, and it has in particular the following advantages: (1) it is convenient to use and highly versatile, because it can be easily integrated into any kind of convolutional neural network, and it works well when combined with different loss functions, and (2) it is only used during the training phase, and therefore, no additional burden is imposed on deep neural networks during the testing phase. In extensive experimental evaluations, we verify the effectiveness of RSG. Furthermore, by leveraging RSG, we obtain competitive results on Imbalanced CIFAR and new state-of-the-art results on Places-LT, ImageNet-LT, and iNaturalist 2018. The source code is available at https://github.com/Jianf-Wang/RSG.

【6】 Effective Model Sparsification by Scheduled Grow-and-Prune Methods

Authors: Xiaolong Ma, Minghai Qin, Fei Sun, Zejiang Hou, Kun Yuan, Yi Xu, Yanzhi Wang, Yen-Kuang Chen, Rong Jin, Yuan Xie Affiliations: DAMO Academy, Alibaba Group; Northeastern University; Princeton University Link: https://arxiv.org/abs/2106.09857 Abstract: Deep neural networks (DNNs) are effective in solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy) but their excessive computation results in long training and inference time. Model sparsification can reduce the computation and memory cost while maintaining model quality. Most existing sparsification algorithms unidirectionally remove weights, while others randomly or greedily explore a small subset of weights in each layer. The inefficiency of these algorithms reduces the achievable sparsity level. In addition, many algorithms still require pre-trained dense models and thus suffer from a large memory footprint and long training time. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology that does not require pre-training dense models. It addresses the shortcomings of previous works by repeatedly growing a subset of layers to dense and then pruning back to sparse after some training. Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, object detection, 3D object part segmentation, and translation. They also outperform other state-of-the-art (SOTA) pruning methods, including pruning from pre-trained dense models. As an example, a 90% sparse ResNet-50 obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the SOTA results by 1.5%.
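The grow-and-prune loop described above can be outlined as follows. This is a rough sketch under simplifying assumptions: each layer is a flat list of weights, one layer is grown per round in cyclic order, pruning is by weight magnitude, and `toy_train_step` is a hypothetical stand-in for real training. The paper's partitioning and schedule may differ.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero the smallest-magnitude entries so that about `sparsity` of them are zero."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def grow_and_prune(layers, sparsity, train_step, rounds):
    """Cyclically let one layer become dense, train, then prune it back."""
    names = list(layers)
    # Start from a sparse model: no dense pre-training.
    for name in names:
        layers[name] = prune_by_magnitude(layers[name], sparsity)
    for r in range(rounds):
        grown = names[r % len(names)]   # layer allowed to be dense this round
        train_step(layers, grown)       # may fill in zeros of `grown`
        layers[grown] = prune_by_magnitude(layers[grown], sparsity)
    return layers


def toy_train_step(layers, grown):
    """Hypothetical update: nudges every weight of the grown layer."""
    layers[grown] = [w + 0.01 * (i + 1) for i, w in enumerate(layers[grown])]


model = {
    "conv1": [0.5, -0.1, 0.05, 0.9, -0.3, 0.02, 0.7, -0.6, 0.01, 0.4],
    "fc": [0.2, 0.8, -0.05, 0.03, -0.9, 0.6, 0.07, -0.4, 0.1, 0.3],
}
sparse_model = grow_and_prune(model, sparsity=0.8, train_step=toy_train_step, rounds=4)
```

After the last round every layer is back at the target sparsity, while each layer has had a dense phase in which previously pruned weights could regrow.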

【7】 PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python

Authors: Haiping Lu, Xianyuan Liu, Robert Turner, Peizhen Bai, Raivo E Koot, Shuo Zhou, Mustafa Chasmai, Lawrence Schobs Affiliations: The University of Sheffield, Sheffield, United Kingdom; Indian Institute of Technology, Delhi, New Delhi, India Comments: This library is available at this https URL Link: https://arxiv.org/abs/2106.09756 Abstract: Machine learning is a general-purpose technology holding promise for many interdisciplinary research problems. However, significant barriers exist in crossing disciplinary boundaries when most machine learning tools are developed in different areas separately. We present PyKale, a Python library for knowledge-aware machine learning on graphs, images, texts, and videos to enable and accelerate interdisciplinary research. We formulate new green machine learning guidelines based on standard software engineering practices and propose a novel pipeline-based application programming interface (API). PyKale focuses on leveraging knowledge from multiple sources for accurate and interpretable prediction, thus supporting multimodal learning and transfer learning (particularly domain adaptation) with the latest deep learning and dimensionality reduction models. We build PyKale on PyTorch and leverage the rich PyTorch ecosystem. Our pipeline-based API design enforces standardization and minimalism, embracing green machine learning concepts via reducing repetitions and redundancy, reusing existing resources, and recycling learning models across areas. We demonstrate its interdisciplinary nature via examples in bioinformatics, knowledge graph, image/video recognition, and medical imaging.

Other (12 papers)

【1】 VSAC: Efficient and Accurate Estimator for H and F

Authors: Maksym Ivashechkin, Daniel Barath, Jiri Matas Affiliations: Centre for Machine Perception, Czech Technical University in Prague, Czech Republic; Machine Perception Research Laboratory, MTA SZTAKI, Budapest, Hungary Link: https://arxiv.org/abs/2106.10240 Abstract: We present VSAC, a RANSAC-type robust estimator with a number of novelties. It benefits from the introduction of the concept of independent inliers, which significantly improves the efficacy of the dominant plane handling and also allows near error-free rejection of incorrect models, without false positives. The local optimization process and its application are improved so that it is run on average only once. Further technical improvements include adaptive sequential hypothesis verification and efficient model estimation via Gaussian elimination. Experiments on four standard datasets show that VSAC is significantly faster than all its predecessors and runs on average in 1-2 ms on a CPU. It is two orders of magnitude faster than, and yet as precise as, MAGSAC++, the currently most accurate estimator of two-view geometry. In repeated runs on the EVD, HPatches, PhotoTourism, and Kusvod2 datasets, it never failed.
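For readers unfamiliar with the RANSAC family that VSAC belongs to, the basic hypothesize-and-verify loop looks roughly like this. The sketch fits a 2-D line rather than a homography or fundamental matrix, and it includes none of VSAC's contributions (independent inliers, local optimization, sequential verification); the data and parameter values are made up.

```python
import random


def fit_line(p, q):
    """Line through two points as (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0:
        return None  # degenerate sample: identical points
    return a / norm, b / norm, -(a * x1 + b * y1) / norm


def ransac_line(points, threshold, iterations, seed=0):
    """Plain RANSAC: sample a minimal set, fit, count inliers, keep the best."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue
        a, b, c = model
        inliers = [pt for pt in points if abs(a * pt[0] + b * pt[1] + c) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# 20 points on y = 2x + 1 plus 5 gross outliers.
points = [(x, 2 * x + 1) for x in range(20)]
points += [(0, 30), (5, -20), (10, 60), (15, -7), (3, 40)]

model, inliers = ransac_line(points, threshold=0.1, iterations=200)
```

The verification step (counting inliers for every hypothesis) dominates the cost of this naive loop, which is the part that faster estimators try to cut short.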

【2】 Light Pollution Reduction in Nighttime Photography

Authors: Chang Liu, Xiaolin Wu Affiliations: Shanghai Jiao Tong University, McMaster University Comments: 8 pages Link: https://arxiv.org/abs/2106.10046 Abstract: Nighttime photographers are often troubled by light pollution from unwanted artificial lights. Artificial lights, after being scattered by aerosols in the atmosphere, can inundate the starlight and degrade the quality of nighttime images by reducing contrast and dynamic range and causing haze. In this paper we develop a physically-based light pollution reduction (LPR) algorithm that can substantially alleviate the aforementioned degradations of perceptual quality and restore the pristine state of the night sky. The key to the success of the proposed LPR algorithm is an inverse method to estimate the spatial radiance distribution and spectral signature of ground artificial lights. Extensive experiments are carried out to evaluate the efficacy and limitations of the LPR algorithm.

【3】 Accumulative Poisoning Attacks on Real-time Data

Authors: Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, Jun Zhu Affiliations: Department of Computer Science & Technology, Tsinghua University Link: https://arxiv.org/abs/2106.09993 Abstract: Collecting training data from untrusted sources exposes machine learning services to poisoning adversaries, who maliciously manipulate training data to degrade the model accuracy. When training on offline datasets, poisoning adversaries have to inject the poisoned data in advance before training, and the order of feeding these poisoned batches into the model is stochastic. In contrast, practical systems are more commonly trained or fine-tuned on sequentially captured real-time data, in which case poisoning adversaries could dynamically poison each data batch according to the current model state. In this paper, we focus on the real-time setting and propose a new attacking strategy, which associates an accumulative phase with poisoning attacks to secretly (i.e., without affecting accuracy) magnify the destructive effect of a (poisoned) trigger batch. By mimicking online learning and federated learning on CIFAR-10, we show that the model accuracy drops significantly after a single update step on the trigger batch following the accumulative phase. Our work validates that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects, with no need to explore complex techniques.

【4】 Advanced Hough-based method for on-device document localization

Authors: D. V. Tropin, A. M. Ershov, D. P. Nikolaev, V. V. Arlazarov Affiliations: Moscow Institute of Physics and Technology (National Research University), Moscow State University, Institute for Information Transmission Problems of the RAS (Kharkevich Institute) Comments: This is a preprint of the article submitted for publication in the journal "Computer Optics" Link: https://arxiv.org/abs/2106.09987 Abstract: The demand for on-device document recognition systems increases in conjunction with the emergence of stricter privacy and security requirements. In such systems, there is no data transfer from the end device to third-party information processing servers. The response time is vital to the user experience of on-device document recognition. Combined with the unavailability of discrete GPUs, powerful CPUs, or a large RAM capacity on consumer-grade end devices such as smartphones, the time limitations put significant constraints on the computational complexity of the applied algorithms for on-device execution. In this work, we consider document location in an image without prior knowledge of the document content or its internal structure. According to the published works, at least 5 systems offer solutions for on-device document location. All these systems use a location method which can be considered Hough-based. The precision of such systems seems to be lower than that of the state-of-the-art solutions which were not designed to account for the limited computational resources. We propose an advanced Hough-based method. In contrast with other approaches, it accounts for the geometric invariants of the central projection model and combines both edge and color features for document boundary detection. The proposed method achieved the second best result on the SmartDoc dataset in terms of precision, surpassed only by a U-Net-like neural network. When evaluated on the more challenging MIDV-500 dataset, the proposed algorithm achieved the best precision among published methods. Our method retained its applicability to on-device computations.
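
For readers unfamiliar with the term, "Hough-based" location rests on a voting scheme in which every edge pixel votes for all straight lines that pass through it. The sketch below shows only the classical Hough line transform in NumPy; it is not the advanced method proposed in the paper, and the function names, bin counts, and the `img_diag` parameter are illustrative assumptions.

```python
import numpy as np

def hough_lines(edge_points, img_diag, n_theta=180, n_rho=200):
    """Accumulate votes in (rho, theta) space for a list of (x, y) edge points."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rhos = np.linspace(-img_diag, img_diag, n_rho)
    acc = np.zeros((n_rho, n_theta), dtype=np.int64)
    cols = np.arange(n_theta)
    for x, y in edge_points:
        # A point (x, y) lies on every line rho = x*cos(theta) + y*sin(theta).
        r = x * np.cos(thetas) + y * np.sin(thetas)
        # Map each rho value to its nearest accumulator row.
        rows = np.round((r + img_diag) / (2.0 * img_diag) * (n_rho - 1)).astype(int)
        acc[rows, cols] += 1
    return acc, rhos, thetas

def strongest_line(acc, rhos, thetas):
    """Return (rho, theta) of the accumulator cell with the most votes."""
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return rhos[i], thetas[j]

# Usage: 50 edge points on the horizontal line y = 20.
points = [(x, 20) for x in range(50)]
acc, rhos, thetas = hough_lines(points, img_diag=100.0)
rho, theta = strongest_line(acc, rhos, thetas)
```

A document localizer built on this idea would then search the strongest lines for a plausible quadrilateral; how the paper does that, and how it adds color features and projective invariants, is not covered by this sketch.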

【5】 Towards interpreting computer vision based on transformation invariant optimization

Authors: Chen Li, Jinzhe Jiang, Xin Zhang, Tonghuan Zhang, Yaqian Zhao, Dongdong Jiang, RenGang Li Affiliations: State Key Laboratory of High-end Server & Storage Technology, Beijing, China, Shandong Hailiang Information Technology Institutes, Jinan, China, State Key Laboratory of High-End Server & Storage Technology, Jinan, China Comments: 15 pages, 7 figures Link: https://arxiv.org/abs/2106.09982 Abstract: Interpreting how deep neural networks (DNNs) make predictions is a vital field in artificial intelligence, and the lack of such interpretation hinders wide application of DNNs. Visualization of learned representations helps us humans understand the vision of DNNs. In this work, visualized images that activate the neural network toward the target classes are generated by a back-propagation method. Rotation and scaling operations are applied to introduce transformation invariance into the image generating process, which we find significantly improves the visualization effect. Finally, we show some cases in which this method helps us gain insight into neural networks.

【6】 Smoothed Multi-View Subspace Clustering

Authors: Peng Chen, Liang Liu, Zhengrui Ma, Zhao Kang Affiliations: Jiangsu Automation Research Institute, Lianyungang, Jiangsu, China, University of Electronic Science and Technology of China, Chengdu, Sichuan, China, Trusted Cloud Computing and Big Data Key Laboratory of Sichuan Province Comments: Accepted by International Conference on Neural Computing for Advanced Applications 2021 Link: https://arxiv.org/abs/2106.09875 Abstract: In recent years, multi-view subspace clustering has achieved impressive performance due to the exploitation of complementary information across multiple views. However, multi-view data can be very complicated and are not easy to cluster in real-world applications. Most existing methods operate on raw data and may not obtain the optimal solution. In this work, we propose a novel multi-view clustering method named smoothed multi-view subspace clustering (SMVSC) by employing a novel technique, i.e., graph filtering, to obtain a smooth representation for each view, in which similar data points have similar feature values. Specifically, it retains the graph geometric features by applying a low-pass filter. Consequently, it produces a "clustering-friendly" representation and greatly facilitates the downstream clustering task. Extensive experiments on benchmark datasets validate the superiority of our approach. Analysis shows that graph filtering increases the separability of classes.

【7】 Towards Clustering-friendly Representations: Subspace Clustering via Graph Filtering

Authors: Zhengrui Ma, Zhao Kang, Guangchun Luo, Ling Tian Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China, School of Information and Software Comments: Published in ACM Multimedia 2020 Link: https://arxiv.org/abs/2106.09874 Abstract: Finding a suitable data representation for a specific task has been shown to be crucial in many applications. The success of subspace clustering depends on the assumption that the data can be separated into different subspaces. However, this simple assumption does not always hold since the raw data might not be separable into subspaces. To recover the "clustering-friendly" representation and facilitate the subsequent clustering, we propose a graph filtering approach by which a smooth representation is achieved. Specifically, it injects graph similarity into data features by applying a low-pass filter to extract useful data representations for clustering. Extensive experiments on image and document clustering datasets demonstrate that our method improves upon state-of-the-art subspace clustering techniques. In particular, its performance is comparable with that of deep learning methods, which underscores the effectiveness of the simple graph filtering scheme for many real-world applications. An ablation study shows that graph filtering can remove noise, preserve structure in the image, and increase the separability of classes.
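
To make "applying a low-pass filter" concrete, the following is a minimal sketch of one common graph low-pass filter, X̄ = (I − L/2)^k X, where L is the symmetric normalized Laplacian of a similarity graph. The abstract does not state the filter or the graph construction, so the k-nearest-neighbour graph, the filter form, and all function names here are assumptions made for illustration.

```python
import numpy as np

def knn_affinity(X, k=5):
    """Symmetric 0/1 k-nearest-neighbour affinity matrix (an assumed graph choice)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-neighbours
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of the k closest points
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)             # symmetrize

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2}; its eigenvalues lie in [0, 2]."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def low_pass_filter(X, W, order=2):
    """Return (I - L/2)^order @ X, which damps high-frequency graph components."""
    H = np.eye(len(W)) - 0.5 * normalized_laplacian(W)
    X_bar = X.copy()
    for _ in range(order):
        X_bar = H @ X_bar
    return X_bar

# Usage: rows of X are samples, columns are features.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
W = knn_affinity(X, k=5)
X_bar = low_pass_filter(X, W, order=2)
```

Because the filter's frequency response (1 − λ/2)^k shrinks components with large Laplacian eigenvalue λ, neighbouring points end up with more similar feature values; the smoothed `X_bar` would then be passed to a subspace clustering step in place of the raw data.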

【8】 A Distance-based Separability Measure for Internal Cluster Validation

Authors: Shuyue Guan, Murray Loew Affiliations: Department of Biomedical Engineering, The George Washington University, Washington, DC, USA Comments: It is an extended version of the paper: arXiv:2009.01328 Link: https://arxiv.org/abs/2106.09794 Abstract: Evaluating clustering results is a significant part of cluster analysis. Since there are no true class labels for clustering in typical unsupervised learning, many internal cluster validity indices (CVIs), which use predicted labels and data, have been created. Without true labels, designing an effective CVI is as difficult as creating a clustering method. It is crucial to have more CVIs because there are no universal CVIs that can be used to measure all datasets and no specific methods of selecting a proper CVI for clusters without true labels. Therefore, applying a variety of CVIs to evaluate clustering results is necessary. In this paper, we propose a novel internal CVI, the Distance-based Separability Index (DSI), based on a data separability measure. We compared the DSI with eight internal CVIs, ranging from the early Dunn index (1974) to the recent CVDD (2019), and with an external CVI as ground truth, using the clustering results of five clustering algorithms on 12 real and 97 synthetic datasets. Results show that DSI is an effective and unique CVI that is competitive with the other compared CVIs. We also summarized the general process to evaluate CVIs and created the rank-difference metric for comparison of CVIs' results.

【9】 Discovering Relationships between Object Categories via Universal Canonical Maps

Authors: Natalia Neverova, Artsiom Sanakoyeu, Patrick Labatut, David Novotny, Andrea Vedaldi Affiliations: Facebook AI Research Comments: Accepted at CVPR 2021 Link: https://arxiv.org/abs/2106.09758 Abstract: We tackle the problem of learning the geometry of multiple categories of deformable objects jointly. Recent work has shown that it is possible to learn a unified dense pose predictor for several categories of related objects. However, training such models requires initializing inter-category correspondences by hand. This is suboptimal and the resulting models fail to maintain correct correspondences as individual categories are learned. In this paper, we show that improved correspondences can be learned automatically as a natural byproduct of learning category-specific dense pose predictors. To do this, we express correspondences between different categories and between images and categories using a unified embedding. Then, we use the latter to enforce two constraints: symmetric inter-category cycle consistency and a new asymmetric image-to-category cycle consistency. Without any manual annotations for the inter-category correspondences, we obtain state-of-the-art alignment results, outperforming dedicated methods for matching 3D shapes. Moreover, the new model is also better at the task of dense pose prediction than prior work.

【10】 DeepLab2: A TensorFlow Library for Deep Labeling

Authors: Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D. Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, Laura Leal-Taixé, Alan L. Yuille, Florian Schroff, Hartwig Adam, Liang-Chieh Chen Affiliations: Technical University Munich, Johns Hopkins University, KAIST, Google Research Comments: 4-page technical report. The first three authors contributed equally to this work Link: https://arxiv.org/abs/2106.09748 Abstract: DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision. DeepLab2 includes all our recently developed DeepLab model variants with pretrained checkpoints as well as model training and evaluation code, allowing the community to reproduce and further improve upon the state-of-the-art systems. To showcase the effectiveness of DeepLab2, our Panoptic-DeepLab employing Axial-SWideRNet as network backbone achieves 68.0% PQ or 83.5% mIoU on the Cityscapes validation set, with only single-scale inference and ImageNet-1K pretrained checkpoints. We hope that publicly sharing our library could facilitate future research on dense pixel labeling tasks and envision new applications of this technology. Code is made publicly available at https://github.com/google-research/deeplab2.

【11】 Debiased Subjective Assessment of Real-World Image Enhancement

Authors: Peibei Cao, Zhangyang Wang, Kede Ma Affiliations: City University of Hong Kong, University of Texas at Austin Link: https://arxiv.org/abs/2106.10080 Abstract: In real-world image enhancement, it is often challenging (if not impossible) to acquire ground-truth data, preventing the adoption of distance metrics for objective quality assessment. As a result, one often resorts to subjective quality assessment, the most straightforward and reliable means of evaluating image enhancement. Conventional subjective testing requires manually pre-selecting a small set of visual examples, which may suffer from three sources of bias: 1) sampling bias due to the extremely sparse distribution of the selected samples in the image space; 2) algorithmic bias due to potential overfitting to the selected samples; 3) subjective bias due to potential further cherry-picking of test results. This eventually makes the field of real-world image enhancement more of an art than a science. Here we take steps towards debiasing conventional subjective assessment by automatically sampling a set of adaptive and diverse images for subsequent testing. This is achieved by casting sample selection as a joint maximization of the discrepancy between the enhancers and the diversity among the selected input images. Careful visual inspection of the resulting enhanced images provides a debiased ranking of the enhancement algorithms. We demonstrate our subjective assessment method using three popular and practically demanding image enhancement tasks: dehazing, super-resolution, and low-light enhancement.

【12】 AI-Enabled Ultra-Low-Dose CT Reconstruction

Authors: Weiwen Wu, Chuang Niu, Shadi Ebrahimian, Hengyong Yu, Mannu Kalra, Ge Wang Affiliations: Biomedical Imaging Center, Center for Biotechnology and Interdisciplinary Studies, Department of Biomedical Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston Comments: 19 pages, 10 figures, 1 table, 44 references Link: https://arxiv.org/abs/2106.09834 Abstract: By the ALARA (As Low As Reasonably Achievable) principle, ultra-low-dose CT reconstruction is a holy grail for minimizing cancer risks and genetic damage, especially for children. With the development of medical CT technologies, iterative algorithms are widely used to reconstruct decent CT images from a low-dose scan. Recently, artificial intelligence (AI) techniques have shown great promise in further reducing CT radiation dose to the next level. In this paper, we demonstrate that AI-powered CT reconstruction offers diagnostic image quality at an ultra-low-dose level comparable to that of radiography. Specifically, we develop a Split Unrolled Grid-like Alternative Reconstruction (SUGAR) network, in which deep learning, physical modeling and an image prior are integrated. The reconstruction results from clinical datasets show that excellent images can be reconstructed using SUGAR from 36 projections. This approach has the potential to change future healthcare.

Originally published: 2021-06-21. In case of infringement, contact cloudcommunity@tencent.com for removal.