
Computer Vision arXiv Daily Digest [6.18]

Author: arXiv Daily Academic Digest (WeChat official account)
Published 2021-07-02 19:05:46

Visit www.arxivdaily.com for digests with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, favorites, posting, and more. Click "Read the original" to visit.

cs.CV: 66 papers today

Transformers (5 papers)

【1】 XCiT: Cross-Covariance Image Transformers

Authors: Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou
Affiliations: Facebook AI, Inria, Sorbonne University
Link: https://arxiv.org/abs/2106.09681
Abstract: Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
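To make the channel-wise attention concrete, here is a minimal single-head PyTorch sketch of cross-covariance attention as described in the abstract; multi-head splitting and the surrounding patch-interaction blocks are omitted, so this is an illustration rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCA(nn.Module):
    """Single-head sketch of cross-covariance attention: the d x d
    attention map is built from the cross-covariance of keys and
    queries, so the cost is linear in the number of tokens N."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.ones(1))  # learned scale

    def forward(self, x):                            # x: (B, N, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.normalize(q.transpose(1, 2), dim=-1)   # (B, d, N), unit norm over tokens
        k = F.normalize(k.transpose(1, 2), dim=-1)
        v = v.transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, d, d)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2)             # back to (B, N, d)
        return self.proj(out)
```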

【2】 Semi-Autoregressive Transformer for Image Captioning

Authors: Yuanen Zhou, Yong Zhang, Zhenzhen Hu, Meng Wang
Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology; Tencent AI Lab; Key Laboratory of Knowledge Engineering with Big Data (Ministry of Education)
Link: https://arxiv.org/abs/2106.09436
Abstract: Current state-of-the-art image captioning models adopt autoregressive decoders, i.e. they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. To tackle this issue, non-autoregressive image captioning models have recently been proposed to significantly accelerate the speed of inference by generating all words in parallel. However, these non-autoregressive models inevitably suffer from large generation quality degradation since they remove word dependencies excessively. To make a better trade-off between speed and quality, we introduce a semi-autoregressive model for image captioning (dubbed SATIC), which keeps the autoregressive property globally but generates words in parallel locally. Based on the Transformer, only a few modifications are needed to implement SATIC. Extensive experiments on the MSCOCO image captioning benchmark show that SATIC can achieve a better trade-off without bells and whistles. Code is available at https://github.com/YuanEZhou/satic.
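As a rough illustration of "autoregressive across groups, parallel within a group", the following greedy decoding loop emits a fixed-size group of words per step; the `decoder(tokens, memory)` interface returning logits for the next group is a hypothetical stand-in, not the paper's actual API:

```python
import torch

@torch.no_grad()
def semi_autoregressive_decode(decoder, memory, bos_id, eos_id,
                               max_len=20, group_size=3):
    """Greedy group-wise decoding: each step conditions on all previously
    generated words and predicts `group_size` new words at once."""
    tokens = torch.full((memory.size(0), 1), bos_id, dtype=torch.long)
    while tokens.size(1) < max_len:
        logits = decoder(tokens, memory)   # (B, group_size, vocab), assumed
        group = logits.argmax(dim=-1)      # whole group decoded in parallel
        tokens = torch.cat([tokens, group], dim=1)
        if (tokens == eos_id).any(dim=1).all():  # every caption finished
            break
    return tokens
```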

【3】 THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers

Authors: Mihai Zanfir, Andrei Zanfir, Eduard Gabriel Bazavan, William T. Freeman, Rahul Sukthankar, Cristian Sminchisescu
Affiliations: Google Research
Link: https://arxiv.org/abs/2106.09336
Abstract: We present THUNDR, a transformer-based deep neural network methodology to reconstruct the 3d pose and shape of people, given monocular RGB images. Key to our methodology is an intermediate 3d marker representation, where we aim to combine the predictive power of model-free-output architectures and the regularizing, anthropometrically-preserving properties of a statistical human surface model like GHUM -- a recently introduced, expressive full body statistical 3d human model, trained end-to-end. Our novel transformer-based prediction pipeline can focus on image regions relevant to the task, supports self-supervised regimes, and ensures that solutions are consistent with human anthropometry. We show state-of-the-art results on Human3.6M and 3DPW, for both the fully-supervised and the self-supervised models, for the task of inferring 3d human shape, joint positions, and global translation. Moreover, we observe very solid 3d reconstruction performance for difficult human poses collected in the wild.

【4】 Long-Short Temporal Contrastive Learning of Video Transformers

Authors: Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani
Affiliations: Facebook AI, Dartmouth College
Notes: Technical report
Link: https://arxiv.org/abs/2106.09212
Abstract: Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.
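The core objective is a contrastive match between two temporal views of the same video; a minimal InfoNCE-style sketch, assuming paired (B, D) projections from a short and a long clip of each video and ignoring the momentum-encoder machinery of the MoCo v3/BYOL/SimSiam variants the paper actually evaluates:

```python
import torch
import torch.nn.functional as F

def long_short_contrastive_loss(z_short, z_long, temperature=0.1):
    """InfoNCE sketch: the long-clip embedding of the *same* video is the
    positive (diagonal); other videos in the batch act as negatives.
    z_short, z_long: (B, D) projections of the two temporal views."""
    z_short = F.normalize(z_short, dim=-1)
    z_long = F.normalize(z_long, dim=-1)
    logits = z_short @ z_long.t() / temperature      # (B, B) similarities
    targets = torch.arange(z_short.size(0), device=z_short.device)
    return F.cross_entropy(logits, targets)
```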

【5】 Probing Image-Language Transformers for Verb Understanding

Authors: Lisa Anne Hendricks, Aida Nematzadeh
Affiliations: DeepMind
Link: https://arxiv.org/abs/2106.09141
Abstract: Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations -- in particular, if these models can distinguish different types of verbs or if they rely solely on nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs (in English) consisting of 421 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate pretrained image-language transformers and find that they fail more in situations that require verb understanding compared to other parts of speech. We also investigate which categories of verbs are particularly challenging.

Detection (3 papers)

【1】 Dynamic Knowledge Distillation with a Single Stream Structure for RGB-D Salient Object Detection

Authors: Guangyu Ren, Tania Stathaki
Link: https://arxiv.org/abs/2106.09517
Abstract: RGB-D salient object detection (SOD) demonstrates its superiority in detecting in complex environments due to the additional depth information introduced in the data. Inevitably, an independent stream is introduced to extract features from depth images, leading to extra computation and parameters. This methodology, which sacrifices model size to improve detection accuracy, may impede the practical application of SOD problems. To tackle this dilemma, we propose a dynamic distillation method along with a lightweight framework, which significantly reduces the parameters. This method considers the factors of both teacher and student performance within the training stage and dynamically assigns the distillation weight instead of applying a fixed weight on the student model. Extensive experiments are conducted on five public datasets to demonstrate that our method can achieve competitive performance compared to 10 prior methods through a 78.2MB lightweight structure.
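The abstract only states that the distillation weight is assigned dynamically from teacher and student performance at each training step; the concrete rule below (trust the teacher more when its current error is lower) is a hedged illustration of the idea for saliency-map logits, not the published formula:

```python
import torch
import torch.nn.functional as F

def dynamic_distillation_loss(student_map, teacher_map, gt):
    """Illustrative dynamically weighted distillation.
    student_map, teacher_map: (B, 1, H, W) saliency logits; gt in [0, 1]."""
    t_err = F.binary_cross_entropy_with_logits(teacher_map, gt).detach()
    s_err = F.binary_cross_entropy_with_logits(student_map, gt).detach()
    w = torch.sigmoid(s_err - t_err)          # in (0, 1), recomputed each step
    kd = F.binary_cross_entropy_with_logits(student_map,
                                            torch.sigmoid(teacher_map))
    task = F.binary_cross_entropy_with_logits(student_map, gt)
    return w * kd + (1 - w) * task            # lean on the better signal
```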

【2】 Wavelet-Packet Powered Deepfake Image Detection

Authors: Moritz Wolter, Felix Blanke, Charles Tapley Hoyt, Jochen Garcke
Affiliations: Fraunhofer Center for Machine Learning and Fraunhofer SCAI; Harvard Medical School; Institute for Numerical Simulation, University of Bonn
Notes: Source code is available at this https URL
Link: https://arxiv.org/abs/2106.09369
Abstract: As neural networks become more able to generate realistic artificial images, they have the potential to improve movies, music, video games and make the internet an even more creative and inspiring place. Yet, at the same time, the latest technology potentially enables new digital ways to lie. In response, the need for a diverse and reliable toolbox arises to identify artificial images and other content. Previous work primarily relies on pixel-space CNNs or the Fourier transform. To the best of our knowledge, wavelet-based GAN analysis and detection methods have been absent thus far. This paper aims to fill this gap and describes a wavelet-based approach to GAN-generated image analysis and detection. We evaluate our method on FFHQ, CelebA, and LSUN source identification problems and find improved or competitive performance.
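A wavelet-packet front end of the kind described is straightforward to prototype with PyWavelets; the wavelet choice and decomposition level below are illustrative, and the resulting sub-band stack would be fed to any downstream classifier:

```python
import numpy as np
import pywt

def wavelet_packet_features(image, wavelet="db2", level=3):
    """Decompose a 2D grayscale image into wavelet-packet sub-bands and
    return them as a (num_bands, h, w) feature stack. Wavelet and level
    are assumed hyperparameters, not the paper's exact configuration."""
    wp = pywt.WaveletPacket2D(data=image, wavelet=wavelet,
                              mode="symmetric", maxlevel=level)
    nodes = wp.get_level(level, order="natural")  # all sub-bands at `level`
    return np.stack([node.data for node in nodes])
```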

【3】 The Fishnet Open Images Database: A Dataset for Fish Detection and Fine-Grained Categorization in Fisheries

Authors: Justin Kay, Matt Merrifield
Affiliations: Ai.Fish, The Nature Conservancy
Notes: In 8th Workshop on Fine-Grained Visual Categorization at CVPR 2021
Link: https://arxiv.org/abs/2106.09178
Abstract: Camera-based electronic monitoring (EM) systems are increasingly being deployed onboard commercial fishing vessels to collect essential data for fisheries management and regulation. These systems generate large quantities of video data which must be reviewed on land by human experts. Computer vision can assist this process by automatically detecting and classifying fish species; however, the lack of existing public data in this domain has hindered progress. To address this, we present the Fishnet Open Images Database, a large dataset of EM imagery for fish detection and fine-grained categorization onboard commercial fishing vessels. The dataset consists of 86,029 images containing 34 object classes, making it the largest and most diverse public dataset of fisheries EM imagery to date. It includes many of the characteristic challenges of EM data: visual similarity between species, skewed class distributions, harsh weather conditions, and chaotic crew activity. We evaluate the performance of existing detection and classification algorithms and demonstrate that the dataset can serve as a challenging benchmark for development of computer vision algorithms in fisheries. The dataset is available at https://www.fishnet.ai/.

Classification & Recognition (5 papers)

【1】 IFCNet: A Benchmark Dataset for IFC Entity Classification

Authors: Christoph Emunds, Nicolas Pauen, Veronika Richter, Jérôme Frisch, Christoph van Treeck
Affiliations: Institute of Energy Efficiency and Sustainable Building (E3D), RWTH Aachen University, Germany
Notes: To be presented at EG-ICE 2021
Link: https://arxiv.org/abs/2106.09712
Abstract: Enhancing interoperability and information exchange between domain-specific software products for BIM is an important aspect in the Architecture, Engineering, Construction and Operations industry. Recent research started investigating methods from the areas of machine and deep learning for semantic enrichment of BIM models. However, training and evaluation of these machine learning algorithms requires sufficiently large and comprehensive datasets. This work presents IFCNet, a dataset of single-entity IFC files spanning a broad range of IFC classes containing both geometric and semantic information. Using only the geometric information of objects, the experiments show that three different deep learning models are able to achieve good classification performance.

【2】 AttDLNet: Attention-based DL Network for 3D LiDAR Place Recognition

Authors: Tiago Barros, Luís Garrote, Ricardo Pereira, Cristiano Premebida, Urbano J. Nunes
Affiliations: University of Coimbra, Institute of Systems and Robotics, Department of Electrical and Computer Engineering
Link: https://arxiv.org/abs/2106.09637
Abstract: Deep networks have been progressively adapted to new sensor modalities, namely to 3D LiDAR, which led to unprecedented achievements in autonomous vehicle-related applications such as place recognition. One of the main challenges of deep models in place recognition is to extract efficient and descriptive feature representations that relate places based on their similarity. To address the problem of place recognition using LiDAR data, this paper proposes a novel 3D LiDAR-based deep learning network (named AttDLNet) that comprises an encoder network and exploits an attention mechanism to selectively focus on long-range context and inter-feature relationships. The proposed network is trained and validated on the KITTI dataset, using the cosine loss for training and a retrieval-based place recognition pipeline for validation. Additionally, an ablation study is presented to assess the best network configuration. Results show that the encoder network features are already very descriptive, but adding attention to the network further improves performance. From the ablation study, results indicate that the middle encoder layers have the highest mean performance, while deeper layers are more robust to orientation change. The code is publicly available on the project website: https://github.com/Cybonic/AttDLNet

【3】 Class Balancing GAN with a Classifier in the Loop

Authors: Harsh Rangwani, Konda Reddy Mopuri, R. Venkatesh Babu
Affiliations: Indian Institute of Science, Bengaluru; Indian Institute of Technology Tirupati
Notes: UAI 2021
Link: https://arxiv.org/abs/2106.09402
Abstract: Generative Adversarial Networks (GANs) have swiftly evolved to imitate increasingly complex image distributions. However, the majority of developments focus on the performance of GANs on balanced datasets. We find that the existing GANs and their training regimes, which work well on balanced datasets, fail to be effective in the case of imbalanced (i.e. long-tailed) datasets. In this work we introduce a novel theoretically motivated Class Balancing regularizer for training GANs. Our regularizer makes use of the knowledge from a pre-trained classifier to ensure balanced learning of all the classes in the dataset. This is achieved via modelling the effective class frequency based on the exponential forgetting observed in neural networks and encouraging the GAN to focus on underrepresented classes. We demonstrate the utility of our regularizer in learning representations for long-tailed distributions via achieving better performance than existing approaches over multiple datasets. Specifically, when applied to an unconditional GAN, it improves the FID from 13.03 to 9.01 on the long-tailed iNaturalist-2019 dataset.
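The regularizer hinges on an effective class frequency modelled with exponential forgetting; a small sketch of such a statistic, where the decay rate is an assumed hyperparameter and class counts come from the pre-trained classifier's predictions on generated samples:

```python
import torch

class EffectiveClassFrequency:
    """Exponentially decayed class counts; their inverse can be used to
    up-weight classes the generator currently underrepresents."""
    def __init__(self, num_classes, decay=0.99):
        self.freq = torch.ones(num_classes)
        self.decay = decay

    def update(self, predicted_labels):
        # predicted_labels: 1-D long tensor of classifier predictions
        counts = torch.bincount(predicted_labels, minlength=len(self.freq))
        self.freq = self.decay * self.freq + counts.float()

    def class_weights(self):
        w = 1.0 / self.freq.clamp(min=1e-6)
        return w / w.sum()        # larger weight for rarer classes
```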

【4】 Deep Subdomain Adaptation Network for Image Classification

Authors: Yongchun Zhu, Fuzhen Zhuang, Jindong Wang, Guolin Ke, Jingwu Chen, Jiang Bian, Hui Xiong, Qing He
Affiliations: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Microsoft Research; ByteDance
Notes: Published in TNNLS
Link: https://arxiv.org/abs/2106.09388
Abstract: For a target task where labeled data is unavailable, domain adaptation can transfer a learner from a different source domain. Previous deep domain adaptation methods mainly learn a global domain shift, i.e., align the global source and target distributions without considering the relationships between two subdomains within the same category of different domains, leading to unsatisfying transfer learning performance without capturing the fine-grained information. Recently, more and more researchers pay attention to Subdomain Adaptation, which focuses on accurately aligning the distributions of the relevant subdomains. However, most of these are adversarial methods which contain several loss functions and converge slowly. Based on this, we present Deep Subdomain Adaptation Network (DSAN), which learns a transfer network by aligning the relevant subdomain distributions of domain-specific layer activations across different domains based on a local maximum mean discrepancy (LMMD). Our DSAN is simple yet effective; it does not need adversarial training and converges fast. The adaptation can be achieved easily with most feed-forward network models by extending them with an LMMD loss, which can be trained efficiently via back-propagation. Experiments demonstrate that DSAN can achieve remarkable results on both object recognition tasks and digit classification tasks. Our code will be available at: https://github.com/easezyc/deep-transfer-learning
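A compact sketch of the LMMD term, assuming source features with hard labels and target features with classifier pseudo-probabilities; a single Gaussian kernel stands in for the multi-kernel setup commonly used with MMD, so treat this as an illustration of the per-class weighting idea rather than the paper's exact loss:

```python
import torch

def lmmd(source_feat, target_feat, source_label, target_pseudo,
         num_classes, sigma=1.0):
    """source_feat: (ns, D); target_feat: (nt, D);
    source_label: (ns,) hard labels; target_pseudo: (nt, C) softmax."""
    ws = torch.nn.functional.one_hot(source_label, num_classes).float()
    ws = ws / ws.sum(0).clamp(min=1)              # per-class source weights
    wt = target_pseudo / target_pseudo.sum(0).clamp(min=1e-6)

    def gaussian(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))

    k_ss = gaussian(source_feat, source_feat)
    k_tt = gaussian(target_feat, target_feat)
    k_st = gaussian(source_feat, target_feat)
    loss = 0.0
    for c in range(num_classes):                  # weighted MMD per class
        a, b = ws[:, c:c + 1], wt[:, c:c + 1]
        loss = loss + (a.t() @ k_ss @ a + b.t() @ k_tt @ b
                       - 2 * a.t() @ k_st @ b).squeeze()
    return loss / num_classes
```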

【5】 Automatic Main Character Recognition for Photographic Studies

Authors: Mert Seker, Anssi Männistö, Alexandros Iosifidis, Jenni Raitoharju
Affiliations: Tampere University; Department of Electrical and Computer Engineering, Aarhus University; Finnish Environment Institute
Notes: 6 pages, 4 figures, 2 tables
Link: https://arxiv.org/abs/2106.09064
Abstract: Main characters in images are the most important humans that catch the viewer's attention upon first look, and they are emphasized by properties such as size, position, color saturation, and sharpness of focus. Identifying the main character in images plays an important role in traditional photographic studies and media analysis, but the task is performed manually and can be slow and laborious. Furthermore, the selection of main characters can sometimes be subjective. In this paper, we analyze the feasibility of solving the main character recognition needed for photographic studies automatically and propose a method for identifying the main characters. The proposed method uses machine learning based human pose estimation along with traditional computer vision approaches for this task. We approach the task as a binary classification problem where each detected human is classified either as a main character or not. To evaluate both the subjectivity of the task and the performance of our method, we collected a dataset of 300 varying images from multiple sources and asked five people, a photographic researcher and four other persons, to annotate the main characters. Our analysis showed a relatively high agreement between different annotators. The proposed method achieved a promising F1 score of 0.83 on the full image set and 0.96 on a subset evaluated as most clear and important cases by the photographic researcher.

Segmentation & Semantics (6 papers)

【1】 To fit or not to fit: Model-based Face Reconstruction and Occlusion Segmentation from Weak Supervision

Authors: Chunlu Li, Andreas Morel-Forster, Thomas Vetter, Bernhard Egger, Adam Kortylewski
Affiliations: Massachusetts Institute of Technology
Link: https://arxiv.org/abs/2106.09614
Abstract: 3D face reconstruction from a single image is challenging due to its ill-posed nature. Model-based face autoencoders address this issue effectively by fitting a face model to the target image in a weakly supervised manner. However, in unconstrained environments occlusions distort the face reconstruction because the model often erroneously tries to adapt to occluded face regions. Supervised occlusion segmentation is a viable solution to avoid the fitting of occluded face regions, but it requires a large amount of annotated training data. In this work, we enable model-based face autoencoders to segment occluders accurately without requiring any additional supervision during training, and this separates regions where the model will be fitted from those where it will not be fitted. To achieve this, we extend face autoencoders with a segmentation network. The segmentation network decides which regions the model should adapt to by reaching a balance in a trade-off between including pixels and adapting the model to them, and excluding pixels so that the model fitting is not negatively affected and reaches higher overall reconstruction accuracy on pixels showing the face. This leads to a synergistic effect, in which the occlusion segmentation guides the training of the face autoencoder to constrain the fitting in the non-occluded regions, while the improved fitting enables the segmentation model to better predict the occluded face regions. Qualitative and quantitative experiments on the CelebA-HQ database and the AR database verify the effectiveness of our model in improving 3D face reconstruction under occlusions and in enabling accurate occlusion segmentation from weak supervision only. Code available at https://github.com/unibas-gravis/Occlusion-Robust-MoFA.
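The include/exclude trade-off described above can be written as a masked fitting loss plus an inclusion reward; the weighting and exact terms below are illustrative assumptions about the structure of such an objective, not the paper's formulation:

```python
import torch

def occlusion_aware_fitting_loss(rendered, target, mask_logits, lam=0.1):
    """rendered/target: (B, 3, H, W) images; mask_logits: (B, 1, H, W).
    The predicted mask gates the photometric loss, while the reward for
    keeping pixels in stops the network from excluding everything."""
    mask = torch.sigmoid(mask_logits)                  # 1 = fit this pixel
    photometric = (mask * (rendered - target).abs()).mean()
    exclusion_penalty = lam * (1 - mask).mean()        # assumed weight lam
    return photometric + exclusion_penalty
```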

【2】 Knowledge Distillation from Multi-modal to Mono-modal Segmentation Networks

Authors: Minhao Hu, Matthis Maillard, Ya Zhang, Tommaso Ciceri, Giammarco La Barbera, Isabelle Bloch, Pietro Gori
Affiliations: CMIC, Shanghai Jiao Tong University, Shanghai, China; LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Link: https://arxiv.org/abs/2106.09564
Abstract: The joint use of multiple imaging modalities for medical image segmentation has been widely studied in recent years. The fusion of information from different modalities has been demonstrated to improve the segmentation accuracy, with respect to mono-modal segmentations, in several applications. However, acquiring multiple modalities is usually not possible in a clinical setting due to a limited number of physicians and scanners, and to limit costs and scan time. Most of the time, only one modality is acquired. In this paper, we propose KD-Net, a framework to transfer knowledge from a trained multi-modal network (teacher) to a mono-modal one (student). The proposed method is an adaptation of the generalized distillation framework where the student network is trained on a subset (1 modality) of the teacher's inputs (n modalities). We illustrate the effectiveness of the proposed framework in brain tumor segmentation with the BraTS 2018 dataset. Using different architectures, we show that the student network effectively learns from the teacher and always outperforms the baseline mono-modal network in terms of segmentation accuracy.

【3】 Learning to Associate Every Segment for Video Panoptic Segmentation

Authors: Sanghyun Woo, Dahun Kim, Joon-Young Lee, In So Kweon
Affiliations: KAIST, Adobe Research
Notes: Accepted to CVPR 2021
Link: https://arxiv.org/abs/2106.09453
Abstract: Temporal correspondence - linking pixels or objects across frames - is a fundamental supervisory signal for the video models. For the panoptic understanding of dynamic scenes, we further extend this concept to every segment. Specifically, we aim to learn coarse segment-level matching and fine pixel-level matching together. We implement this idea by designing two novel learning objectives. To validate our proposals, we adopt a deep siamese model and train the model to learn the temporal correspondence on two different levels (i.e., segment and pixel) along with the target task. At inference time, the model processes each frame independently without any extra computation and post-processing. We show that our per-frame inference model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets. Moreover, due to its high efficiency, the model runs in a fraction of the time (3x faster) compared to the previous state-of-the-art approach.

【4】 Trilateral Attention Network for Real-time Medical Image Segmentation

Authors: Ghada Zamzmi, Vandana Sachdev, Sameer Antani
Affiliations: Lister Hill National Center for Biomedical Communications; National Heart, Lung, and Blood Institute, U.S. National Institutes of Health (NIH), Bethesda, MD
Link: https://arxiv.org/abs/2106.09201
Abstract: Accurate segmentation of medical images into anatomically meaningful regions is critical for the extraction of quantitative indices or biomarkers. The common pipeline for segmentation comprises regions of interest detection stage and segmentation stage, which are independent of each other and typically performed using separate deep learning networks. The performance of the segmentation stage highly relies on the extracted set of spatial features and the receptive fields. In this work, we propose an end-to-end network, called Trilateral Attention Network (TaNet), for real-time detection and segmentation in medical images. TaNet has a module for region localization, and three segmentation pathways: 1) handcrafted pathway with hand-designed convolutional kernels, 2) detail pathway with regular convolutional kernels, and 3) a global pathway to enlarge the receptive field. The first two pathways encode rich handcrafted and low-level features extracted by hand-designed and regular kernels while the global pathway encodes high-level context information. By jointly training the network for localization and segmentation using different sets of features, TaNet achieved superior performance, in terms of accuracy and speed, when evaluated on an echocardiography dataset for cardiac segmentation. The code and models will be made publicly available on the TaNet GitHub page.

【5】 Positional Contrastive Learning for Volumetric Medical Image Segmentation

Authors: Dewen Zeng, Yawen Wu, Xinrong Hu, Xiaowei Xu, Haiyun Yuan, Meiping Huang, Jian Zhuang, Jingtong Hu, Yiyu Shi
Affiliations: University of Notre Dame; University of Pittsburgh; Guangdong Provincial People's Hospital
Notes: 8 pages, conference
Link: https://arxiv.org/abs/2106.09157
Abstract: The success of deep learning heavily depends on the availability of large labeled training sets. However, it is hard to get large labeled datasets in the medical image domain because of strict privacy concerns and costly labeling efforts. Contrastive learning, an unsupervised learning technique, has proven powerful in learning image-level representations from unlabeled data. The learned encoder can then be transferred or fine-tuned to improve the performance of downstream tasks with limited labels. A critical step in contrastive learning is the generation of contrastive data pairs, which is relatively simple for natural image classification but quite challenging for medical image segmentation due to the existence of the same tissue or organ across the dataset. As a result, when applied to medical image segmentation, most state-of-the-art contrastive learning frameworks inevitably introduce a lot of false-negative pairs and result in degraded segmentation quality. To address this issue, we propose a novel positional contrastive learning (PCL) framework to generate contrastive data pairs by leveraging the position information in volumetric medical images. Experimental results on CT and MRI datasets demonstrate that the proposed PCL method can substantially improve the segmentation performance compared to existing methods in both the semi-supervised and transfer learning settings.
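A sketch of the positional pairing idea, assuming each 2D slice carries its normalized position (0..1) within its volume; the threshold rule and the values below are illustrative, not the paper's exact construction:

```python
import torch
import torch.nn.functional as F

def positional_contrastive_loss(embeddings, slice_positions,
                                threshold=0.1, temperature=0.1):
    """Slices whose normalized positions differ by less than `threshold`
    are treated as positives; everything else is a negative.
    embeddings: (B, D); slice_positions: (B,) floats in [0, 1]."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature
    pos_mask = (slice_positions[:, None]
                - slice_positions[None, :]).abs() < threshold
    pos_mask.fill_diagonal_(False)        # a slice is not its own pair
    not_self = 1 - torch.eye(len(z), device=z.device)
    log_prob = sim - (sim.exp() * not_self).sum(dim=1, keepdim=True).log()
    return -(log_prob * pos_mask).sum() / pos_mask.sum().clamp(min=1)
```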

【6】 Automatic Segmentation of the Prostate on 3D Trans-rectal Ultrasound Images using Statistical Shape Models and Convolutional Neural Networks

Authors: Golnoosh Samei, Davood Karimi, Claudia Kesch, Septimiu Salcudean
Affiliations: The University of British Columbia
Link: https://arxiv.org/abs/2106.09662
Abstract: In this work we propose to segment the prostate on a challenging dataset of trans-rectal ultrasound (TRUS) images using convolutional neural networks (CNNs) and statistical shape models (SSMs). TRUS is commonly used for a number of image-guided interventions on the prostate. Fast and accurate segmentation of the organ in these images is crucial to planning and fusion with other modalities such as magnetic resonance images (MRIs). However, TRUS has limited soft-tissue contrast and signal-to-noise ratio, which makes the task of segmenting the prostate challenging and subject to inter-observer and intra-observer variability. This is especially problematic at the base and apex, where the gland boundary is hard to define. In this paper, we aim to tackle this problem by taking advantage of shape priors learnt on an MR dataset, which has higher soft-tissue contrast, allowing the prostate to be contoured more accurately. We use this shape prior in combination with a prostate tissue probability map computed by a CNN for segmentation.

Zero/Few-Shot Learning, Transfer & Domain Adaptation (6 papers)

【1】 JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting

Authors: Ron Mokady, Rotem Tzaban, Sagie Benaim, Amit H. Bermano, Daniel Cohen-Or
Affiliations: The Blavatnik School of Computer Science, Tel Aviv University
Link: https://arxiv.org/abs/2106.09679
Abstract: The task of unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks. While early works concentrated on specific object priors such as a human face or body, recent work considered the unsupervised case. When the source and target videos, however, are of different shapes, current methods fail. To alleviate this problem, we introduce JOKR - a JOint Keypoint Representation that captures the motion common to both the source and target videos, without requiring any object prior or data collection. By employing a domain confusion term, we enforce the unsupervised keypoint representations of both videos to be indistinguishable. This encourages disentanglement between the parts of the motion that are common to the two domains, and their distinctive appearance and motion, enabling the generation of videos that capture the motion of the one while depicting the style of the other. To enable cases where the objects are of different proportions or orientations, we apply a learned affine transformation between the JOKRs. This augments the representation to be affine invariant, and in practice broadens the variety of possible retargeting pairs. This geometry-driven representation enables further intuitive control, such as temporal coherence and manual editing. Through comprehensive experimentation, we demonstrate the applicability of our method to different challenging cross-domain video pairs. We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans. We also demonstrate superior temporal coherency and visual quality compared to state-of-the-art alternatives, through statistical metrics and a user study. Source code and videos can be found at https://rmokady.github.io/JOKR/.

【2】 SECANT: Self-Expert Cloning for Zero-Shot Generalization of Visual Policies

Authors: Linxi Fan, Guanzhi Wang, De-An Huang, Zhiding Yu, Li Fei-Fei, Yuke Zhu, Anima Anandkumar
Affiliations: The University of Texas at Austin; California Institute of Technology
Notes: ICML 2021. Website: https://linxifan.github.io/secant-site/
Link: https://arxiv.org/abs/2106.09678
Abstract: Generalization has been a long-standing challenge for reinforcement learning (RL). Visual RL, in particular, can be easily distracted by irrelevant factors in high-dimensional observation space. In this work, we consider robust policy learning which targets zero-shot generalization to unseen visual environments with large distributional shift. We propose SECANT, a novel self-expert cloning technique that leverages image augmentation in two stages to decouple robust representation learning from policy optimization. Specifically, an expert policy is first trained by RL from scratch with weak augmentations. A student network then learns to mimic the expert policy by supervised learning with strong augmentations, making its representation more robust against visual variations compared to the expert. Extensive experiments demonstrate that SECANT significantly advances the state of the art in zero-shot generalization across 4 challenging domains. Our average reward improvements over prior SOTAs are: DeepMind Control (+26.5%), robotic manipulation (+337.8%), vision-based autonomous driving (+47.7%), and indoor object navigation (+15.8%). Code release and video are available at https://linxifan.github.io/secant-site/.
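Stage two of this recipe (the student imitating the frozen expert under stronger augmentation) reduces to a simple supervised step; the callables and the MSE imitation loss below are placeholders for whatever augmentations, action space, and loss the actual implementation uses:

```python
import torch
import torch.nn.functional as F

def secant_student_step(student, expert, obs, strong_aug, weak_aug,
                        optimizer):
    """One illustrative imitation step: the frozen expert labels the
    observation from an easy (weakly augmented) view, and the student
    must reproduce that action from a hard (strongly augmented) view."""
    with torch.no_grad():
        target_action = expert(weak_aug(obs))
    pred_action = student(strong_aug(obs))
    loss = F.mse_loss(pred_action, target_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```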

【3】 Improving On-Screen Sound Separation for Open Domain Videos with Audio-Visual Self-attention

Authors: Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey
Affiliations: UIUC, Google Research
Link: https://arxiv.org/abs/2106.09669
Abstract: We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous work on audiovisual on-screen sound separation, including the simplicity and coarse resolution of spatio-temporal attention, and poor convergence of the audio separation model. Our proposed model addresses these issues using cross-modal and self-attention modules that capture audio-visual dependencies at a finer resolution over time, and by unsupervised pre-training of the audio separation model. These improvements allow the model to generalize to a much wider set of unseen videos. For evaluation and semi-supervised training, we collected human annotations of on-screen audio from a large database of in-the-wild videos (YFCC100M). Our results show marked improvements in on-screen separation performance, in more general conditions than previous methods.

【4】 Transductive Few-Shot Learning: Clustering is All You Need?

Authors: Imtiaz Masud Ziko, Malik Boudiaf, Jose Dolz, Eric Granger, Ismail Ben Ayed
Link: https://arxiv.org/abs/2106.09516
Abstract: We investigate a general formulation for clustering and transductive few-shot learning, which integrates prototype-based objectives, Laplacian regularization and supervision constraints from a few labeled data points. We propose a concave-convex relaxation of the problem, and derive a computationally efficient block-coordinate bound optimizer, with convergence guarantee. At each iteration, our optimizer computes independent (parallel) updates for each point-to-cluster assignment. Therefore, it could be trivially distributed for large-scale clustering and few-shot tasks. Furthermore, we provide a thorough convergence analysis based on point-to-set maps. We report comprehensive clustering and few-shot learning experiments over various data sets, showing that our method yields competitive performance, in terms of accuracy and optimization quality, while scaling up to large problems. Using standard training on the base classes, without resorting to complex meta-learning and episodic-training strategies, our approach outperforms state-of-the-art few-shot methods by significant margins, across various models, settings and data sets. Surprisingly, we found that even standard clustering procedures (e.g., K-means), which correspond to particular, non-regularized cases of our general model, already achieve competitive performances in comparison to the state-of-the-art in few-shot learning. These surprising results point to the limitations of the current few-shot benchmarks, and question the viability of a large body of convoluted few-shot learning techniques in the recent literature.
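In the spirit of the paper's observation that plain clustering is already competitive, here is a minimal transductive baseline: prototypes seeded from the labeled support set and refined by soft K-means over the unlabeled query set. The Laplacian regularizer and the exact bound optimizer of the full method are omitted, so this is a simplified sketch:

```python
import torch

def transductive_prototypes(support, support_y, query, num_classes,
                            steps=10):
    """support: (S, D); support_y: (S,) labels; query: (Q, D).
    Returns refined class prototypes and soft query assignments."""
    protos = torch.stack([support[support_y == c].mean(0)
                          for c in range(num_classes)])
    for _ in range(steps):
        logits = -torch.cdist(query, protos)      # negative distances
        assign = logits.softmax(dim=1)            # (Q, C) soft labels
        for c in range(num_classes):              # labeled points stay fixed
            w = assign[:, c:c + 1]
            protos[c] = (support[support_y == c].sum(0)
                         + (w * query).sum(0)) \
                        / (support_y.eq(c).sum() + w.sum())
    return protos, assign
```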

【5】 Episode Adaptive Embedding Networks for Few-shot Learning

Authors: Fangbing Liu, Qing Wang
Affiliations: Australian National University
Link: https://arxiv.org/abs/2106.09398
Abstract: Few-shot learning aims to learn a classifier using a few labelled instances for each class. Metric-learning approaches for few-shot learning embed instances into a high-dimensional space and conduct classification based on distances among instance embeddings. However, such instance embeddings are usually shared across all episodes and thus lack the discriminative power to generalize classifiers according to episode-specific features. In this paper, we propose a novel approach, namely Episode Adaptive Embedding Network (EAEN), to learn episode-specific embeddings of instances. By leveraging the probability distributions of all instances in an episode at each channel-pixel embedding dimension, EAEN can not only alleviate the overfitting issue encountered in few-shot learning tasks, but also capture discriminative features specific to an episode. To empirically verify the effectiveness and robustness of EAEN, we have conducted extensive experiments on three widely used benchmark datasets, under various combinations of different generic embedding backbones and different classifiers. The results show that EAEN significantly improves classification accuracy, by about 10% to 20% in different settings, over the state-of-the-art methods.

【6】 Deep Contrastive Graph Representation via Adaptive Homotopy Learning

Authors: Rui Zhang, Chengjun Lu, Ziheng Jiao, Xuelong Li
Affiliations: Department of Computer Science, Northwestern Polytechnical University
Notes: 9 pages, 4 figures
Link: https://arxiv.org/abs/2106.09244
Abstract: The homotopy model is an excellent tool exploited by diverse research works in the field of machine learning. However, its flexibility is limited due to lack of adaptiveness, i.e., manual fixing or tuning of the appropriate homotopy coefficients. To address the problem above, we propose a novel adaptive homotopy framework (AH) in which the Maclaurin duality is employed, such that the homotopy parameters can be adaptively obtained. Accordingly, the proposed AH can be widely utilized to enhance homotopy-based algorithms. In particular, in this paper, we apply AH to contrastive learning (AHCL) such that it can be effectively transferred from weak-supervised learning (given label priors) to unsupervised learning, where soft labels of contrastive learning are directly and adaptively learned. Accordingly, AHCL has the adaptive ability to extract deep features without any sort of prior information. Consequently, the affinity matrix formulated by the related adaptive labels can be constructed as the deep Laplacian graph that incorporates the topology of deep representations for the inputs. Eventually, extensive experiments on benchmark datasets validate the superiority of our method.

Semi-/Weakly-/Unsupervised Learning, Active Learning & Uncertainty (8 papers)

【1】 MoDist: Motion Distillation for Self-supervised Video Representation Learning

Authors: Fanyi Xiao, Joseph Tighe, Davide Modolo
Affiliations: Amazon AI
Link: https://arxiv.org/abs/2106.09703
Abstract: We present MoDist as a novel method to explicitly distill motion information into self-supervised video representations. Compared to previous video representation learning methods that mostly focus on learning motion cues implicitly from RGB inputs, we show that the representation learned with our MoDist method focuses more on foreground motion regions and thus generalizes better to downstream tasks. To achieve this, MoDist enriches standard contrastive learning objectives for RGB video clips with a cross-modal learning objective between a Motion pathway and a Visual pathway. We evaluate MoDist on several datasets for both action recognition (UCF101/HMDB51/SSv2) as well as action detection (AVA), and demonstrate state-of-the-art self-supervised performance on all datasets. Furthermore, we show that the MoDist representation can be as effective as (in some cases even better than) representations learned with full supervision. Given its simplicity, we hope MoDist could serve as a strong baseline for future research in self-supervised video representation learning.

【2】 Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial Networks with Self-Attention

Authors: Matthias Springstein, Eric Müller-Budack, Ralph Ewerth
Affiliations: TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany; L3S Research Center, Leibniz University Hannover
Notes: Accepted for publication in: ACM International Conference on Multimedia Retrieval (ICMR) Workshop 2021
Link: https://arxiv.org/abs/2106.09432
Abstract: The recognition of handwritten mathematical expressions in images and video frames remains a difficult and unsolved problem. Deep convolutional neural networks are basically a promising approach, but typically require a large amount of labeled training data. However, such a large training dataset does not exist for the task of handwritten formula recognition. In this paper, we introduce a system that creates a large set of synthesized training examples of mathematical expressions which are derived from LaTeX documents. For this purpose, we propose a novel attention-based generative adversarial network to translate rendered equations to handwritten formulas. The datasets generated by this approach contain hundreds of thousands of formulas, making it ideal for pretraining or the design of more complex models. We evaluate our synthesized dataset and the recognition approach on the CROHME 2014 benchmark dataset. Experimental results demonstrate the feasibility of the approach.

【3】 NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go

Authors: Marvin Eisenberger, David Novotny, Gael Kerchenbaum, Patrick Labatut, Natalia Neverova, Daniel Cremers, Andrea Vedaldi
Affiliations: Facebook AI Research, Technical University of Munich
Notes: Published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021
Link: https://arxiv.org/abs/2106.09431
Abstract: We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes and produces in one go, i.e. in a single feed forward pass, a smooth interpolation and point-to-point correspondences between them. The interpolation, expressed as a deformation field, changes the pose of the source shape to resemble the target, but leaves the object identity unchanged. NeuroMorph uses an elegant architecture combining graph convolutions with global feature pooling to extract local features. During training, the model is incentivized to create realistic deformations by approximating geodesics on the underlying shape space manifold. This strong geometric prior allows us to train our model end-to-end and in a fully unsupervised manner without requiring any manual correspondence annotations. NeuroMorph works well for a large variety of input shapes, including non-isometric pairs from different object categories. It obtains state-of-the-art results for both shape correspondence and interpolation tasks, matching or surpassing the performance of recent unsupervised and supervised methods on multiple benchmarks.

【4】 An Evaluation of Self-Supervised Pre-Training for Skin-Lesion Analysis

Authors: Levy Chaves, Alceu Bissoto, Eduardo Valle, Sandra Avila
Affiliations: Institute of Computing (IC) and School of Electrical and Computing Engineering (FEEC), RECOD Lab., University of Campinas (UNICAMP), Brazil
Notes: 12 pages, 2 figures
Link: https://arxiv.org/abs/2106.09229
Abstract: Self-supervised pre-training appears as an advantageous alternative to supervised pre-training for transfer learning. By synthesizing annotations on pretext tasks, self-supervision allows to pre-train models on large amounts of pseudo-labels before fine-tuning them on the target task. In this work, we assess self-supervision for the diagnosis of skin lesions, comparing three self-supervised pipelines to a challenging supervised baseline, on five test datasets comprising in- and out-of-distribution samples. Our results show that self-supervision is competitive both in improving accuracies and in reducing the variability of outcomes. Self-supervision proves particularly useful for low training data scenarios (<1,500 and <150 samples), where its ability to stabilize the outcomes is essential to provide sound results.

【5】 LiRA: Learning Visual Speech Representations from Audio through Self-supervision

Authors: Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic
Affiliations: iBUG Group, Imperial College London, UK; Facebook London, UK; Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
Notes: Accepted for publication at Interspeech 2021
Link: https://arxiv.org/abs/2106.09171
Abstract: The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.

【6】 SPeCiaL: Self-Supervised Pretraining for Continual Learning

Authors: Lucas Caccia, Joelle Pineau
Affiliations: McGill University, Facebook AI Research
Link: https://arxiv.org/abs/2106.09065
Abstract: This paper presents SPeCiaL: a method for unsupervised pretraining of representations tailored for continual learning. Our approach devises a meta-learning objective that differentiates through a sequential learning process. Specifically, we train a linear model over the representations to match different augmented views of the same image together, each view presented sequentially. The linear model is then evaluated on both its ability to classify images it just saw, and also on images from previous iterations. This gives rise to representations that favor quick knowledge retention with minimal forgetting. We evaluate SPeCiaL in the Continual Few-Shot Learning setting, and show that it can match or outperform other supervised pretraining approaches.

【7】 Unsupervised Video Prediction from a Single Frame by Estimating 3D Dynamic Scene Structure

Authors: Paul Henderson, Christoph H. Lampert, Bernd Bickel
Institutions: Institute of Science and Technology (IST) Austria
Link: https://arxiv.org/abs/2106.09051
Abstract: Our goal in this work is to generate realistic videos given just one initial frame as input. Existing unsupervised approaches to this task do not consider the fact that a video typically shows a 3D environment, and that this should remain coherent from frame to frame even as the camera and objects move. We address this by developing a model that first estimates the latent 3D structure of the scene, including the segmentation of any moving objects. It then predicts future frames by simulating the object and camera dynamics, and rendering the resulting views. Importantly, it is trained end-to-end using only the unsupervised objective of predicting future frames, without any 3D information nor segmentation annotations. Experiments on two challenging datasets of natural videos show that our model can estimate 3D structure and motion segmentation from a single frame, and hence generate plausible and varied predictions.

【8】 Localized Uncertainty Attacks

Authors: Ousmane Amadou Dia, Theofanis Karaletsos, Caner Hazirbas, Cristian Canton Ferrer, Ilknur Kaynar Kabul, Erik Meijer
Institutions: Facebook
Note: CVPR 2021 Workshop on Adversarial Machine Learning in Computer Vision
Link: https://arxiv.org/abs/2106.09222
Abstract: The susceptibility of deep learning models to adversarial perturbations has stirred renewed attention in adversarial examples, resulting in a number of attacks. However, most of these attacks fail to encompass a large spectrum of adversarial perturbations that are imperceptible to humans. In this paper, we present localized uncertainty attacks, a novel class of threat models against deterministic and stochastic classifiers. Under this threat model, we create adversarial examples by perturbing only regions in the inputs where a classifier is uncertain. To find such regions, we utilize the predictive uncertainty of the classifier when the classifier is stochastic, or we learn a surrogate model to amortize the uncertainty when it is deterministic. Unlike $\ell_p$ ball or functional attacks which perturb inputs indiscriminately, our targeted changes can be less perceptible. When considered under our threat model, these attacks still produce strong adversarial examples, while the examples retain a greater degree of similarity with the inputs.
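
A minimal sketch of the idea, restricting an FGSM-style step to the input regions most tied to the classifier's uncertainty, might look as follows. The entropy-gradient proxy for "uncertain regions", the top-k mask, and all parameters are illustrative choices; the paper instead uses the stochastic classifier's predictive uncertainty or a learned surrogate.

```python
import torch
import torch.nn.functional as F

def localized_fgsm(model, x, y, eps=8 / 255, keep_frac=0.25):
    """FGSM variant that perturbs only the fraction of pixels to which the
    classifier's predictive entropy is most sensitive -- a crude stand-in
    for 'regions where the classifier is uncertain'."""
    x = x.clone().requires_grad_(True)
    probs = F.softmax(model(x), dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
    sens = torch.autograd.grad(entropy, x)[0].abs()

    # Keep only the top-k most uncertainty-sensitive pixels per image.
    k = int(keep_frac * sens[0].numel())
    thresh = sens.flatten(1).topk(k, dim=1).values[:, -1].view(-1, 1, 1, 1)
    mask = (sens >= thresh).float()

    # Standard FGSM step, masked to the selected regions only.
    x = x.detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign() * mask).clamp(0, 1).detach()
```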

Temporal|Action Recognition|Pose|Video|Motion Estimation (2 papers)

【1】 BABEL: Bodies, Action and Behavior with English Labels

Authors: Abhinanda R. Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quirós-Ramírez, Michael J. Black
Institutions: Max Planck Institute for Intelligent Systems, Tübingen, Germany; Universität Konstanz, Konstanz, Germany
Note: 11 pages, 4 figures, Accepted in CVPR'21
Link: https://arxiv.org/abs/2106.09696
Abstract: Understanding the semantics of human movement -- the what, how and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences. BABEL consists of action labels for about 43 hours of mocap sequences from AMASS. Action labels are at two levels of abstraction -- sequence labels describe the overall action in the sequence, and frame labels describe all actions in every frame of the sequence. Each frame label is precisely aligned with the duration of the corresponding action in the mocap sequence, and multiple actions can overlap. There are over 28k sequence labels, and 63k frame labels in BABEL, which belong to over 250 unique action categories. Labels from BABEL can be leveraged for tasks like action recognition, temporal action localization, motion synthesis, etc. To demonstrate the value of BABEL as a benchmark, we evaluate the performance of models on 3D action recognition. We demonstrate that BABEL poses interesting learning challenges that are applicable to real-world scenarios, and can serve as a useful benchmark of progress in 3D action recognition. The dataset, baseline method, and evaluation code are made available and supported for academic research purposes at https://babel.is.tue.mpg.de/.

【2】 Optical Mouse: 3D Mouse Pose From Single-View Video

Authors: Bo Hu, Bryan Seybold, Shan Yang, David Ross, Avneesh Sud, Graham Ruby, Yi Liu
Institutions: Google LLC, Calico Life Sciences LLC
Link: https://arxiv.org/abs/2106.09251
Abstract: We present a method to infer the 3D pose of mice, including the limbs and feet, from monocular videos. Many human clinical conditions and their corresponding animal models result in abnormal motion, and accurately measuring 3D motion at scale offers insights into health. The 3D poses improve classification of health-related attributes over 2D representations. The inferred poses are accurate enough to estimate stride length even when the feet are mostly occluded. This method could be applied as part of a continuous monitoring system to non-invasively measure animal health.

Medical (1 paper)

【1】 Deformation Driven Seq2Seq Longitudinal Tumor and Organs-at-Risk Prediction for Radiotherapy

Authors: Donghoon Lee, Sadegh R Alam, Jue Jiang, Pengpeng Zhang, Saad Nadeem, Yu-Chi Hu
Institutions: Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
Note: Medical Physics 2021; Saad Nadeem and Yu-Chi Hu contributed equally
Link: https://arxiv.org/abs/2106.09076
Abstract: Purpose: Radiotherapy presents unique challenges and clinical requirements for longitudinal tumor and organ-at-risk (OAR) prediction during treatment. The challenges include tumor inflammation/edema and radiation-induced changes in organ geometry, whereas the clinical requirements demand flexibility in input/output sequence timepoints to update the predictions on a rolling basis and the grounding of all predictions in relationship to the pre-treatment imaging information for response and toxicity assessment in adaptive radiotherapy. Methods: To deal with the aforementioned challenges and to comply with the clinical requirements, we present a novel 3D sequence-to-sequence model based on Convolution Long Short Term Memory (ConvLSTM) that makes use of series of deformation vector fields (DVF) between individual timepoints and reference pre-treatment/planning CTs to predict future anatomical deformations and changes in gross tumor volume as well as critical OARs. High-quality DVF training data is created by employing hyper-parameter optimization on the subset of the training data with DICE coefficient and mutual information metric. We validated our model on two radiotherapy datasets: a publicly available head-and-neck dataset (28 patients with manually contoured pre-, mid-, and post-treatment CTs), and an internal non-small cell lung cancer dataset (63 patients with manually contoured planning CT and 6 weekly CBCTs). Results: The use of DVF representation and skip connections overcomes the blurring issue of ConvLSTM prediction with the traditional image representation. The mean and standard deviation of DICE for predictions of lung GTV at week 4, 5, and 6 were 0.83$\pm$0.09, 0.82$\pm$0.08, and 0.81$\pm$0.10, respectively, and for post-treatment ipsilateral and contralateral parotids, were 0.81$\pm$0.06 and 0.85$\pm$0.02.
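
PyTorch ships no ConvLSTM module, so a short sketch helps make the seq2seq idea concrete. The cell below is a generic ConvLSTM (LSTM gates implemented as 2D convolutions over spatially arranged states) rolled over a toy sequence of deformation-field-like frames; the channel counts, the use of 2D slices instead of 3D volumes, and the one-layer decoder are illustrative simplifications, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: an LSTM whose gates are 2D convolutions,
    so hidden state and cell state keep a spatial layout."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g
        h = o * c.tanh()
        return h, c

# Roll the cell over a sequence of deformation-field-like frames
# (here 2D slices with 3 displacement channels; all sizes are made up).
cell = ConvLSTMCell(in_ch=3, hid_ch=16)
B, T, H, W = 2, 4, 64, 64
seq = torch.randn(B, T, 3, H, W)
h = torch.zeros(B, 16, H, W)
c = torch.zeros_like(h)
for t in range(T):
    h, c = cell(seq[:, t], (h, c))
next_dvf = nn.Conv2d(16, 3, 1)(h)  # decode the last hidden state into the next field
```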

GAN|Adversarial|Attacks|Generation (1 paper)

【1】 Adversarial Visual Robustness by Causal Intervention

Authors: Kaihua Tang, Mingyuan Tao, Hanwang Zhang
Institutions: Nanyang Technological University, Alibaba Group
Note: Codes are available at this https URL
Link: https://arxiv.org/abs/2106.09534
Abstract: Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. To achieve a proactive defense, we need a more fundamental understanding of adversarial examples, beyond the popular bounded threat model. In this paper, we provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning, where attackers are precisely exploiting the confounding effect. Therefore, a fundamental solution for adversarial robustness is causal intervention. As the confounder is unobserved in general, we propose to use the instrumental variable that achieves intervention without the need for confounder observation. We term our robust training method as Causal intervention by instrumental Variable (CiiV). It has a differentiable retinotopic sampling layer and a consistency loss, which is stable and guaranteed not to suffer from gradient obfuscation. Extensive experiments on a wide spectrum of attackers and settings applied in MNIST, CIFAR-10, and mini-ImageNet datasets empirically demonstrate that CiiV is robust to adaptive attacks.

Autonomous Driving|Vehicles|Lane Detection (2 papers)

【1】 How can we learn (more) from challenges? A statistical approach to driving future algorithm development

Authors: Tobias Roß, Pierangela Bruno, Annika Reinke, Manuel Wiesenfarth, Lisa Koeppel, Peter M. Full, Bünyamin Pekdemir, Patrick Godau, Darya Trofimova, Fabian Isensee, Sara Moccia, Francesco Calimeri, Beat P. Müller-Stich, Annette Kopp-Schneider, Lena Maier-Hein
Institutions: Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ), Heidelberg, Germany; Medical Faculty, Heidelberg University, Heidelberg, Germany; HIP Helmholtz Imaging Platform, German Cancer Research Center (DKFZ), Heidelberg, Germany
Link: https://arxiv.org/abs/2106.09302
Abstract: Challenges have become the state-of-the-art approach to benchmark image analysis algorithms in a comparative manner. While the validation on identical data sets was a great step forward, results analysis is often restricted to pure ranking tables, leaving relevant questions unanswered. Specifically, little effort has been put into the systematic investigation of what characterizes images in which state-of-the-art algorithms fail. To address this gap in the literature, we (1) present a statistical framework for learning from challenges and (2) instantiate it for the specific task of instrument instance segmentation in laparoscopic videos. Our framework relies on the semantic meta data annotation of images, which serves as foundation for a General Linear Mixed Models (GLMM) analysis. Based on 51,542 meta data annotations performed on 2,728 images, we applied our approach to the results of the Robust Medical Instrument Segmentation Challenge (ROBUST-MIS) challenge 2019 and revealed underexposure, motion and occlusion of instruments as well as the presence of smoke or other objects in the background as major sources of algorithm failure. Our subsequent method development, tailored to the specific remaining issues, yielded a deep learning model with state-of-the-art overall performance and specific strengths in the processing of images in which previous methods tended to fail, such as the segmentation of small, crossing, moving and transparent instrument parts. Due to the objectivity and generic applicability of our approach, it could become a valuable tool for validation in the field of medical image analysis and beyond.
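
The statistical core, relating per-image performance to semantic metadata while accounting for grouping structure, can be approximated in a few lines with statsmodels. Note that mixedlm fits a Gaussian linear mixed model, and the data frame below is entirely made up, so this is only a schematic stand-in for the paper's GLMM analysis.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-image records: one row per (algorithm, image), with the
# achieved Dice score and binary semantic meta-data flags for that image.
df = pd.DataFrame({
    "dice":         [0.91, 0.42, 0.88, 0.35, 0.86, 0.31, 0.83, 0.57,
                     0.90, 0.44, 0.79, 0.52],
    "underexposed": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "smoke":        [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    "algorithm":    ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
})

# Linear mixed model: fixed effects for image properties, a random intercept
# per algorithm. statsmodels' mixedlm is Gaussian-only, so this only
# approximates the paper's more general GLMM analysis.
result = smf.mixedlm("dice ~ underexposed + smoke", df,
                     groups=df["algorithm"]).fit()
print(result.summary())  # negative coefficients flag harmful image conditions
```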

【2】 Invisible for both Camera and LiDAR: Security of Multi-Sensor Fusion based Perception in Autonomous Driving Under Physical-World Attacks

Authors: Yulong Cao*, Ningfei Wang*, Chaowei Xiao*, Dawei Yang*, Jin Fang, Ruigang Yang, Qi Alfred Chen, Mingyan Liu, Bo Li
Institutions: NVIDIA Research; Arizona State University; Inceptio; Baidu Research and National Engineering Laboratory of Deep Learning Technology and Application, China
Note: Accepted by IEEE S&P 2021
Link: https://arxiv.org/abs/2106.09249
Abstract: In Autonomous Driving (AD) systems, perception is both security and safety critical. Despite various prior studies on its security issues, all of them only consider attacks on camera- or LiDAR-based AD perception alone. However, production AD systems today predominantly adopt a Multi-Sensor Fusion (MSF) based design, which in principle can be more robust against these attacks under the assumption that not all fusion sources are (or can be) attacked at the same time. In this paper, we present the first study of security issues of MSF-based perception in AD systems. We directly challenge the basic MSF design assumption above by exploring the possibility of attacking all fusion sources simultaneously. This allows us for the first time to understand how much security guarantee MSF can fundamentally provide as a general defense strategy for AD perception. We formulate the attack as an optimization problem to generate a physically-realizable, adversarial 3D-printed object that misleads an AD system to fail in detecting it and thus crash into it. We propose a novel attack pipeline that addresses two main design challenges: (1) non-differentiable target camera and LiDAR sensing systems, and (2) non-differentiable cell-level aggregated features popularly used in LiDAR-based AD perception. We evaluate our attack on MSF included in representative open-source industry-grade AD systems in real-world driving scenarios. Our results show that the attack achieves over 90% success rate across different object types and MSF. Our attack is also found stealthy, robust to victim positions, transferable across MSF algorithms, and physical-world realizable after being 3D-printed and captured by LiDAR and camera devices. To concretely assess the end-to-end safety impact, we further perform simulation evaluation and show that it can cause a 100% vehicle collision rate for an industry-grade AD system.

Attention (1 paper)

【1】 Multi-level Motion Attention for Human Motion Prediction

Authors: Wei Mao, Miaomiao Liu, Mathieu Salzmann, Hongdong Li
Note: Accepted by IJCV. arXiv admin note: substantial text overlap with arXiv:2007.11755
Link: https://arxiv.org/abs/2106.09300
Abstract: Human motion prediction aims to forecast future human poses given a historical motion. Whether based on recurrent or feed-forward neural networks, existing learning-based methods fail to model the observation that human motion tends to repeat itself, even for complex sports actions and cooking activities. Here, we introduce an attention-based feed-forward network that explicitly leverages this observation. In particular, instead of modeling frame-wise attention via pose similarity, we propose to extract motion attention to capture the similarity between the current motion context and the historical motion sub-sequences. In this context, we study the use of different types of attention, computed at joint, body part, and full pose levels. Aggregating the relevant past motions and processing the result with a graph convolutional network allows us to effectively exploit motion patterns from the long-term history to predict the future poses. Our experiments on Human3.6M, AMASS and 3DPW validate the benefits of our approach for both periodical and non-periodical actions. Thanks to our attention model, it yields state-of-the-art results on all three datasets. Our code is available at https://github.com/wei-mao-2019/HisRepItself.
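
A stripped-down version of the motion attention can be sketched as dot-product attention between the most recent frames (the query) and all same-length historical sub-sequences (keys/values). The paper additionally works on DCT coefficients, multiple attention levels, and a GCN predictor, which this sketch omits; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MotionAttention(nn.Module):
    """Dot-product attention between the current motion context (the last
    `win` observed frames, used as the query) and every historical
    sub-sequence of the same length (keys/values)."""
    def __init__(self, pose_dim=66, win=10, d_model=64):
        super().__init__()
        self.win = win
        self.q = nn.Linear(win * pose_dim, d_model)
        self.k = nn.Linear(win * pose_dim, d_model)

    def forward(self, history):                    # history: (B, T, pose_dim)
        B, T, D = history.shape
        query = self.q(history[:, -self.win:].reshape(B, -1))         # (B, d_model)
        subseq = history.unfold(1, self.win, 1)                       # (B, N, D, win)
        flat = subseq.permute(0, 1, 3, 2).reshape(B, subseq.size(1), -1)  # (B, N, win*D)
        keys = self.k(flat)                                           # (B, N, d_model)
        scores = (keys * query.unsqueeze(1)).sum(-1) / keys.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=1)        # one weight per sub-sequence
        return (attn.unsqueeze(-1) * flat).sum(1)  # (B, win*D) aggregated past motion

att = MotionAttention()
context = att(torch.randn(4, 50, 66))  # 50 observed frames of a 66-D pose (22 joints x 3)
```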

Faces|Crowd Counting (1 paper)

【1】 Using multiple losses for accurate facial age estimation

Authors: Yi Zhou, Heikki Huttunen, Tapio Elomaa
Institutions: Department of Computing Sciences, Tampere University, Finland
Link: https://arxiv.org/abs/2106.09393
Abstract: Age estimation is an essential challenge in computer vision. With the advances of convolutional neural networks, the performance of age estimation has been dramatically improved. Existing approaches usually treat age estimation as a classification problem. However, the age labels are ambiguous, thus making the classification task difficult. In this paper, we propose a simple yet effective approach for age estimation, which improves the performance compared to classification-based methods. The method combines four classification losses and one regression loss representing different class granularities together, and we name it Age-Granularity-Net. We validate the Age-Granularity-Net framework on the CVPR ChaLearn 2016 dataset, and extensive experiments show that the proposed approach can reduce the prediction error compared to any individual loss. The source code link is https://github.com/yipersevere/age-estimation.
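
The loss combination is straightforward to sketch: one shared feature vector feeds several classification heads over age bins of different widths, plus a regression head. The bin widths, feature size, and the L1 regression term below are illustrative choices, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgeGranularityHead(nn.Module):
    """One shared feature vector feeds several classifiers, each over age
    bins of a different width, plus a direct regression output."""
    def __init__(self, feat_dim=512, max_age=100, bin_widths=(1, 5, 10, 20)):
        super().__init__()
        self.bin_widths = bin_widths
        self.cls_heads = nn.ModuleList(
            nn.Linear(feat_dim, max_age // w) for w in bin_widths)
        self.reg_head = nn.Linear(feat_dim, 1)

    def loss(self, feats, age):
        # Four cross-entropy terms at coarser and coarser label granularity ...
        total = sum(
            F.cross_entropy(head(feats),
                            (age // w).clamp(max=head.out_features - 1))
            for head, w in zip(self.cls_heads, self.bin_widths))
        # ... plus one regression term on the raw age.
        return total + F.l1_loss(self.reg_head(feats).squeeze(1), age.float())

head = AgeGranularityHead()
feats = torch.randn(8, 512)          # backbone features for a batch of faces
age = torch.randint(0, 100, (8,))    # integer age labels
head.loss(feats, age).backward()
```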

Tracking (1 paper)

【1】 Privacy-Preserving Eye-tracking Using Deep Learning

Authors: Salman Seyedi, Zifan Jiang, Allan Levey, Gari D. Clifford
Institutions: Dept. of Biomedical Informatics, Emory School of Medicine, Atlanta, Georgia; Dept. of Biomedical Engineering, Georgia Institute of Technology; Dept. of Neurology
Link: https://arxiv.org/abs/2106.09621
Abstract: The expanding usage of complex machine learning methods like deep learning has led to an explosion in human activity recognition, particularly applied to health. In particular, as part of a larger body sensor network system, face and full-body analysis is becoming increasingly common for evaluating health status. However, complex models which handle private and sometimes protected data raise concerns about the potential leak of identifiable data. In this work, we focus on the case of a deep network model trained on images of individual faces. Full-face video recordings taken from 493 individuals undergoing an eye-tracking-based evaluation of neurological function were used. Outputs, gradients, intermediate layer outputs, loss, and labels were used as inputs for a deep network with an added support vector machine emission layer to recognize membership in the training data. The inference attack method and associated mathematical analysis indicate that there is a low likelihood of unintended memorization of facial features in the deep learning model. This study shows that the model preserves the integrity of training data with reasonable confidence. The same process can be implemented in similar conditions for different models.
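
The probe described above can be reduced to a small sketch: featurize each example by the model's outputs and train a classifier to separate training members from held-out non-members. The SVM here uses only softmax-derived features (the paper also feeds gradients, intermediate activations, loss, and labels into a deep network before the SVM layer), and all data are random stand-ins.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def attack_features(probs, labels):
    """Per-example features a membership-inference attack might use:
    confidence on the true class, max softmax score, and predictive entropy."""
    true_conf = probs[np.arange(len(labels)), labels]
    max_conf = probs.max(axis=1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.stack([true_conf, max_conf, entropy], axis=1)

rng = np.random.default_rng(0)
# Stand-ins for softmax outputs on training members vs. held-out non-members.
member_probs = rng.dirichlet(np.ones(10) * 0.3, size=500)
nonmember_probs = rng.dirichlet(np.ones(10) * 0.3, size=500)
labels = rng.integers(0, 10, size=500)

X = np.vstack([attack_features(member_probs, labels),
               attack_features(nonmember_probs, labels)])
y = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = "was in training set"

# Chance-level accuracy (~0.5) suggests little unintended memorization.
print(cross_val_score(SVC(), X, y, cv=5).mean())
```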

Super-Resolution|Denoising|Deblurring|Dehazing (1 paper)

【1】 Controllable Confidence-Based Image Denoising

Authors: Haley Owsianko, Florian Cassayre, Qiyuan Liang
Institutions: École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Link: https://arxiv.org/abs/2106.09311
Abstract: Image denoising is a classic restoration problem. Yet, current deep learning methods are subject to problems of generalization and interpretability. To mitigate these problems, in this project, we present a framework that is capable of controllable, confidence-based noise removal. The framework is based on the fusion between two different denoised images, both derived from the same noisy input. One of the two is denoised using generic algorithms (e.g. Gaussian), which make few assumptions on the input images and therefore generalize in all scenarios. The other is denoised using deep learning, performing well on seen datasets. We introduce a set of techniques to fuse the two components smoothly in the frequency domain. Beyond that, we estimate the confidence of a deep learning denoiser to allow users to interpret the output, and provide a fusion strategy that safeguards them against out-of-distribution inputs. Through experiments, we demonstrate the effectiveness of the proposed framework in different use cases.
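
The fusion step can be sketched directly: transform both denoised results to the Fourier domain, blend with a per-frequency confidence weight, and invert. The radial weight below is a hand-made stand-in for the learned confidence, and all arrays are random placeholders.

```python
import numpy as np

def fuse_in_frequency(deep_out, generic_out, confidence):
    """Blend two denoised images in the Fourier domain. `confidence` is a
    per-frequency weight in [0, 1] for the deep output; in the framework it
    would come from a confidence estimator, here it is just an array."""
    F_deep = np.fft.fft2(deep_out)
    F_gen = np.fft.fft2(generic_out)
    fused = confidence * F_deep + (1.0 - confidence) * F_gen
    return np.real(np.fft.ifft2(fused))

h = w = 128
deep_out = np.random.rand(h, w)     # stand-in for the learned denoiser's output
generic_out = np.random.rand(h, w)  # stand-in for e.g. a Gaussian-filtered image

# Smooth radial weight: trust the deep model at low frequencies, fall back
# to the generic result at high frequencies (one possible hand-made choice).
fy, fx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
radius = np.sqrt(fx**2 + fy**2)
confidence = np.clip(1.0 - radius / radius.max(), 0.0, 1.0)

fused = fuse_in_frequency(deep_out, generic_out, confidence)
```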

Point Clouds|SLAM|Radar|LiDAR|Depth/RGB-D (1 paper)

【1】 Layer Folding: Neural Network Depth Reduction using Activation Linearization

Authors: Amir Ben Dror, Niv Zehngut, Avraham Raviv, Evgeny Artyomov, Ran Vitek, Roy Jevnisek
Institutions: Samsung Israel Research Center
Link: https://arxiv.org/abs/2106.09309
Abstract: Despite the increasing prevalence of deep neural networks, their applicability in resource-constrained devices is limited due to their computational load. While modern devices exhibit a high level of parallelism, real-time latency is still highly dependent on networks' depth. Although recent works show that below a certain depth, the width of shallower networks must grow exponentially, we presume that neural networks typically exceed this minimal depth to accelerate convergence and incrementally increase accuracy. This motivates us to transform pre-trained deep networks that already exploit such advantages into shallower forms. We propose a method that learns whether non-linear activations can be removed, allowing to fold consecutive linear layers into one. We apply our method to networks pre-trained on CIFAR-10 and CIFAR-100 and find that they can all be transformed into shallower forms that share a similar depth. Finally, we use our method to provide more efficient alternatives to MobileNetV2 and EfficientNet-Lite architectures on the ImageNet classification task.
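
The core mechanics can be sketched in a few lines: if the learnable activation between two linear layers collapses to the identity, the pair folds into a single layer with composed weights. The PReLU stand-in for the learnable activation and the toy sizes are illustrative; the paper applies the idea to convolutional networks.

```python
import torch
import torch.nn as nn

def fold_linears(l1: nn.Linear, l2: nn.Linear) -> nn.Linear:
    """If the activation between l1 and l2 has become the identity,
    l2(l1(x)) = (W2 W1) x + (W2 b1 + b2) is a single linear layer."""
    folded = nn.Linear(l1.in_features, l2.out_features)
    with torch.no_grad():
        folded.weight.copy_(l2.weight @ l1.weight)
        folded.bias.copy_(l2.weight @ l1.bias + l2.bias)
    return folded

l1, l2 = nn.Linear(16, 32), nn.Linear(32, 8)
act = nn.PReLU(num_parameters=1)    # learnable slope; slope -> 1 means identity
act.weight.data.fill_(1.0)          # pretend training drove the slope to 1

x = torch.randn(4, 16)
deep = l2(act(l1(x)))               # original two-layer path
shallow = fold_linears(l1, l2)(x)   # folded one-layer path
print(torch.allclose(deep, shallow, atol=1e-6))  # True: same function, less depth
```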

Multimodal (1 paper)

【1】 A Two-stage Multi-modal Affect Analysis Framework for Children with Autism Spectrum Disorder

Authors: Jicheng Li, Anjana Bhat, Roghayeh Barmaki
Institutions: University of Delaware
Link: https://arxiv.org/abs/2106.09199
Abstract: Autism spectrum disorder (ASD) is a developmental disorder that influences the communication and social behavior of a person in a way that those on the spectrum have difficulty perceiving other people's facial expressions, as well as presenting and communicating emotions and affect via their own faces and bodies. Some efforts have been made to predict and improve the affect states of children with ASD in play therapy, a common method to improve children's social skills via play and games. However, many previous works only used models pre-trained on benchmark emotion datasets and failed to consider the distinction in emotion between typically developing children and children with autism. In this paper, we present an open-source two-stage multi-modal approach leveraging acoustic and visual cues to predict three main affect states (positive, negative, and neutral) of children with ASD in real-world play therapy scenarios, and achieve an overall accuracy of 72.40%. This work presents a novel way to combine human expertise and machine intelligence for ASD affect recognition by proposing a two-stage schema.

Other Neural Networks|Deep Learning|Models|Modeling (13 papers)

【1】 Multi-Label Learning from Single Positive Labels

Authors: Elijah Cole, Oisin Mac Aodha, Titouan Lorieul, Pietro Perona, Dan Morris, Nebojsa Jojic
Institutions: Caltech; University of Edinburgh; Inria; Microsoft AI for Earth; Microsoft Research
Note: CVPR 2021
Link: https://arxiv.org/abs/2106.09708
Abstract: Predicting all applicable labels for a given image is known as multi-label classification. Compared to the standard multi-class case (where each image has only one label), it is considerably more challenging to annotate training data for multi-label classification. When the number of potential labels is large, human annotators find it difficult to mention all applicable labels for each training image. Furthermore, in some settings detection is intrinsically difficult, e.g. finding small object instances in high resolution images. As a result, multi-label training data is often plagued by false negatives. We consider the hardest version of this problem, where annotators provide only one relevant label for each image. As a result, training sets will have only one positive label per image and no confirmed negatives. We explore this special case of learning from missing labels across four different multi-label image classification datasets for both linear classifiers and end-to-end fine-tuned deep networks. We extend existing multi-label losses to this setting and propose novel variants that constrain the number of expected positive labels during training. Surprisingly, we show that in some cases it is possible to approach the performance of fully labeled classifiers despite training with significantly fewer confirmed labels.
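
A minimal sketch of this setting combines the "assume negative" binary cross-entropy with a regularizer that keeps the expected number of predicted positives near a target value. The target value, weighting, and exact loss form are illustrative, not the paper's exact variants.

```python
import torch
import torch.nn.functional as F

def single_positive_loss(logits, pos_idx, expected_pos=2.9, lam=0.1):
    """Single-positive multi-label loss sketch: binary cross-entropy that
    treats the one observed label as positive and every other class as
    negative ('assume negative'), plus a penalty keeping the expected
    number of predicted positives per image near a target value."""
    B, C = logits.shape
    targets = torch.zeros(B, C)
    targets[torch.arange(B), pos_idx] = 1.0  # the single confirmed positive
    bce = F.binary_cross_entropy_with_logits(logits, targets)

    probs = torch.sigmoid(logits)
    k_hat = probs.sum(dim=1).mean()          # expected positives per image
    return bce + lam * (k_hat - expected_pos) ** 2

logits = torch.randn(16, 80, requires_grad=True)  # e.g. 80 COCO-style classes
pos_idx = torch.randint(0, 80, (16,))             # one observed label per image
single_positive_loss(logits, pos_idx).backward()
```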

【2】 Learning to Predict Visual Attributes in the Wild

Authors: Khoi Pham, Kushal Kafle, Zhe Lin, Zhihong Ding, Scott Cohen, Quan Tran, Abhinav Shrivastava
Institutions: University of Maryland, College Park; Adobe Research
Note: Accepted to CVPR 2021
Link: https://arxiv.org/abs/2106.09707
Abstract: Visual attributes constitute a large portion of information contained in a scene. Objects can be described using a wide variety of attributes which portray their visual appearance (color, texture), geometry (shape, size, posture), and other intrinsic properties (state, action). Existing work is mostly limited to the study of attribute prediction in specific domains. In this paper, we introduce a large-scale in-the-wild visual attribute prediction dataset consisting of over 927K attribute annotations for over 260K object instances. Formally, object attribute prediction is a multi-label classification problem where all attributes that apply to an object must be predicted. Our dataset poses significant challenges to existing methods due to the large number of attributes, label sparsity, data imbalance, and object occlusion. To this end, we propose several techniques that systematically tackle these challenges, including a base model that utilizes both low- and high-level CNN features with multi-hop attention, reweighting and resampling techniques, a novel negative label expansion scheme, and a novel supervised attribute-aware contrastive learning algorithm. Using these techniques, we achieve near 3.7 mAP and 5.7 overall F1 points improvement over the current state of the art. Further details about the VAW dataset can be found at http://vawdataset.com/.

【3】 Always Be Dreaming: A New Approach for Data-Free Class-Incremental Learning

Authors: James Smith, Yen-Chang Hsu, Jonathan Balloch, Yilin Shen, Hongxia Jin, Zsolt Kira
Institutions: Georgia Institute of Technology; Samsung Research America
Link: https://arxiv.org/abs/2106.09701
Abstract: Modern computer vision applications suffer from catastrophic forgetting when incrementally learning new concepts over time. The most successful approaches to alleviate this forgetting require extensive replay of previously seen data, which is problematic when memory constraints or data legality concerns exist. In this work, we consider the high-impact problem of Data-Free Class-Incremental Learning (DFCIL), where an incremental learning agent must learn new concepts over time without storing generators or training data from past tasks. One approach for DFCIL is to replay synthetic images produced by inverting a frozen copy of the learner's classification model, but we show this approach fails for common class-incremental benchmarks when using standard distillation strategies. We diagnose the cause of this failure and propose a novel incremental distillation strategy for DFCIL, contributing a modified cross-entropy training and importance-weighted feature distillation, and show that our method results in up to a 25.1% increase in final task accuracy (absolute difference) compared to SOTA DFCIL methods for common class-incremental benchmarks. Our method even outperforms several standard replay based methods which store a coreset of images.
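
The two training ingredients named above can be sketched as losses: a cross-entropy restricted to the new task's classes, and a feature-distillation term against the frozen previous model. Everything here, from the importance weights to the class split, is a stand-in rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def dfcil_losses(feats, old_feats, logits, y, new_class_mask, importance):
    """Sketch of (1) cross-entropy computed over the new task's classes only,
    and (2) feature distillation against the frozen previous model, weighted
    per-dimension by an importance vector (all tensors are stand-ins)."""
    # (1) local cross-entropy: mask out old-class logits for new-task data.
    masked_logits = logits.masked_fill(~new_class_mask, float("-inf"))
    ce = F.cross_entropy(masked_logits, y)
    # (2) importance-weighted feature distillation on (synthetic) replay data.
    fd = (importance * (feats - old_feats) ** 2).mean()
    return ce + fd

num_old, num_new, d = 10, 5, 128
feats = torch.randn(8, d, requires_grad=True)
old_feats = torch.randn(8, d)                       # frozen old model's features
logits = torch.randn(8, num_old + num_new, requires_grad=True)
y = torch.randint(num_old, num_old + num_new, (8,))
mask = (torch.arange(num_old + num_new) >= num_old) # True for new classes
importance = torch.rand(d)                          # per-feature weights
dfcil_losses(feats, old_feats, logits, y,
             mask.unsqueeze(0).expand(8, -1), importance).backward()
```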

【4】 Orthogonal-Padé Activation Functions: Trainable Activation Functions for Smooth and Faster Convergence in Deep Networks

Authors: Koushik Biswas, Shilpak Banerjee, Ashish Kumar Pandey
Note: 11 pages
Link: https://arxiv.org/abs/2106.09693
Abstract: We have proposed orthogonal-Padé activation functions, which are trainable activation functions, and show that they have faster learning capability and improve the accuracy on standard deep learning datasets and models. Based on our experiments, we have found the two best candidates out of six orthogonal-Padé activations, which we call safe Hermite-Padé (HP) activation functions, namely HP-1 and HP-2. When compared to ReLU, HP-1 and HP-2 increase top-1 accuracy by 5.06% and 4.63% respectively on PreActResNet-34, and by 3.02% and 2.75% respectively with the MobileNet V2 model on the CIFAR100 dataset, while on the CIFAR10 dataset top-1 accuracy increases by 2.02% and 1.78% respectively on PreActResNet-34, by 2.24% and 2.06% respectively on LeNet, and by 2.15% and 2.03% respectively on EfficientNet B0.
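
A trainable rational activation can be sketched as follows; the polynomial degrees, the random initialization, and the exact "safe" denominator are illustrative choices in the spirit of earlier Padé activation units, not the authors' orthogonal construction.

```python
import torch
import torch.nn as nn

class PadeActivation(nn.Module):
    """Trainable rational activation y = P(x) / Q(x) with learnable
    coefficients. The 'safe' denominator uses |b_j| and a leading 1 so it
    never vanishes (degrees and initialization are illustrative)."""
    def __init__(self, m=5, n=4):
        super().__init__()
        self.a = nn.Parameter(torch.randn(m + 1) * 0.1)  # numerator coefficients
        self.b = nn.Parameter(torch.randn(n) * 0.1)      # denominator coefficients

    def forward(self, x):
        num = sum(a_k * x ** k for k, a_k in enumerate(self.a))
        den = 1.0 + sum(self.b[j].abs() * x.abs() ** (j + 1)
                        for j in range(len(self.b)))
        return num / den

act = PadeActivation()
x = torch.randn(32, 64)
y = act(x)              # same shape, learnable nonlinearity
y.sum().backward()      # coefficients receive gradients like any other weight
```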

【5】 On Anytime Learning at Macroscale

Authors: Lucas Caccia, Jing Xu, Myle Ott, Marc'Aurelio Ranzato, Ludovic Denoyer
Institutions: Facebook AI Research; MILA - McGill University
Link: https://arxiv.org/abs/2106.09563
Abstract: Classical machine learning frameworks assume access to a possibly large dataset in order to train a predictive model. In many practical applications however, data does not arrive all at once, but in batches over time. This creates a natural trade-off between accuracy of a model and time to obtain such a model. A greedy predictor could produce non-trivial predictions by immediately training on batches as soon as these become available, but it may also make sub-optimal use of future data. On the other hand, a tardy predictor could wait for a long time to aggregate several batches into a larger dataset, but ultimately deliver a much better performance. In this work, we consider such a streaming learning setting, which we dub anytime learning at macroscale (ALMA). It is an instance of anytime learning applied not at the level of a single chunk of data, but at the level of the entire sequence of large batches. We first formalize this learning setting, then introduce metrics to assess how well learners perform on the given task for a given memory and compute budget, and finally we test several baseline approaches on standard benchmarks repurposed for anytime learning at macroscale. The general finding is that bigger models always generalize better. In particular, it is important to grow model capacity over time if the initial model is relatively small. Moreover, updating the model at an intermediate rate strikes the best trade-off between accuracy and time to obtain a useful predictor.

【6】 On the Dark Side of Calibration for Modern Neural Networks

Authors: Aditya Singh, Alessandro Bay, Biswa Sengupta, Andrea Mirabile
Institutions: Zebra AI, Zebra Technologies, London, United Kingdom
Note: 15 pages including references and supplemental
Link: https://arxiv.org/abs/2106.09385
Abstract: Modern neural networks are highly uncalibrated. This poses a significant challenge for safety-critical systems to utilise deep neural networks (DNNs) reliably. Many recently proposed approaches have demonstrated substantial progress in improving DNN calibration. However, they hardly touch upon refinement, which historically has been an essential aspect of calibration. Refinement indicates separability of a network's correct and incorrect predictions. This paper presents a theoretically and empirically supported exposition for reviewing a model's calibration and refinement. Firstly, we show the breakdown of expected calibration error (ECE) into predicted confidence and refinement. Connecting with this result, we highlight that regularisation-based calibration only focuses on naively reducing a model's confidence. This logically has a severe downside to a model's refinement. We support our claims through rigorous empirical evaluations of many state-of-the-art calibration approaches on standard datasets. We find that many calibration approaches, such as label smoothing and mixup, lower the utility of a DNN by degrading its refinement. Even under natural data shift, this calibration-refinement trade-off holds for the majority of calibration methods. These findings call for an urgent retrospective into some popular pathways taken for modern DNN calibration.
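
The decomposition discussed above starts from the standard binned ECE. A minimal sketch of that quantity (15 equal-width confidence bins, random stand-in predictions) might look like this:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Standard binned ECE: average |accuracy - confidence| over equally
    spaced confidence bins, weighted by the fraction of samples per bin."""
    conf = probs.max(axis=1)            # predicted confidence
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # bin weight = fraction of samples
    return ece

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=1000)  # stand-in softmax outputs
labels = rng.integers(0, 10, size=1000)
print(expected_calibration_error(probs, labels))
```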

【7】 ShuffleBlock: Shuffle to Regularize Deep Convolutional Neural Networks

Authors: Sudhakar Kumawat, Gagan Kanojia, Shanmuganathan Raman
Institutions: Indian Institute of Technology Gandhinagar, Gandhinagar, India
Link: https://arxiv.org/abs/2106.09358
Abstract: Deep neural networks have enormous representational power, which leads them to overfit on most datasets. Thus, regularizing them is important in order to reduce overfitting and enhance their generalization capabilities. Recently, the channel shuffle operation has been introduced for mixing channels in group convolutions in resource-efficient networks in order to reduce memory and computations. This paper studies the operation of channel shuffle as a regularization technique in deep convolutional networks. We show that while random shuffling of channels during training drastically reduces their performance, randomly shuffling small patches between channels significantly improves it. The patches to be shuffled are picked from the same spatial locations in the feature maps, such that a patch, when transferred from one channel to another, acts as structured noise for the latter channel. We call this method "ShuffleBlock". The proposed ShuffleBlock module is easy to implement and improves the performance of several baseline networks on the task of image classification on the CIFAR and ImageNet datasets. It also achieves comparable, and in many cases better, performance than many other regularization methods. We provide several ablation studies on selecting various hyperparameters of the ShuffleBlock module and propose a new scheduling method that further enhances its performance.
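
The regularizer is easy to sketch: during training, pick a random spatial patch and permute it across channels, so the patch from one channel lands in another as structured noise. The patch size, the single-patch choice, and the absence of a shuffle probability are illustrative simplifications.

```python
import torch

def shuffle_block(x, patch=4, training=True):
    """Shuffle one randomly placed spatial patch across channels: the patch
    from channel c moves to channel perm[c] (patch size and the single-patch
    choice are illustrative, not the paper's exact hyperparameters)."""
    if not training:
        return x
    B, C, H, W = x.shape
    top = torch.randint(0, H - patch + 1, (1,)).item()
    left = torch.randint(0, W - patch + 1, (1,)).item()
    perm = torch.randperm(C)
    out = x.clone()
    out[:, :, top:top + patch, left:left + patch] = \
        x[:, perm, top:top + patch, left:left + patch]
    return out

x = torch.randn(8, 64, 32, 32)
y = shuffle_block(x)   # same shape; one 4x4 patch is channel-permuted
```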

【8】 Evaluating the Robustness of Bayesian Neural Networks Against Different Types of Attacks

Authors: Yutian Pang, Sheng Cheng, Jueming Hu, Yongming Liu
Institutions: Arizona State University, Tempe, AZ
Link: https://arxiv.org/abs/2106.09223
Abstract: To evaluate the robustness gain of Bayesian neural networks on image classification tasks, we perform input perturbations and adversarial attacks on state-of-the-art Bayesian neural networks, with a benchmark CNN model as reference. The attacks are selected to simulate signal interference and cyberattacks towards CNN-based machine learning systems. The results show that a Bayesian neural network achieves significantly higher robustness against adversarial attacks generated against a deterministic neural network model, without adversarial training. The Bayesian posterior can act as the safety precursor of ongoing malicious activities. Furthermore, we show that a stochastic classifier placed after a deterministic CNN extractor provides sufficient robustness enhancement, rather than a stochastic feature extractor placed before the classifier. This advises utilizing stochastic layers when building decision-making pipelines within a safety-critical domain.

【9】 Learning Perceptual Manifold of Fonts

Authors: Haoran Xie, Yuki Fujita, Kazunori Miyata
Institutions: Japan Advanced Institute of Science and Technology
Note: 9 pages, 16 figures
Link: https://arxiv.org/abs/2106.09198
Abstract: Along with the rapid development of deep learning techniques in generative models, it is becoming an urgent issue to combine machine intelligence with human intelligence to solve practical applications. Motivated by this methodology, this work aims to adjust machine-generated character fonts with the effort of human workers in a perception study. Although numerous fonts are available online for public usage, it is difficult and challenging to generate and explore a font that meets the preferences of common users. To solve this specific issue, we propose the perceptual manifold of fonts to visualize perceptual adjustment in the latent space of a generative model of fonts. In our framework, we adopt a variational autoencoder network for font generation. Then, we conduct a perceptual study on the fonts generated from the multi-dimensional latent space of the generative model. After obtaining the distribution data of specific preferences, we utilize a manifold learning approach to visualize the font distribution. In contrast to the conventional user interface in our user study, the proposed font-exploring user interface is efficient and helpful for specifying user preferences.

【10】 Insights into Data through Model Behaviour: An Explainability-driven Strategy for Data Auditing for Responsible Computer Vision Applications

Authors: Alexander Wong, Adam Dorfman, Paul McInnis, Hayden Gunraj
Institutions: Department of Systems Design Engineering, University of Waterloo; Waterloo Artificial Intelligence Institute; DarwinAI Corp.
Note: 4 pages
Link: https://arxiv.org/abs/2106.09177
Abstract: In this study, we take a departure and explore an explainability-driven strategy to data auditing, where actionable insights into the data at hand are discovered through the eyes of quantitative explainability on the behaviour of a dummy model prototype when exposed to data. We demonstrate this strategy by auditing two popular medical benchmark datasets, and discover hidden data quality issues that lead deep learning models to make predictions for the wrong reasons. The actionable insights gained from this explainability-driven data auditing strategy are then leveraged to address the discovered issues to enable the creation of high-performing deep learning models with appropriate prediction behaviour. The hope is that such an explainability-driven strategy can be complementary to data-driven strategies to facilitate more responsible development of machine learning algorithms for computer vision applications.

【11】 Scaling-up Diverse Orthogonal Convolutional Networks with a Paraunitary Framework

Authors: Jiahao Su, Wonmin Byeon, Furong Huang
Institutions: University of Maryland, College Park; NVIDIA Research, NVIDIA Corporation
Link: https://arxiv.org/abs/2106.09121
Abstract: Enforcing orthogonality in neural networks is an antidote for gradient vanishing/exploding problems, sensitivity to adversarial perturbation, and bounding generalization errors. However, many previous approaches are heuristic, and the orthogonality of convolutional layers is not systematically studied: some of these designs are not exactly orthogonal, while others only consider standard convolutional layers and propose specific classes of their realizations. To address this problem, we propose a theoretical framework for orthogonal convolutional layers, which establishes the equivalence between various orthogonal convolutional layers in the spatial domain and the paraunitary systems in the spectral domain. Since there exists a complete spectral factorization of paraunitary systems, any orthogonal convolution layer can be parameterized as convolutions of spatial filters. Our framework endows high expressive power to various convolutional layers while maintaining their exact orthogonality. Furthermore, our layers are memory and computationally efficient for deep networks compared to previous designs. Our versatile framework, for the first time, enables the study of architecture designs for deep orthogonal networks, such as choices of skip connection, initialization, stride, and dilation. Consequently, we scale up orthogonal networks to deep architectures, including ResNet, WideResNet, and ShuffleNet, substantially increasing the performance over the traditional shallow orthogonal networks.

【12】 Regularization of Mixture Models for Robust Principal Graph Learning

Authors: Tony Bonnaire, Aurélien Decelle, Nabila Aghanim
Note: 12 pages, 6 figures
Link: https://arxiv.org/abs/2106.09035
Abstract: A regularized version of Mixture Models is proposed to learn a principal graph from a distribution of $D$-dimensional data points. In the particular case of manifold learning for ridge detection, we assume that the underlying manifold can be modeled as a graph structure acting like a topological prior for the Gaussian clusters, turning the problem into a maximum a posteriori estimation. Parameters of the model are iteratively estimated through an Expectation-Maximization procedure, making the learning of the structure computationally efficient with guaranteed convergence for any graph prior in polynomial time. We also embed in the formalism a natural way to make the algorithm robust to outliers of the pattern and heteroscedasticity of the manifold sampling, coherently with the graph structure. The method uses a graph prior given by the minimum spanning tree, which we extend using random sub-samplings of the dataset to take into account cycles that can be observed in the spatial distribution.

【13】 A Multi-task Convolutional Neural Network for Blind Stereoscopic Image Quality Assessment Using Naturalness Analysis

Authors: Salima Bourbia, Ayoub Karine, Aladine Chetouani, Mohammed El Hassouni
Institutions: LRIT, Mohammed V University in Rabat, Rabat, Morocco; Laboratoire PRISME, Université d'Orléans, France; FLSH, Mohammed V University in Rabat, Rabat, Morocco
Link: https://arxiv.org/abs/2106.09303
Abstract: This paper addresses the problem of blind stereoscopic image quality assessment (NR-SIQA) using a new multi-task deep learning based method. In the field of stereoscopic vision, the information is fairly distributed between the left and right views as well as the binocular phenomenon. In this work, we propose to integrate these characteristics to estimate the quality of stereoscopic images without reference through a convolutional neural network. Our method is based on two main tasks: the first task predicts naturalness-analysis-based features adapted to stereo images, while the second task predicts the quality of such images. The former, the so-called auxiliary task, aims to find more robust and relevant features to improve the quality prediction. To do this, we compute naturalness-based features using a Natural Scene Statistics (NSS) model in the complex wavelet domain. It allows capturing the statistical dependency between pairs of the stereoscopic images. Experiments are conducted on the well-known LIVE PHASE I and LIVE PHASE II databases. The results obtained show the relevance of our method when compared with the state of the art. Our code is available online at https://github.com/Bourbia-Salima/multitask-cnn-nrsiqa_2021.

Other (8 papers)

【1】 Visual Correspondence Hallucination: Towards Geometric Reasoning

Authors: Hugo Germain, Vincent Lepetit, Guillaume Bourmaud
Institutions: LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France; IMS, University of Bordeaux, Bordeaux INP, CNRS, Bordeaux, France
Link: https://arxiv.org/abs/2106.09711
Abstract: Given a pair of partially overlapping source and target images and a keypoint in the source image, the keypoint's correspondent in the target image can be either visible, occluded or outside the field of view. Local feature matching methods are only able to identify the correspondent's location when it is visible, while humans can also hallucinate its location when it is occluded or outside the field of view through geometric reasoning. In this paper, we bridge this gap by training a network to output a peaked probability distribution over the correspondent's location, regardless of this correspondent being visible, occluded, or outside the field of view. We experimentally demonstrate that this network is indeed able to hallucinate correspondences on unseen pairs of images. We also apply this network to a camera pose estimation problem and find it is significantly more robust than state-of-the-art local feature matching-based competitors.

【2】 The 2021 Image Similarity Dataset and Challenge

Authors: Matthijs Douze, Giorgos Tolias, Ed Pizzi, Zoë Papakipos, Lowik Chanussot, Filip Radenovic, Tomas Jenicek, Maxim Maximov, Laura Leal-Taixé, Ismail Elezi, Ondřej Chum, Cristian Canton Ferrer
Institutions: Facebook AI; VRG, Czech Technical University in Prague; Technical University Munich
Link: https://arxiv.org/abs/2106.09672
Abstract: This paper introduces a new benchmark for large-scale image similarity detection. This benchmark is used for the Image Similarity Challenge at NeurIPS'21 (ISC2021). The goal is to determine whether a query image is a modified copy of any image in a reference corpus of size 1 million. The benchmark features a variety of image transformations such as automated transformations, hand-crafted image edits and machine-learning based manipulations. This mimics real-life cases appearing in social media, for example for integrity-related problems dealing with misinformation and objectionable content. The strength of the image manipulations, and therefore the difficulty of the benchmark, is calibrated according to the performance of a set of baseline approaches. Both the query and reference set contain a majority of "distractor" images that do not match, which corresponds to a real-life needle-in-haystack setting, and the evaluation metric reflects that. We expect the DISC21 benchmark to promote image copy detection as an important and challenging computer vision task and refresh the state of the art.

【3】 Indian Masked Faces in the Wild Dataset

Authors: Shiksha Mishra, Puspita Majumdar, Richa Singh, Mayank Vatsa
Institutions: IIT Jodhpur, India; IIIT-Delhi, India
Link: https://arxiv.org/abs/2106.09670
Abstract: Due to the COVID-19 pandemic, wearing face masks has become a mandate in public places worldwide. Face masks occlude a significant portion of the facial region. Additionally, people wear different types of masks, from simple ones to ones with graphics and prints. These pose new challenges to face recognition algorithms. Researchers have recently proposed a few masked face datasets for designing algorithms to overcome the challenges of masked face recognition. However, existing datasets lack cultural diversity and collection in unrestricted settings. In a country like India, with its attire diversity, people are not limited to wearing traditional masks but also use clothing like a thin printed cotton towel (locally called a "gamcha"), "stoles", and "handkerchiefs" to cover their faces. In this paper, we present a novel Indian Masked Faces in the Wild (IMFW) dataset which contains images with variations in pose, illumination, resolution, and the variety of masks worn by the subjects. We have also benchmarked the performance of existing face recognition models on the proposed IMFW dataset. Experimental results demonstrate the limitations of existing algorithms in the presence of diverse conditions.

【4】 SIFT Matching by Context Exposed

Authors: Fabio Bellavia
Affiliation: F. Bellavia is with the Department of Mathematics and Computer Science, Università degli Studi di Palermo
Comments: Early paper version
Link: https://arxiv.org/abs/2106.09584
Abstract: This paper investigates how to step up local image descriptor matching by exploiting matching context information. Two main contexts are identified, originating respectively from the descriptor space and from the keypoint space. The former is generally used to design the actual matching strategy, while the latter is used to filter matches according to local spatial consistency. On this basis, a new matching strategy and a novel local spatial filter, named respectively blob matching and Delaunay Triangulation Matching (DTM), are devised. Blob matching provides a general matching framework by merging together several strategies, including pre-filtering as well as many-to-many and symmetric matching, achieving a global improvement upon each individual strategy. DTM alternates between Delaunay triangulation contractions and expansions to figure out and adjust keypoint neighborhood consistency. Experimental evaluation shows that DTM is comparable to or better than the state of the art in terms of matching accuracy and robustness, especially for non-planar scenes. Evaluation is carried out according to a new benchmark devised for analyzing the matching pipeline in terms of correct correspondences on both planar and non-planar scenes, including state-of-the-art methods as well as the common SIFT matching approach for reference. This evaluation can be of assistance for future research in this field.
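
The paper's DTM alternates triangulation contractions and expansions; the sketch below only illustrates the underlying local-spatial-consistency idea in a simplified form, keeping a match when its Delaunay neighbors largely agree across the two images. The `min_shared` ratio is a hypothetical parameter, not taken from the paper.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_consistency_filter(pts1, pts2, min_shared=0.5):
    """Spatial-consistency filter inspired by DTM: keep match i if the
    Delaunay neighbors of its keypoint in image 1 are largely the same
    matches that neighbor its keypoint in image 2.

    pts1, pts2: (N, 2) arrays of matched keypoint coordinates.
    """
    def neighbor_sets(pts):
        tri = Delaunay(pts)
        nbrs = [set() for _ in range(len(pts))]
        for simplex in tri.simplices:        # each triangle links 3 matches
            for i in simplex:
                nbrs[i] |= set(simplex) - {i}
        return nbrs

    n1, n2 = neighbor_sets(pts1), neighbor_sets(pts2)
    keep = np.array([
        len(n1[i] & n2[i]) >= min_shared * max(len(n1[i]), 1)
        for i in range(len(pts1))
    ])
    return keep

# Toy usage: 40 inliers under a similarity transform plus 10 outliers.
# A similarity transform preserves the Delaunay triangulation, so inlier
# neighborhoods agree, while random outliers disrupt them locally.
rng = np.random.default_rng(1)
p1 = rng.uniform(0, 100, (50, 2))
p2 = 0.8 * p1 + 5.0                          # consistent transform
p2[40:] = rng.uniform(0, 100, (10, 2))       # spatially inconsistent matches
print(delaunay_consistency_filter(p1, p2).sum(), "matches kept of", len(p1))
```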

【5】 Scale-Consistent Fusion: from Heterogeneous Local Sampling to Global Immersive Rendering

Authors: Wenpeng Xing, Jie Chen, Zaifeng Yang, Qiang Wang
Affiliation: Z. Yang is with the Department of Electronics and Photonics
Link: https://arxiv.org/abs/2106.09548
Abstract: Image-based geometric modeling and novel view synthesis based on sparse, large-baseline samplings are challenging but important tasks for emerging multimedia applications such as virtual reality and immersive telepresence. Existing methods fail to produce satisfactory results due to the limitations of inferring reliable depth information under such challenging reference conditions. With the popularization of commercial light field (LF) cameras, capturing LF images (LFIs) is as convenient as taking regular photos, and geometry information can be reliably inferred. This inspires us to use a sparse set of LF captures to render high-quality novel views globally. However, fusion of LF captures from multiple angles is challenging due to the scale inconsistency caused by various capture settings. To overcome this challenge, we propose a novel scale-consistent volume rescaling algorithm that robustly aligns the disparity probability volumes (DPVs) among different captures for scale-consistent global geometry fusion. Based on the fused DPV projected onto the target camera frustum, novel learning-based modules (i.e., an attention-guided multi-scale residual fusion module and a disparity-field-guided deep re-regularization module) comprehensively regularize noisy observations from heterogeneous captures for high-quality rendering of novel LFIs. Both quantitative and qualitative experiments on the Stanford Lytro Multi-view LF dataset show that the proposed method significantly outperforms state-of-the-art methods under different experimental settings for disparity inference and LF synthesis.
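
The core of the scale-consistency step is resampling each capture's disparity probability volume onto a common disparity scale. Below is a minimal sketch of that resampling, assuming the relative scale factor is already known; robustly estimating that factor is the hard part of the paper's algorithm, which this sketch omits.

```python
import numpy as np

def rescale_dpv(dpv, disp_samples, scale, target_samples):
    """Resample a disparity probability volume (DPV) onto a common scale.

    dpv:            (D, H, W) probabilities over `disp_samples`.
    disp_samples:   (D,) disparity hypotheses of this capture.
    scale:          this capture's disparity scale relative to the
                    reference capture (assumed known here).
    target_samples: (T,) disparity hypotheses of the reference scale.
    """
    src = disp_samples * scale            # hypotheses in reference units
    D, H, W = dpv.shape
    flat = dpv.reshape(D, -1)
    out = np.empty((len(target_samples), H * W), dtype=dpv.dtype)
    # Linear interpolation along the disparity axis, per pixel.
    for j in range(flat.shape[1]):
        out[:, j] = np.interp(target_samples, src, flat[:, j])
    out /= out.sum(axis=0, keepdims=True) + 1e-8  # renormalize to probabilities
    return out.reshape(len(target_samples), H, W)

# Toy usage: align a capture whose disparities live at half the scale.
d = np.linspace(0, 10, 16)
dpv = np.random.default_rng(0).random((16, 4, 4))
dpv /= dpv.sum(axis=0, keepdims=True)
aligned = rescale_dpv(dpv, d, scale=0.5, target_samples=d)
print(aligned.shape)  # (16, 4, 4), now comparable across captures
```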

【6】 Deep HDR Hallucination for Inverse Tone Mapping

Authors: Demetris Marnerides, Thomas Bashford-Rogers, Kurt Debattista
Comments: None
Link: https://arxiv.org/abs/2106.09486
Abstract: Inverse Tone Mapping (ITM) methods attempt to reconstruct High Dynamic Range (HDR) information from Low Dynamic Range (LDR) image content. The dynamic range of well-exposed areas must be expanded, and any information missing due to over/under-exposure must be recovered (hallucinated). The majority of methods focus on the former and are relatively successful, while most attempts at the latter are not of sufficient quality, even ones based on Convolutional Neural Networks (CNNs). A major factor in the reduced inpainting quality of some works is the choice of loss function. Work based on Generative Adversarial Networks (GANs) shows promising results for image synthesis and LDR inpainting, suggesting that GAN losses can improve inverse tone mapping results. This work presents a GAN-based method that hallucinates missing information from badly exposed areas in LDR images and compares its efficacy with alternative variations. The proposed method is quantitatively competitive with state-of-the-art inverse tone mapping methods, providing good dynamic range expansion for well-exposed areas and plausible hallucinations for saturated and under-exposed areas. A density-based normalisation method targeted at HDR content is also proposed, as well as an HDR data augmentation method targeted at HDR hallucination.
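
As a rough illustration of the problem setup (not the paper's method): well-exposed pixels only need dynamic range expansion, while saturated and under-exposed pixels must be masked out and hallucinated, e.g. by a GAN generator. The thresholds and the simple gamma-based linearization below are illustrative assumptions.

```python
import numpy as np

def exposure_masks(ldr, low=0.05, high=0.95):
    """Soft masks of badly exposed pixels in an LDR image in [0, 1].

    Over-exposed (saturated) and under-exposed regions are where a
    hallucination network must invent content; well-exposed regions
    only need dynamic range expansion.
    """
    lum = ldr.max(axis=-1)                        # per-pixel max over RGB
    over = np.clip((lum - high) / (1.0 - high), 0, 1)
    under = np.clip((low - lum) / low, 0, 1)
    return over, under

def expand_well_exposed(ldr, gamma=2.2, exposure=1.0):
    """Naive inverse tone mapping of well-exposed content: undo display
    gamma to approximate linear radiance, then scale the exposure."""
    return exposure * np.power(np.clip(ldr, 0, 1), gamma)

# Toy usage on a random "image"; a GAN generator would be trained to
# replace the masked regions with plausible HDR content.
ldr = np.random.default_rng(0).random((8, 8, 3)).astype(np.float32)
over, under = exposure_masks(ldr)
hdr_base = expand_well_exposed(ldr)
print(over.mean(), under.mean(), hdr_base.max())
```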

【7】 A Random CNN Sees Objects: One Inductive Bias of CNN and Its Applications

Authors: Yun-Hao Cao, Jianxin Wu
Affiliation: National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Comments: 17 pages, 9 figures, 10 tables
Link: https://arxiv.org/abs/2106.09259
Abstract: This paper starts by revealing a surprising finding: without any learning, a randomly initialized CNN can localize objects surprisingly well. That is, a CNN has an inductive bias to naturally focus on objects, named Tobias ("The object is at sight") in this paper. This empirical inductive bias is further analyzed and successfully applied to self-supervised learning. A CNN is encouraged to learn representations that focus on the foreground object by transforming every image into various versions with different backgrounds, where the foreground/background separation is guided by Tobias. Experimental results show that the proposed Tobias significantly improves downstream tasks, especially object detection. This paper also shows that Tobias yields consistent improvements on training sets of different sizes and is more resilient to changes in image augmentations. Our code will be available at https://github.com/CupidJay/Tobias.
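
A minimal sketch of the Tobias idea, assuming a randomly initialized ResNet-18 trunk and a simple channel-energy aggregation with a per-image mean threshold; the aggregation and threshold are illustrative choices, not necessarily the paper's exact procedure. The resulting mask could then guide background swapping when generating self-supervised training views.

```python
import torch
import torchvision

def tobias_mask(images, threshold=None):
    """Foreground mask from a *randomly initialized* CNN, following the
    observation that untrained CNN activations concentrate on objects.

    images: (B, 3, H, W) float tensor. Returns a (B, H, W) boolean mask.
    """
    # Random init, no training (torchvision >= 0.13; use pretrained=False
    # on older versions). Drop the avgpool/fc head to keep the conv trunk.
    backbone = torchvision.models.resnet18(weights=None)
    features = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()
    with torch.no_grad():
        act = features(images)                          # (B, C, h, w)
        amap = act.pow(2).mean(dim=1, keepdim=True)     # channel energy map
        amap = torch.nn.functional.interpolate(
            amap, size=images.shape[-2:], mode="bilinear", align_corners=False)
    amap = amap.squeeze(1)
    if threshold is None:                # per-image mean as a simple cutoff
        threshold = amap.flatten(1).mean(dim=1).view(-1, 1, 1)
    return amap > threshold

# Toy usage: high-activation pixels are treated as foreground.
imgs = torch.rand(2, 3, 96, 96)
mask = tobias_mask(imgs)
print(mask.shape, mask.float().mean().item())
```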

【8】 Federated CycleGAN for Privacy-Preserving Image-to-Image Translation

Authors: Joonyoung Song, Jong Chul Ye
Affiliation: Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
Link: https://arxiv.org/abs/2106.09246
Abstract: Unsupervised image-to-image translation methods such as CycleGAN learn to convert images from one domain to another using unpaired training datasets from different domains. Unfortunately, these approaches still require centrally collected unpaired records, potentially violating privacy and security. Although recent federated learning (FL) allows a neural network to be trained without data exchange, the basic assumption of FL is that all clients have their own training data from a similar domain, which differs from our image-to-image translation scenario, in which each client has images from its unique domain and the goal is to learn image translation between different domains without accessing the target domain data. To address this, here we propose a novel federated CycleGAN architecture that can learn image translation in an unsupervised manner while maintaining data privacy. Specifically, our approach arises from a novel observation that the CycleGAN loss can be decomposed into a sum of client-specific local objectives, each of which can be evaluated using only that client's data. This local objective decomposition allows multiple clients to participate in federated CycleGAN training without sacrificing performance. Furthermore, our method employs novel switchable generator and discriminator architectures using Adaptive Instance Normalization (AdaIN) that significantly reduce the bandwidth requirement of federated learning. Our experimental results on various unsupervised image translation tasks show that our federated CycleGAN provides performance comparable to its non-federated counterpart.
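
To see why the loss decomposes, consider the client holding domain X: it can evaluate the adversarial term for the X-to-Y generator and the X-to-Y-to-X cycle term using only its own images, while the Y client symmetrically handles the real-sample discriminator term and the reverse cycle. A toy sketch under standard LSGAN/CycleGAN losses, with 1x1 convolutions standing in for the paper's actual AdaIN-based switchable networks:

```python
import torch
import torch.nn.functional as F

def client_x_local_loss(real_x, G_xy, G_yx, D_y, lambda_cyc=10.0):
    """Local CycleGAN objective evaluable with domain-X data only: the
    adversarial term for G_xy (fooling the Y-discriminator on translated
    images) plus X -> Y -> X cycle consistency, never touching real Y.
    """
    fake_y = G_xy(real_x)                        # translate X -> Y
    pred = D_y(fake_y)
    adv = F.mse_loss(pred, torch.ones_like(pred))  # LSGAN generator term
    cyc = F.l1_loss(G_yx(fake_y), real_x)          # reconstruct X
    return adv + lambda_cyc * cyc                  # lambda_cyc = 10 as in CycleGAN

# Toy usage with 1x1-conv stand-ins for the generators/discriminator.
G_xy = torch.nn.Conv2d(3, 3, 1)
G_yx = torch.nn.Conv2d(3, 3, 1)
D_y = torch.nn.Conv2d(3, 1, 1)
x = torch.rand(2, 3, 32, 32)
print(client_x_local_loss(x, G_xy, G_yx, D_y).item())
```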
