
Computer Vision and Pattern Recognition arXiv Digest [12.21]

By: arXiv每日学术速递 (WeChat Official Account)
Published: 2021-12-24 08:49:19
Column: arXiv每日学术速递

cs.CV: 104 papers today

Transformer (5 papers)

【1】 StyleSwin: Transformer-based GAN for High-resolution Image Generation
Link: https://arxiv.org/abs/2112.10762

Authors: Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, Baining Guo
Affiliations: University of Science and Technology of China, Microsoft Research Asia
Abstract: Despite the tantalizing success in a broad range of vision tasks, transformers have not yet demonstrated on-par ability as ConvNets in high-resolution image generative modeling. In this paper, we seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis. To this end, we believe that local attention is crucial to strike the balance between computational efficiency and modeling capacity. Hence, the proposed generator adopts Swin transformer in a style-based architecture. To achieve a larger receptive field, we propose double attention which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Moreover, we show that offering the knowledge of the absolute position that has been lost in window-based transformers greatly benefits the generation quality. The proposed StyleSwin is scalable to high resolutions, with both the coarse geometry and fine structures benefiting from the strong expressivity of transformers. However, blocking artifacts occur during high-resolution synthesis because performing the local attention in a block-wise manner may break the spatial coherency. To solve this, we empirically investigate various solutions, among which we find that employing a wavelet discriminator to examine the spectral discrepancy effectively suppresses the artifacts. Extensive experiments show the superiority over prior transformer-based GANs, especially on high resolutions, e.g., 1024x1024. The StyleSwin, without complex training strategies, excels over StyleGAN on CelebA-HQ 1024, and achieves on-par performance on FFHQ-1024, proving the promise of using transformers for high-resolution image generation. The code and models will be available at https://github.com/microsoft/StyleSwin.
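
The double attention described in the abstract lends itself to a compact sketch. The PyTorch snippet below is a hypothetical illustration, not the authors' code: half of the channels attend within regular windows, the other half within shifted windows, and the two halves are fused. The Swin-style boundary attention mask and relative position bias are omitted for brevity, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

def window_partition(x, ws, shift=0):
    # (B, H, W, C) -> (B * num_windows, ws*ws, C); optional cyclic shift as in Swin.
    if shift:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    B, H, W, C = x.shape
    x = x.reshape(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, ws * ws, C)

def window_reverse(w, ws, H, W, shift=0):
    # Inverse of window_partition.
    B = w.shape[0] // ((H // ws) * (W // ws))
    x = w.reshape(B, H // ws, W // ws, ws, ws, -1).permute(0, 1, 3, 2, 4, 5)
    x = x.reshape(B, H, W, -1)
    if shift:
        x = torch.roll(x, shifts=(shift, shift), dims=(1, 2))
    return x

class DoubleAttention(nn.Module):
    """Sketch: half the channels/heads attend within regular windows, the
    other half within shifted windows, enlarging the receptive field per
    block. Assumes C even and H, W divisible by the window size ws."""
    def __init__(self, dim, ws=8, heads=8):
        super().__init__()
        self.ws = ws
        self.local = nn.MultiheadAttention(dim // 2, heads // 2, batch_first=True)
        self.shifted = nn.MultiheadAttention(dim // 2, heads // 2, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, H, W, C)
        B, H, W, C = x.shape
        xl = window_partition(x[..., : C // 2], self.ws)
        xs = window_partition(x[..., C // 2 :], self.ws, shift=self.ws // 2)
        yl, _ = self.local(xl, xl, xl)         # attention inside each window
        ys, _ = self.shifted(xs, xs, xs)
        yl = window_reverse(yl, self.ws, H, W)
        ys = window_reverse(ys, self.ws, H, W, shift=self.ws // 2)
        return self.proj(torch.cat([yl, ys], dim=-1))
```

The design point is that both window layouts are processed in a single block, so each layer already mixes information across window boundaries without doubling depth.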

【2】 On Efficient Transformer and Image Pre-training for Low-level Vision
Link: https://arxiv.org/abs/2112.10175

Authors: Wenbo Li, Xin Lu, Jiangbo Lu, Xiangyu Zhang, Jiaya Jia
Affiliations: The Chinese University of Hong Kong, Megvii Technology, SmartMore Technology
Abstract: Pre-training has marked numerous states of the art in high-level computer vision, but few attempts have ever been made to investigate how pre-training acts in image processing systems. In this paper, we present an in-depth study of image pre-training. To conduct this study on solid ground with practical value in mind, we first propose a generic, cost-effective Transformer-based framework for image processing. It yields highly competitive performance across a range of low-level tasks, though under constrained parameters and computational complexity. Then, based on this framework, we design a whole set of principled evaluation tools to seriously and comprehensively diagnose image pre-training in different tasks, and uncover its effects on internal network representations. We find pre-training plays strikingly different roles in low-level tasks. For example, pre-training introduces more local information to higher layers in super-resolution (SR), yielding significant performance gains, while pre-training hardly affects internal feature representations in denoising, resulting in a little gain. Further, we explore different methods of pre-training, revealing that multi-task pre-training is more effective and data-efficient. All codes and models will be released at https://github.com/fenglinglwb/EDT.

【3】 LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach
Link: https://arxiv.org/abs/2112.10066

Authors: Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hiroya Takamura, Qi Wu
Affiliations: University of Adelaide, Australian Institute for Machine Learning; National Institute of Advanced Industrial Science and Technology (AIST); The University of Tokyo; A*STAR, Singapore
Abstract: We propose LocFormer, a Transformer-based model for video grounding which operates at a constant memory footprint regardless of the video length, i.e. number of frames. LocFormer is designed for tasks where it is necessary to process the entire long video and at its core lie two main contributions. First, our model incorporates a new sampling technique that splits the input feature sequence into a fixed number of sections and selects a single feature per section using a stochastic approach, which allows us to obtain a feature sample set that is representative of the video content for the task at hand while keeping the memory footprint constant. Second, we propose a modular design that separates functionality, enabling us to learn an inductive bias via supervising the self-attention heads, while also effectively leveraging pre-trained text and video encoders. We test our proposals on relevant benchmark datasets for video grounding, showing not only that LocFormer can achieve excellent results including state-of-the-art performance on YouCookII, but also that our sampling technique is more effective than competing counterparts and that it consistently improves the performance of prior work, by up to 3.13% in the mean temporal IoU, ultimately leading to a new state-of-the-art performance on Charades-STA.
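
The constant-memory sampling step can be sketched directly. This is a minimal reading of the abstract rather than the released code, assuming features arrive as a (T, D) tensor with T >= num_sections:

```python
import torch

def sample_features(feats, num_sections):
    """Split a (T, D) feature sequence into a fixed number of contiguous
    sections and draw one random feature from each, so the sample size
    (and hence memory) is constant in the video length T.
    Assumes T >= num_sections."""
    T = feats.shape[0]
    bounds = torch.linspace(0, T, num_sections + 1).long()  # section edges
    idx = [int(torch.randint(int(lo), int(hi), (1,)))
           for lo, hi in zip(bounds[:-1], bounds[1:])]
    return feats[idx]                          # (num_sections, D)
```

Because each draw is independent per section, re-sampling across training epochs also acts as a form of temporal data augmentation.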

【4】 Pre-Training Transformers for Domain Adaptation
Link: https://arxiv.org/abs/2112.09965

Authors: Burhan Ul Tayyab, Nicholas Chua
Affiliations: Shirley Robotics
Abstract: The Visual Domain Adaptation Challenge 2021 called for unsupervised domain adaptation methods that could improve the performance of models by transferring the knowledge obtained from source datasets to out-of-distribution target datasets. In this paper, we utilize BeiT [1] and demonstrate its capability of capturing key attributes from source datasets and apply it to target datasets in a semi-supervised manner. Our method was able to outperform current state-of-the-art (SoTA) techniques and was able to achieve 1st place on the ViSDA Domain Adaptation Challenge with ACC of 56.29% and AUROC of 69.79%.

【5】 A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation
Link: https://arxiv.org/abs/2112.09747

Authors: Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou
Affiliations: University of Texas at Austin, Google
Abstract: This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers recently demonstrate competitive performance in image classification tasks. To adopt ViT to object detection and dense prediction tasks, many works inherit the multistage design from convolutional networks and highly customized ViT architectures. Behind this design, the goal is to pursue a better trade-off between computational cost and effective aggregation of multiscale global contexts. However, existing works adopt the multistage architectural design as a black-box solution without a clear understanding of its true benefits. In this paper, we comprehensively study three architecture design choices on ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy. We further complete a scaling rule to optimize our model's trade-off on accuracy and computation cost / model size. By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong performance on COCO object detection and instance segmentation tasks.

Detection (10 papers)

【1】 A novel attention-based network for fast salient object detection
Link: https://arxiv.org/abs/2112.10481

Authors: Bin Zhang, Yang Wu, Xiaojing Zhang, Ming Ma
Affiliations: College of Computer Science and Engineering, Inner Mongolia University, Hohhot, China
Abstract: In current salient object detection networks, the most popular method is using a U-shape structure. However, the massive number of parameters leads to more consumption of computing and storage resources, which is not feasible to deploy on limited-memory devices. Shallower networks will not maintain the same accuracy as the U-shape structure, while deeper network structures with more parameters will not converge to a global minimum loss quickly. To overcome all of these disadvantages, we proposed a new deep convolution network architecture with three contributions: (1) using smaller convolution neural networks (CNNs) to compress the model in our improved salient object features compression and reinforcement extraction module (ISFCREM) to reduce the parameters of the model; (2) introducing a channel attention mechanism in ISFCREM to weigh different channels for improving the ability of feature representation; (3) applying a new optimizer to accumulate the long-term gradient information during training to adaptively tune the learning rate. The results demonstrate that the proposed method can compress the model to 1/3 of the original size with nearly no loss of accuracy, converging faster and more smoothly on six widely used salient object detection datasets compared with other models. Our code is published at https://gitee.com/binzhangbinzhangbin/code-a-novel-attention-based-network-for-fast-salient-object-detection.git
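
The channel attention component can be illustrated with a standard squeeze-and-excitation block. The exact internals of ISFCREM are not reproduced here, so this is a plausible stand-in rather than the paper's module:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel weighting: pool each channel to
    a scalar, pass through a small bottleneck MLP, and rescale channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))      # global average pool -> (B, C)
        return x * w[:, :, None, None]       # reweight each channel
```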

【2】 UFPMP-Det: Toward Accurate and Efficient Object Detection on Drone Imagery
Link: https://arxiv.org/abs/2112.10415

Authors: Yecheng Huang, Jiaxin Chen, Di Huang
Affiliations: State Key Laboratory of Software Development Environment, Beihang University, Beijing, China; School of Computer Science and Engineering, Beihang University, Beijing, China
Note: 8 pages, 6 figures
Abstract: This paper proposes a novel approach to object detection on drone imagery, namely Multi-Proxy Detection Network with Unified Foreground Packing (UFPMP-Det). To deal with the numerous instances of very small scales, different from the common solution that divides the high-resolution input image into quite a number of chips with low foreground ratios to perform detection on them each, the Unified Foreground Packing (UFP) module is designed, where the sub-regions given by a coarse detector are initially merged through clustering to suppress background and the resulting ones are subsequently packed into a mosaic for a single inference, thus significantly reducing overall time cost. Furthermore, to address the more serious confusion between inter-class similarities and intra-class variations of instances, which deteriorates detection performance but is rarely discussed, the Multi-Proxy Detection Network (MP-Det) is presented to model object distributions in a fine-grained manner by employing multiple proxy learning, and the proxies are enforced to be diverse by minimizing a Bag-of-Instance-Words (BoIW) guided optimal transport loss. By such means, UFPMP-Det largely promotes both the detection accuracy and efficiency. Extensive experiments are carried out on the widely used VisDrone and UAVDT datasets, and UFPMP-Det reports new state-of-the-art scores at a much higher speed, highlighting its advantages.

【3】 Product Re-identification System in Fully Automated Defect Detection
Link: https://arxiv.org/abs/2112.10324

Authors: Chenggui Sun, Li Bin Song
Abstract: In this work, we introduce a method and present an improved neural network to perform product re-identification, which is an essential core function of a fully automated product defect detection system. Our method is based on feature distance. It is the combination of feature extraction neural networks, such as VGG16 and AlexNet, with an image search engine, Vearch. The dataset that we used to develop product re-identification systems is a water-bottle dataset that consists of 400 images of 18 types of water bottles. This is a small dataset, which was the biggest challenge of our work. However, the combination of neural networks with Vearch shows potential to tackle product re-identification problems. In particular, our new neural network, AlphaAlexNet, which improves on AlexNet, could raise the product identification accuracy by four percent. This indicates that an ideal product identification accuracy could be achieved when efficient feature extraction methods are introduced and redesigned for image feature extraction of nearly identical products. To address the biggest challenges caused by the small size of the dataset and the difficulty of identifying products that differ little from each other, in our future work we propose a new roadmap to tackle nearly-identical product identification: to introduce or develop new algorithms that need very few images to train themselves.
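
A feature-distance pipeline of this kind can be sketched in a few lines, under stated assumptions: a torchvision VGG16 backbone embeds product images, and brute-force cosine ranking stands in for the Vearch index used in the paper:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# ImageNet-pretrained VGG16 convolutional trunk as the feature extractor
# (torchvision >= 0.13 weights API assumed).
vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()

def embed(img):
    """img: (1, 3, 224, 224), ImageNet-normalized -> L2-normalized feature."""
    with torch.no_grad():
        f = vgg(img).flatten(1)              # (1, 25088)
    return F.normalize(f, dim=1)

def rank(query, gallery):
    """gallery: (N, 25088) of embed() outputs; returns indices, nearest first."""
    return (1 - query @ gallery.T).squeeze(0).argsort()
```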

【4】 Driver Drowsiness Detection Using Ensemble Convolutional Neural Networks on YawDD
Link: https://arxiv.org/abs/2112.10298

Authors: Rais Mohammad Salman, Mahbubur Rashid, Rupal Roy, Md Manjurul Ahsan, Zahed Siddique
Affiliations: Mechatronics Engineering, International Islamic University Malaysia, Kuala Lumpur, Malaysia; Industrial and Systems Engineering, University of Oklahoma, Norman, Oklahoma; School of Aerospace and Mechanical Engineering, University of Memphis, Tennessee, USA
Abstract: Driver drowsiness detection using videos/images is one of the most essential areas in today's time for driver safety. The development of deep learning techniques, notably Convolutional Neural Networks (CNN), applied in computer vision applications such as drowsiness detection, has shown promising results due to the tremendous increase in technology in the recent few decades. Eyes that are closed or blinking excessively, yawning, nodding, and occlusion are all key aspects of drowsiness. In this work, we have applied four different Convolutional Neural Network (CNN) techniques on the YawDD dataset to detect and examine the extent of drowsiness depending on the yawning frequency with specific pose and occlusion variation. Preliminary computational results show that our proposed Ensemble Convolutional Neural Network (ECNN) outperformed the traditional CNN-based approach by achieving an F1 score of 0.935, whereas the other three CNNs, namely the CNN1, CNN2, and CNN3 approaches, gained 0.92, 0.90, and 0.912 F1 scores, respectively.
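
The abstract does not spell out how the ensemble combines its members; one plausible reading is soft voting over the member CNNs, sketched below (the member models themselves are assumptions):

```python
import torch

def ensemble_predict(models, x):
    """Soft-voting ensemble: average the softmax outputs of the member CNNs
    (each returning logits of shape (B, num_classes)) and take the argmax."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=1) for m in models])
    return probs.mean(dim=0).argmax(dim=1)   # (B,) predicted class per sample
```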

【5】 Parallel Multi-Scale Networks with Deep Supervision for Hand Keypoint Detection
Link: https://arxiv.org/abs/2112.10275

Authors: Renjie Li, Son Tran, Saurabh Garg, Katherine Lawler, Jane Alty, Quan Bai
Abstract: Keypoint detection plays an important role in a wide range of applications. However, predicting keypoints of small objects such as human hands is a challenging problem. Recent works fuse feature maps of deep Convolutional Neural Networks (CNNs), either via multi-level feature integration or multi-resolution aggregation. Despite achieving some success, the feature fusion approaches increase the complexity and the opacity of CNNs. To address this issue, we propose a novel CNN model named Multi-Scale Deep Supervision Network (P-MSDSNet) that learns feature maps at different scales with deep supervisions to produce attention maps for adaptive feature propagation from layers to layers. P-MSDSNet has a multi-stage architecture which makes it scalable while its deep supervision with spatial attention improves transparency to the feature learning at each stage. We show that P-MSDSNet outperforms the state-of-the-art approaches on benchmark datasets while requiring fewer parameters. We also show the application of P-MSDSNet to quantify finger tapping hand movements in a neuroscience study.

【6】 Deep Graph-level Anomaly Detection by Glocal Knowledge Distillation
Link: https://arxiv.org/abs/2112.10063

Authors: Rongrong Ma, Guansong Pang, Ling Chen, Anton van den Hengel
Affiliations: Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, Australia; School of Computing and Information Systems, Singapore Management University; Australian Institute for Machine Learning, The University of Adelaide, Adelaide, Australia
Note: Accepted to WSDM 2022
Abstract: Graph-level anomaly detection (GAD) describes the problem of detecting graphs that are abnormal in their structure and/or the features of their nodes, as compared to other graphs. One of the challenges in GAD is to devise graph representations that enable the detection of both locally- and globally-anomalous graphs, i.e., graphs that are abnormal in their fine-grained (node-level) or holistic (graph-level) properties, respectively. To tackle this challenge we introduce a novel deep anomaly detection approach for GAD that learns rich global and local normal pattern information by joint random distillation of graph and node representations. The random distillation is achieved by training one GNN to predict another GNN with randomly initialized network weights. Extensive experiments on 16 real-world graph datasets from diverse domains show that our model significantly outperforms seven state-of-the-art models. Code and datasets are available at https://git.io/GLocalKD.
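
Random distillation reduces to training one network to regress the outputs of a frozen, randomly initialized twin, with the prediction error serving as the anomaly score at test time. A toy sketch, with MLPs standing in for the paper's GNN encoders:

```python
import torch
import torch.nn as nn

def make_encoder(in_dim, out_dim):
    # Stand-in for a GNN encoder; dimensions are assumptions.
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

target = make_encoder(32, 16)
for p in target.parameters():            # frozen random target network
    p.requires_grad_(False)
student = make_encoder(32, 16)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(128, 32)                 # stand-in graph-level features
for _ in range(100):                     # distill on normal training graphs
    loss = ((student(x) - target(x)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# At test time, graphs the student predicts poorly are flagged as anomalous.
anomaly_score = ((student(x) - target(x)) ** 2).mean(dim=1)
```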

【7】 Rapid Face Mask Detection and Person Identification Model based on Deep Neural Networks
Link: https://arxiv.org/abs/2112.09951

Authors: Abdullah Ahmad Khan, Mohd. Belal, Ghufran Ullah
Affiliations: Department of Computer Science, Aligarh Muslim University, Uttar Pradesh, India
Note: 12 pages, 15 figures, International Conference
Abstract: As Covid-19 has been constantly mutating, a new variant gets introduced to us every three or four months, and it brings more deadly problems. The things that prevent us from getting Covid are getting vaccinated and wearing a face mask. In this paper, we have implemented a new face mask detection and person recognition model based on InsightFace, which uses the ArcFace loss, a SoftMax-based classification algorithm, and name it RFMPI-DNN (Rapid Face Mask Detection and Person Identification Model based on Deep Neural Networks) to detect face masks and person identity rapidly as compared to other models available. To compare our new model, we have used the previous MobileNet_V2 model and a face recognition module for an effective comparison on the basis of time. The proposed model implemented in the system has outperformed the model compared in this paper in every aspect.
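
The ArcFace (additive angular margin) loss the paper builds on can be written compactly. The scale and margin defaults below are common choices from the ArcFace literature, not values from this paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin loss: add margin m to the angle of the target
    class before the scaled softmax, sharpening inter-class separation."""
    def __init__(self, feat_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, feats, labels):
        cos = F.linear(F.normalize(feats), F.normalize(self.W))  # cosine logits
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.shape[1]).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)
```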

【8】 Enhanced Object Detection in Floor-plan through Super Resolution
Link: https://arxiv.org/abs/2112.09844

Authors: Dev Khare, N S Kamal, Barathi Ganesh HB, V Sowmya, V V Sajith Variyar
Affiliations: Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India; RBG.AI, Resilience Business Grids LLP, SREC Incubation Center, Coimbatore, Tamil Nadu, India
Note: 3rd International Conference on Machine Learning, Image Processing, Network Security and Data Sciences
Abstract: Building Information Modelling (BIM) software uses scalable vector formats to enable flexible designing of floor plans in the industry. Floor plans in the architectural domain can come from many sources that may or may not be in scalable vector format. The conversion of floor plan images to fully annotated vector images is a process that can now be realized by computer vision. Novel datasets in this field have been used to train Convolutional Neural Network (CNN) architectures for object detection. Image enhancement through Super-Resolution (SR) is also an established CNN-based network in computer vision that is used for converting low resolution images to high resolution ones. This work focuses on creating a multi-component module that stacks a SR model on a floor plan object detection model. The proposed stacked model shows greater performance than the corresponding vanilla object detection model. For the best case, the inclusion of SR showed an improvement of 39.47% in object detection over the vanilla network. Data and code are made publicly available at https://github.com/rbg-research/Floor-Plan-Detection.

【9】 Query Adaptive Few-Shot Object Detection with Heterogeneous Graph Convolutional Networks
Link: https://arxiv.org/abs/2112.09791

Authors: Guangxing Han, Yicheng He, Shiyuan Huang, Jiawei Ma, Shih-Fu Chang
Affiliations: Columbia University
Note: ICCV 2021
Abstract: Few-shot object detection (FSOD) aims to detect never-seen objects using few examples. This field sees recent improvement owing to meta-learning techniques, which learn how to match between the query image and few-shot class examples such that the learned model can generalize to few-shot novel classes. However, currently, most of the meta-learning-based methods perform pairwise matching between query image regions (usually proposals) and novel classes separately, therefore failing to take into account multiple relationships among them. In this paper, we propose a novel FSOD model using heterogeneous graph convolutional networks. Through efficient message passing among all the proposal and class nodes with three different types of edges, we could obtain context-aware proposal features and query-adaptive, multiclass-enhanced prototype representations for each class, which could help promote the pairwise matching and improve final FSOD accuracy. Extensive experimental results show that our proposed model, denoted as QA-FewDet, outperforms the current state-of-the-art approaches on the PASCAL VOC and MSCOCO FSOD benchmarks under different shots and evaluation metrics.

【10】 A Deep Learning Based Workflow for Detection of Lung Nodules With Chest Radiograph
Link: https://arxiv.org/abs/2112.10184

Authors: Yang Tai
Affiliations: Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan (ROC)
Abstract: PURPOSE: This study aimed to develop a deep learning-based tool to detect and localize lung nodules with chest radiographs (CXRs). We expected it to enhance the efficiency of interpreting CXRs and reduce the possibility of delayed diagnosis of lung cancer. MATERIALS AND METHODS: We collected CXRs from the NCKUH database and VBD, an open-source medical image dataset, as our training and validation data. A number of CXRs from the Ministry of Health and Welfare (MOHW) database served as our test data. We built a segmentation model to identify lung areas from CXRs, and sliced them into 16 patches. Physicians labeled the CXRs by clicking the patches. These labeled patches were then used to train and fine-tune a deep neural network (DNN) model, classifying the patches as positive or negative. Finally, we tested the DNN model with the lung patches of CXRs from MOHW. RESULTS: Our segmentation model identified the lung regions well from the whole CXR. The Intersection over Union (IoU) between the ground truth and the segmentation result was 0.9228. In addition, our DNN model achieved a sensitivity of 0.81, specificity of 0.82, and AUROC of 0.869 in 98 of 125 cases. For the other 27 difficult cases, the sensitivity was 0.54, specificity 0.494, and AUROC 0.682. Overall, we obtained a sensitivity of 0.78, specificity of 0.79, and AUROC of 0.837. CONCLUSIONS: Our two-step workflow is comparable to state-of-the-art algorithms in the sensitivity and specificity of localizing lung nodules from CXRs. Notably, our workflow provides an efficient way for specialists to label the data, which is valuable for related research because of the relative rarity of labeled medical image data.
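
The patch-slicing step is simple to sketch. The 4x4 grid layout below is an assumption consistent with the 16 patches mentioned in the abstract:

```python
import numpy as np

def slice_patches(lung, rows=4, cols=4):
    """Cut a cropped lung region (H, W[, C]) into a rows x cols grid of
    patches for per-patch labeling and classification."""
    H, W = lung.shape[:2]
    hs, ws = H // rows, W // cols
    return [lung[r * hs:(r + 1) * hs, c * ws:(c + 1) * ws]
            for r in range(rows) for c in range(cols)]
```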

Classification & Recognition (11 papers)

【1】 Image-free multi-character recognition
Link: https://arxiv.org/abs/2112.10587

Authors: Huayi Wang, Chunli Zhu, Liheng Bian
Affiliations: School of Information and Electronics & Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing
Note: 17 pages, 4 figures
Abstract: The recently developed image-free sensing technique maintains the advantages of both light hardware and software, and has been applied in simple target classification and motion tracking. In practical applications, however, there usually exist multiple targets in the field of view, where existing trials fail to produce multi-semantic information. In this letter, we report a novel image-free sensing technique to tackle the multi-target recognition challenge for the first time. Different from the convolutional layer stack of image-free single-pixel networks, the reported CRNN network utilizes the bidirectional LSTM architecture to predict the distribution of multiple characters simultaneously. The framework enables capturing long-range dependencies, providing a high recognition accuracy for multiple characters. We demonstrated the technique's effectiveness in license plate detection, achieving 87.60% recognition accuracy at a 5% sampling rate with a higher than 100 FPS refresh rate.

【2】 Dynamic Hypergraph Convolutional Networks for Skeleton-Based Action Recognition
Link: https://arxiv.org/abs/2112.10570

Authors: Jinfeng Wei, Yunxin Wang, Mengli Guo, Pei Lv, Xiaoshan Yang, Mingliang Xu
Note: 12 pages, 6 figures
Abstract: Graph convolutional network (GCN) based methods have achieved advanced performance on skeleton-based action recognition tasks. However, the skeleton graph cannot fully represent the motion information contained in skeleton data. In addition, the topology of the skeleton graph in GCN-based methods is manually set according to natural connections, and it is fixed for all samples, which cannot adapt well to different situations. In this work, we propose a novel dynamic hypergraph convolutional network (DHGCN) for skeleton-based action recognition. DHGCN uses a hypergraph to represent the skeleton structure to effectively exploit the motion information contained in human joints. Each joint in the skeleton hypergraph is dynamically assigned a corresponding weight according to its movement, and the hypergraph topology in our model can be dynamically adjusted to different samples according to the relationship between the joints. Experimental results demonstrate that our model achieves competitive performance on three datasets: Kinetics-Skeleton 400, NTU RGB+D 60, and NTU RGB+D 120.

【3】 Object Recognition as Classification of Visual Properties
Link: https://arxiv.org/abs/2112.10531

Authors: Fausto Giunchiglia, Mayukh Bagchi
Abstract: We base our work on the teleosemantic modelling of concepts as abilities implementing the distinct functions of recognition and classification. Accordingly, we model two types of concepts - substance concepts suited for object recognition exploiting visual properties, and classification concepts suited for classification of substance concepts exploiting linguistically grounded properties. The goal in this paper is to demonstrate that object recognition can be construed as classification of visual properties, as distinct from work in mainstream computer vision. Towards that, we present an object recognition process based on Ranganathan's four-phased faceted knowledge organization process, grounded in the teleosemantic distinctions of substance concept and classification concept. We also briefly introduce the ongoing project MultiMedia UKC, whose aim is to build an object recognition resource following our proposed process.

【4】 Evaluation and Comparison of Deep Learning Methods for Pavement Crack Identification with Visual Images
Link: https://arxiv.org/abs/2112.10390

Authors: Kai-Liang Lu
Affiliations: Jiangsu Automation Research Institute Shanghai Branch, Shengxia Road, Pudong, Shanghai, P.R. China
Note: This work will be submitted for possible publication. It is a further study from arXiv:2012.14704v2
Abstract: Compared with contact detection techniques, pavement crack identification with visual images via deep learning algorithms has the advantages of not being limited by the material of the object to be detected, fast speed and low cost. The fundamental frameworks and typical model architectures of transfer learning (TL), encoder-decoder (ED), generative adversarial networks (GAN), and their common modules were first reviewed, and then the evolution of convolutional neural network (CNN) backbone models and GAN models was summarized. The crack classification, segmentation performance, and effect were tested on the SDNET2018 and CFD public data sets. In the aspect of patch sample classification, the fine-tuned TL models can be equivalent to or even slightly better than the ED models in accuracy, with faster prediction time; in the aspect of accurate crack location, both ED and GAN algorithms can achieve pixel-level segmentation and are expected to detect cracks in real time on low computing power platforms. Furthermore, a weakly supervised learning framework combining TL and SSGAN and its performance enhancement measures are proposed, which can maintain crack identification performance comparable to that of supervised learning, while greatly reducing the number of labeled samples required.

【5】 Model-based gait recognition using graph network on very large population database
Link: https://arxiv.org/abs/2112.10305

Authors: Zhihao Wang, Chaoying Tang
Affiliations: College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
Abstract: At present, existing gait recognition systems focus on developing methods to extract robust gait features from silhouette images, and they have indeed achieved great success. However, gait can be sensitive to appearance features such as clothing and carried items. Compared with appearance-based methods, model-based gait recognition is promising due to its robustness against these variations. In recent years, with the development of human pose estimation, the difficulty of model-based gait recognition methods has been mitigated. In this paper, to resist the increase of subjects and view variations, local features are built and a siamese network is proposed to maximize the distance of samples from the same subject. We leverage recent advances in action recognition to embed a human pose sequence into a vector and introduce Spatial-Temporal Graph Convolution Blocks (STGCB), which have been commonly used in action recognition, for gait recognition. Experiments on the very large population dataset named OUMVLP-Pose and the popular dataset CASIA-B show that our method achieves some state-of-the-art (SOTA) performance in model-based gait recognition. The code and models of our method will be available at https://github.com/timelessnaive/Gait-for-Large-Dataset after acceptance.

【6】 Camera-aware Style Separation and Contrastive Learning for Unsupervised Person Re-identification
Link: https://arxiv.org/abs/2112.10089

Authors: Xue Li, Tengfei Liang, Yi Jin, Tao Wang, Yidong Li
Affiliations: School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
Note: 6 pages, 4 figures, 2 tables
Abstract: Unsupervised person re-identification (ReID) is a challenging task without data annotation to guide discriminative learning. Existing methods attempt to solve this problem by clustering extracted embeddings to generate pseudo labels. However, most methods ignore the intra-class gap caused by camera style variance, and some methods are relatively complex and indirect although they try to solve the negative impact of the camera style on feature distribution. To solve this problem, we propose a camera-aware style separation and contrastive learning method (CA-UReID), which directly separates camera styles in the feature space with the designed camera-aware attention module. It can explicitly divide the learnable feature into camera-specific and camera-agnostic parts, reducing the influence of different cameras. Moreover, to further narrow the gap across cameras, we design a camera-aware contrastive center loss to learn more discriminative embedding for each identity. Extensive experiments demonstrate the superiority of our method over the state-of-the-art methods on the unsupervised person ReID task.

【7】 Precondition and Effect Reasoning for Action Recognition
Link: https://arxiv.org/abs/2112.10057

Authors: Yoo Hongsang, Li Haopeng, Ke Qiuhong, Liu Liangchen, Zhang Rui
Affiliations: School of Computing and Information Systems, University of Melbourne, Australia; School of Information Science and Technology, Tsinghua University
Abstract: Human action recognition has drawn a lot of attention in recent years due to its research and application significance. Most existing works on action recognition focus on learning effective spatial-temporal features from videos, but neglect the strong causal relationship among the precondition, action and effect. Such relationships are also crucial to the accuracy of action recognition. In this paper, we propose to model the causal relationships based on the precondition and effect to improve the performance of action recognition. Specifically, a Cycle-Reasoning model is proposed to capture the causal relationships for action recognition. To this end, we annotate precondition and effect for a large-scale action dataset. Experimental results show that the proposed Cycle-Reasoning model can effectively reason about the precondition and effect and can enhance action recognition performance.

【8】 Tell me what you see: A zero-shot action recognition method based on natural language descriptions
Link: https://arxiv.org/abs/2112.09976

Authors: Valter Estevam, Rayson Laroca, David Menotti, Helio Pedrini
Affiliations: Federal Institute of Paraná, Irati-PR, Brazil; University of Campinas, Institute of Computing, Campinas-SP, Brazil; Federal University of Paraná, Department of Informatics, Curitiba-PR, Brazil
Abstract: Recently, several approaches have explored the detection and classification of objects in videos to perform Zero-Shot Action Recognition with remarkable results. In these methods, class-object relationships are used to associate visual patterns with the semantic side information because these relationships also tend to appear in texts. Therefore, word vector methods would reflect them in their latent representations. Inspired by these methods and by video captioning's ability to describe events not only with a set of objects but with contextual information, we propose a method in which video captioning models, called observers, provide different and complementary descriptive sentences. We demonstrate that representing videos with descriptive sentences instead of deep features, in ZSAR, is viable and naturally alleviates the domain adaptation problem, as we reached state-of-the-art (SOTA) performance on the UCF101 dataset and competitive performance on HMDB51 without their training sets. We also demonstrate that word vectors are unsuitable for building the semantic embedding space of our descriptions. Thus, we propose to represent the classes with sentences extracted from documents acquired with search engines on the Internet, without any human evaluation of the quality of descriptions. Lastly, we build a shared semantic space employing BERT-based embedders pre-trained on the paraphrasing task on multiple text datasets. We show that this pre-training is essential for bridging the semantic gap. The projection onto this space is straightforward for both types of information, visual and semantic, because they are sentences, enabling classification with the nearest-neighbour rule in this shared space. Our code is available at https://github.com/valterlej/zsarcap.
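
The final classification rule is a nearest-neighbour match in the shared sentence-embedding space. A minimal sketch, assuming the caption embedding and the per-class description embeddings come from the same BERT-based sentence embedder:

```python
import torch
import torch.nn.functional as F

def classify(caption_emb, class_embs):
    """Nearest-neighbour rule in the shared space: the predicted class is
    the one whose description embedding is most similar to the embedding
    of the video's generated caption.
    caption_emb: (D,), class_embs: (K, D)."""
    sims = F.cosine_similarity(caption_emb.unsqueeze(0), class_embs, dim=1)
    return int(sims.argmax())
```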

【9】 Distill and De-bias: Mitigating Bias in Face Recognition using Knowledge Distillation
Link: https://arxiv.org/abs/2112.09786

Authors: Prithviraj Dhar, Joshua Gleason, Aniket Roy, Carlos D. Castillo, P. Jonathon Phillips, Rama Chellappa
Affiliations: Johns Hopkins University; University of Maryland, College Park; NIST
Abstract: Face recognition networks generally demonstrate bias with respect to sensitive attributes like gender, skintone etc. For gender and skintone, we observe that the regions of the face that a network attends to vary by the category of an attribute. This might contribute to bias. Building on this intuition, we propose a novel distillation-based approach called Distill and De-bias (D&D) to enforce a network to attend to similar face regions, irrespective of the attribute category. In D&D, we train a teacher network on images from one category of an attribute; e.g. light skintone. Then distilling information from the teacher, we train a student network on images of the remaining category; e.g., dark skintone. A feature-level distillation loss constrains the student network to generate teacher-like representations. This allows the student network to attend to similar face regions for all attribute categories and enables it to reduce bias. We also propose a second distillation step on top of D&D, called D&D++. For the D&D++ network, we distill the `un-biasedness' of the D&D network into a new student network, the D&D++ network. We train the new network on all attribute categories; e.g., both light and dark skintones. This helps us train a network that is less biased for an attribute, while obtaining higher face verification performance than D&D. We show that D&D++ outperforms existing baselines in reducing gender and skintone bias on the IJB-C dataset, while obtaining higher face verification performance than existing adversarial de-biasing methods. We evaluate the effectiveness of our proposed methods on two state-of-the-art face recognition networks: Crystalface and ArcFace.
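
A sketch of the feature-level distillation constraint at the heart of D&D. The abstract does not pin down the exact loss, so mean-squared error over L2-normalized features is an assumption:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_feats, teacher_feats):
    """Push the student toward teacher-like representations; the teacher is
    frozen (detached) so gradients only flow into the student."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats.detach(), dim=1)
    return F.mse_loss(s, t)
```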

【10】 Skin lesion segmentation and classification using deep learning and handcrafted features
Link: https://arxiv.org/abs/2112.10307

Authors: Redha Ali, Hussin K. Ragb
Affiliations: Department of Electrical and Computer Engineering, University of Dayton, College Park, Dayton, Ohio; Christian Brothers University, Memphis, Tennessee
Note: 7 pages, 3 figures
Abstract: Accurate diagnosis of a skin lesion is a critical task in classifying dermoscopic images. In this research, we form a new type of image feature, called hybrid features, which has stronger discrimination ability than single-method features. This study involves a new technique where we inject the handcrafted features or feature transfer into the fully connected layer of a Convolutional Neural Network (CNN) model during the training process. Based on our literature review, until now no study has examined or investigated the impact on classification performance of injecting handcrafted features into a CNN model during the training process. In addition, we also investigated the impact of the segmentation mask and its effect on the overall classification performance. Our model achieves a 92.3% balanced multiclass accuracy, which is 6.8% better than the typical single-method classifier architecture for deep learning.
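
Injecting handcrafted features into the fully connected layer amounts to feature concatenation just before the classifier. A minimal sketch, with an assumed ResNet-18 backbone and an assumed 64-dimensional handcrafted descriptor:

```python
import torch
import torch.nn as nn
from torchvision import models

class HybridNet(nn.Module):
    """Concatenate CNN features with a handcrafted descriptor at the FC
    layer, so both feature types are trained jointly with the classifier."""
    def __init__(self, num_classes, handcrafted_dim=64):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop FC
        self.fc = nn.Linear(512 + handcrafted_dim, num_classes)

    def forward(self, image, handcrafted):
        deep = self.cnn(image).flatten(1)                 # (B, 512)
        return self.fc(torch.cat([deep, handcrafted], 1))
```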

【11】 Interpretable and Interactive Deep Multiple Instance Learning for Dental Caries Classification in Bitewing X-rays
Link: https://arxiv.org/abs/2112.09694

Authors: Benjamin Bergner, Csaba Rohrer, Aiham Taleb, Martha Duchrau, Guilherme De Leon, Jonas Almeida Rodrigues, Falk Schwendicke, Joachim Krois, Christoph Lippert
Affiliations: Digital Health & Machine Learning, Hasso Plattner Institute, University of Potsdam, Germany; Department of Oral Diagnostics, Digital Health and Health Services Research, Charité - Universitätsmedizin Berlin, Germany
Note: 19 pages, 10 figures, submitted to MIDL 2022
Abstract: We propose a simple and efficient image classification architecture based on deep multiple instance learning, and apply it to the challenging task of caries detection in dental radiographs. Technically, our approach contributes in two ways: First, it outputs a heatmap of local patch classification probabilities despite being trained with weak image-level labels. Second, it is amenable to learning from segmentation labels to guide training. In contrast to existing methods, the human user can faithfully interpret predictions and interact with the model to decide which regions to attend to. Experiments are conducted on a large clinical dataset of ~38k bitewings (~316k teeth), where we achieve competitive performance compared to various baselines. When guided by an external caries segmentation model, a significant improvement in classification and localization performance is observed.

Segmentation & Semantics (8 papers)

【1】 Mask2Former for Video Instance Segmentation
Link: https://arxiv.org/abs/2112.10764

Authors: Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing
Note: †Equal advising. Code and models: this https URL
Abstract: We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline. In this report, we show universal image segmentation architectures trivially generalize to video segmentation by directly predicting 3D segmentation volumes. Specifically, Mask2Former sets a new state-of-the-art of 60.4 AP on YouTubeVIS-2019 and 52.6 AP on YouTubeVIS-2021. We believe Mask2Former is also capable of handling video semantic and panoptic segmentation, given its versatility in image segmentation. We hope this will make state-of-the-art video segmentation research more accessible and bring more attention to designing universal image and video segmentation architectures.

【2】 Anisotropic mesh adaptation for region-based segmentation accounting for image spatial information
Link: https://arxiv.org/abs/2112.10138

Authors: Matteo Giacomini, Simona Perotto
Note: 38 pages, 13 figures, 1 table
Abstract: A finite element-based image segmentation strategy enhanced by an anisotropic mesh adaptation procedure is presented. The methodology relies on a split Bregman algorithm for the minimisation of a region-based energy functional and on an anisotropic recovery-based error estimate to drive mesh adaptation. More precisely, a Bayesian energy functional is considered to account for image spatial information, ensuring that the methodology is able to identify inhomogeneous spatial patterns in complex images. In addition, the anisotropic mesh adaptation guarantees a sharp detection of the interface between background and foreground of the image, with a reduced number of degrees of freedom. The resulting split-adapt Bregman algorithm is tested on a set of real images showing the accuracy and robustness of the method, even in the presence of Gaussian, salt and pepper and speckle noise.
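
For orientation, the generic split Bregman iteration for a TV-regularized region-based functional takes the form below. This is a textbook sketch; the paper's Bayesian data term and the anisotropic mesh adaptation loop are not reproduced here:

```latex
% Minimize  |\nabla u|_{1} + \mu \int_\Omega r(x)\,u(x)\,dx  over  0 \le u \le 1,
% with splitting variable d = \nabla u and Bregman variable b:
\begin{align*}
(u^{k+1}, d^{k+1}) &= \arg\min_{u,\,d}\; \|d\|_{1} + \mu\,\langle r, u\rangle
    + \frac{\lambda}{2}\,\|d - \nabla u - b^{k}\|_{2}^{2},\\
b^{k+1} &= b^{k} + \nabla u^{k+1} - d^{k+1}.
\end{align*}
```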

【3】 Prompt-Based Multi-Modal Image Segmentation
Link: https://arxiv.org/abs/2112.10003

Authors: Timo Lüddecke, Alexander S. Ecker
Affiliations: Institute of Computer Science and Campus Institute Data Science, University of Göttingen; Max Planck Institute for Dynamics and Self-Organization
Abstract: Image segmentation is usually addressed by training a model for a fixed set of object classes. Incorporating additional classes or more complex queries later is expensive as it requires re-training the model on a dataset that encompasses these expressions. Here we propose a system that can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text or an image. This approach enables us to create a unified model (trained once) for three common segmentation tasks, which come with distinct challenges: referring expression segmentation, zero-shot segmentation and one-shot segmentation. We build upon the CLIP model as a backbone which we extend with a transformer-based decoder that enables dense prediction. After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query. Different variants of the latter image-based prompts are analyzed in detail. This novel hybrid input allows for dynamic adaptation not only to the three segmentation tasks mentioned above, but to any binary segmentation task where a text or image query can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties. Source code: https://eckerlab.org/code/clipseg
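
Conditioning a dense decoder on a prompt embedding can be sketched with FiLM-style modulation. This is a conceptual illustration of the idea, not the released CLIPSeg code; all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class PromptConditionedDecoder(nn.Module):
    """A CLIP prompt embedding modulates frozen image features via learned
    scale/shift (FiLM), and a 1x1 conv head predicts a binary mask. Because
    CLIP aligns text and image encoders, the same prompt_emb slot accepts
    either a text prompt or an image prompt."""
    def __init__(self, feat_dim=512, prompt_dim=512):
        super().__init__()
        self.film = nn.Linear(prompt_dim, 2 * feat_dim)   # scale and shift
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, img_feats, prompt_emb):
        # img_feats: (B, C, H, W); prompt_emb: (B, prompt_dim)
        scale, shift = self.film(prompt_emb).chunk(2, dim=1)
        x = img_feats * scale[:, :, None, None] + shift[:, :, None, None]
        return torch.sigmoid(self.head(x))                # per-pixel mask
```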

【4】 Anomaly Discovery in Semantic Segmentation via Distillation Comparison Networks
Link: https://arxiv.org/abs/2112.09908

Authors: Huan Zhou, Shi Gong, Yu Zhou, Zengqiang Zheng, Ronghua Liu, Xiang Bai
Affiliations: Huazhong University of Science and Technology; Wuhan Jingce Electronic Group Co., Ltd.
Note: 9 pages, 7 figures
Abstract: This paper aims to address the problem of anomaly discovery in semantic segmentation. Our key observation is that semantic classification plays a critical role in existing approaches, while incorrectly classified pixels are easily regarded as anomalies. Such a phenomenon frequently appears but is rarely discussed, which significantly reduces the performance of anomaly discovery. To this end, we propose a novel Distillation Comparison Network (DiCNet). It comprises a teacher branch, which is a semantic segmentation network with the semantic classification head removed, and a student branch that is distilled from the teacher branch through a distribution distillation. We show that the distillation guarantees that the semantic features of the two branches hold consistency in the known classes, while reflecting inconsistency in the unknown class. Therefore, we leverage the semantic feature discrepancy between the two branches to discover the anomalies. DiCNet abandons the semantic classification head in the inference process, and hence significantly alleviates the issue caused by incorrect semantic classification. Extensive experiments are conducted on the StreetHazards dataset and the BDD-Anomaly dataset to verify the superior performance of DiCNet. In particular, DiCNet obtains a 6.3% improvement in AUPR and a 5.2% improvement in FPR95 on the StreetHazards dataset, and achieves a 4.2% improvement in AUPR and a 6.8% improvement in FPR95 on the BDD-Anomaly dataset. Code is available at https://github.com/zhouhuan-hust/DiCNet.
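
At inference, the anomaly score reduces to a per-pixel discrepancy between teacher and student features. Cosine distance below is an assumed choice of discrepancy measure; the paper may use a different one:

```python
import torch
import torch.nn.functional as F

def anomaly_map(teacher_feats, student_feats):
    """Per-pixel feature discrepancy between the two branches; large values
    indicate pixels where the distillation consistency breaks down, i.e.
    likely anomalies. Inputs: (B, C, H, W) feature maps."""
    t = F.normalize(teacher_feats, dim=1)
    s = F.normalize(student_feats, dim=1)
    return 1.0 - (t * s).sum(dim=1)          # (B, H, W), higher = more anomalous
```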

【5】 3D Instance Segmentation of MVS Buildings 标题:MVS建筑的三维实例分割 链接:https://arxiv.org/abs/2112.09902

作者:Yanghui Xu,Jiazhou Chen,Shufang Lu,Ronghua Liang,Liangliang Nan 机构: Liang are with the School of Computer Science and Technology, Zhejiang University of Technology, Nan is with the Delft University of Technology 摘要:我们提出了一个从多视图立体(MVS)城市场景中分割三维建筑物的新框架。与关注城市场景语义分割的现有工作不同,这项工作的重点在于检测和分割三维建筑实例,即使它们彼此相连并嵌入在大型且不精确的三维曲面模型中。首先通过添加高度图将多视图RGB图像增强为RGBH图像,然后使用微调的2D实例分割神经网络对其进行分割,得到所有屋顶实例。接着将来自不同多视图图像的屋顶实例掩码聚类为全局掩码。我们的掩码聚类考虑了空间遮挡和重叠,可以消除多视图图像之间的分割歧义。基于这些全局掩码,通过掩码反投影分割出三维屋顶实例,并通过马尔可夫随机场(MRF)优化将其扩展为完整的建筑实例。定量评估和消融研究表明了该方法所有主要步骤的有效性。我们还提供了一个用于评估三维建筑模型实例分割的数据集。据我们所知,它是第一个实例分割级别的三维城市建筑数据集。 摘要:We present a novel framework for instance segmentation of 3D buildings from Multi-view Stereo (MVS) urban scenes. Unlike existing works focusing on semantic segmentation of an urban scene, the emphasis of this work lies in detecting and segmenting 3D building instances even if they are attached and embedded in a large and imprecise 3D surface model. Multi-view RGB images are first enhanced to RGBH images by adding a heightmap and are segmented to obtain all roof instances using a fine-tuned 2D instance segmentation neural network. Roof instance masks from different multi-view images are then clustered into global masks. Our mask clustering accounts for spatial occlusion and overlapping, which can eliminate segmentation ambiguities among multi-view images. Based on these global masks, 3D roof instances are segmented out by mask back-projections and extended to the entire building instances through a Markov random field (MRF) optimization. Quantitative evaluations and ablation studies have shown the effectiveness of all major steps of the method. A dataset for the evaluation of instance segmentation of 3D building models is provided as well. To the best of our knowledge, it is the first dataset for 3D urban buildings on the instance segmentation level.

【6】 HyperSegNAS: Bridging One-Shot Neural Architecture Search with 3D Medical Image Segmentation using HyperNet 标题:HyperSegNAS:利用HyperNet将一次性神经结构搜索与三维医学图像分割相结合 链接:https://arxiv.org/abs/2112.10652

作者:Cheng Peng,Andriy Myronenko,Ali Hatamizadeh,Vish Nath,Md Mahfuzur Rahman Siddiquee,Yufan He,Daguang Xu,Rama Chellappa,Dong Yang 机构:Johns Hopkins University, NVIDIA, Arizona State University 摘要:由于物体(如器官或肿瘤)形状和模式的高度可变性,三维医学图像的语义分割是一项具有挑战性的任务。鉴于近年来深度学习在医学图像分割方面取得的成功,人们引入了神经结构搜索(NAS)来寻找高性能的三维分割网络结构。然而,由于3D数据的大量计算需求和架构搜索的离散优化性质,以前的NAS方法需要较长的搜索时间或必要的连续松弛,通常会得到次优的网络架构。虽然一次性(one-shot)NAS可以潜在地解决这些缺点,但在广阔的多尺度多路径搜索空间中,它在分割领域的应用还没有得到很好的研究。为了使一次性NAS能够用于医学图像分割,我们的方法(称为HyperSegNAS)引入了一个HyperNet,通过融入架构拓扑信息来辅助超网(super-net)训练。这样的HyperNet可以在超网训练完成后移除,并且在架构搜索期间不会引入任何开销。我们表明,与以前最先进的(SOTA)分割网络相比,HyperSegNAS产生了性能更好、更直观的体系结构;此外,它还可以在不同的计算约束条件下快速准确地找到合适的候选架构。我们的方法在医学分割十项全能(MSD)挑战赛的公共数据集上进行了评估,并实现了SOTA性能。 摘要:Semantic segmentation of 3D medical images is a challenging task due to the high variability of the shape and pattern of objects (such as organs or tumors). Given the recent success of deep learning in medical image segmentation, Neural Architecture Search (NAS) has been introduced to find high-performance 3D segmentation network architectures. However, because of the massive computational requirements of 3D data and the discrete optimization nature of architecture search, previous NAS methods require a long search time or necessary continuous relaxation, and commonly lead to sub-optimal network architectures. While one-shot NAS can potentially address these disadvantages, its application in the segmentation domain has not been well studied in the expansive multi-scale multi-path search space. To enable one-shot NAS for medical image segmentation, our method, named HyperSegNAS, introduces a HyperNet to assist super-net training by incorporating architecture topology information. Such a HyperNet can be removed once the super-net is trained and introduces no overhead during architecture search. We show that HyperSegNAS yields better performing and more intuitive architectures compared to the previous state-of-the-art (SOTA) segmentation networks; furthermore, it can quickly and accurately find good architecture candidates under different computing constraints. Our method is evaluated on public datasets from the Medical Segmentation Decathlon (MSD) challenge, and achieves SOTA performances.

【7】 Deep Co-supervision and Attention Fusion Strategy for Automatic COVID-19 Lung Infection Segmentation on CT Images 标题:CT图像上COVID-19肺部感染自动分割的深度协同监督与注意力融合策略 链接:https://arxiv.org/abs/2112.10368

作者:Haigen Hu,Leizhao Shen,Qiu Guan,Xiaoxin Li,Qianwei Zhou,Su Ruan 机构:College of Computer Science and Technology, Zhejiang University of Technology, Key Laboratory of Visual Media Intelligent Processing Technology of Zhejiang Province, University of Rouen Normandy 备注:None 摘要:由于感染区域形状不规则、大小各异,且正常组织与感染组织之间边界难以区分,在CT图像上准确分割COVID-19感染病灶仍然是一项具有挑战性的任务。本文基于编码器-解码器结构,通过增强监督信息并融合不同层次的多尺度特征图,提出了一种新的COVID-19感染区域分割方案。为此,提出了一种深度协同监督(Co-supervision)方案,指导网络学习边缘和语义特征。更具体地说,首先设计了一个边缘监督模块(ESM),通过将边缘监督信息合并到下采样的初始阶段来突出低层边界特征。同时,提出了一种辅助语义监督模块(ASSM),通过将掩码监督信息集成到后一阶段来增强高层语义信息。然后,开发了一个注意力融合模块(AFM),利用注意力机制融合不同层次的多尺度特征图,以缩小高层和低层特征图之间的语义差距。最后,在四个不同的COVID-19 CT数据集上验证了所提方案的有效性。结果表明,所提出的三个模块都是有前景的。在基线(ResUnet)之上,单独使用ESM、ASSM或AFM可使我们数据集上的Dice指标分别提升1.12%、1.95%和1.63%,而将三个模块集成在一起可提升3.97%。与现有方法在各数据集上的结果相比,该方法在一些主要指标上能获得更好的分割性能,并具有最佳的泛化和综合性能。 摘要:Due to the irregular shapes, various sizes and indistinguishable boundaries between the normal and infected tissues, it is still a challenging task to accurately segment the infected lesions of COVID-19 on CT images. In this paper, a novel segmentation scheme is proposed for the infections of COVID-19 by enhancing supervised information and fusing multi-scale feature maps of different levels based on the encoder-decoder architecture. To this end, a deep collaborative supervision (Co-supervision) scheme is proposed to guide the network learning the features of edges and semantics. More specifically, an Edge Supervised Module (ESM) is firstly designed to highlight low-level boundary features by incorporating the edge supervised information into the initial stage of down-sampling. Meanwhile, an Auxiliary Semantic Supervised Module (ASSM) is proposed to strengthen high-level semantic information by integrating mask supervised information into the later stage. Then an Attention Fusion Module (AFM) is developed to fuse multiple scale feature maps of different levels by using an attention mechanism to reduce the semantic gaps between high-level and low-level feature maps. Finally, the effectiveness of the proposed scheme is demonstrated on four various COVID-19 CT datasets. The results show that the proposed three modules are all promising. Based on the baseline (ResUnet), using ESM, ASSM, or AFM alone can respectively increase Dice metric by 1.12%, 1.95%, 1.63% in our dataset, while the integration by incorporating three models together can rise 3.97%. Compared with the existing approaches in various datasets, the proposed method can obtain better segmentation performance in some main metrics, and can achieve the best generalization and comprehensive performance.

【8】 QU-BraTS: MICCAI BraTS 2020 Challenge on Quantifying Uncertainty in Brain Tumor Segmentation -- Analysis of Ranking Metrics and Benchmarking Results 标题:QU-BraTS:MICCAI BraTS 2020脑肿瘤分割不确定性量化挑战--排名指标与基准结果分析 链接:https://arxiv.org/abs/2112.10074

作者:Raghav Mehta,Angelos Filos,Ujjwal Baid,Chiharu Sako,Richard McKinley,Michael Rebsamen,Katrin Dätwyler,Raphael Meier,Piotr Radojewski,Gowtham Krishnan Murugesan,Sahil Nalawade,Chandan Ganesh,Ben Wagner,Fang F. Yu,Baowei Fei,Ananth J. Madhuranthakam,Joseph A. Maldjian,Laura Daza,Catalina Gómez,Pablo Arbeláez,Chengliang Dai,Shuo Wang,Hadrien Raynaud,Yuanhan Mo,Elsa Angelini,Yike Guo,Wenjia Bai,Subhashis Banerjee,Linmin Pei,Murat AK,Sarahi Rosas-González,Illyess Zemmoura,Clovis Tauber,Minh H. Vu,Tufve Nyholm,Tommy Löfstedt,Laura Mora Ballestar,Veronica Vilaplana,Hugh McHugh,Gonzalo Maso Talou,Alan Wang,Jay Patel,Ken Chang,Katharina Hoebel,Mishka Gidwani,Nishanth Arun,Sharut Gupta,Mehak Aggarwal,Praveer Singh,Elizabeth R. Gerstner,Jayashree Kalpathy-Cramer,Nicolas Boutry,Alexis Huard,Lasitha Vidyaratne,Md Monibor Rahman,Khan M. Iftekharuddin,Joseph Chazalon,Elodie Puybareau,Guillaume Tochon,Jun Ma,Mariano Cabezas,Xavier Llado,Arnau Oliver,Liliana Valencia,Sergi Valverde,Mehdi Amian,Mohammadreza Soltaninejad,Andriy Myronenko,Ali Hatamizadeh,Xue Feng,Quan Dou,Nicholas Tustison,Craig Meyer,Nisarg A. Shah,Sanjay Talbar,Marc-André Weber,Abhishek Mahajan,Andras Jakab,Roland Wiest,Hassan M. Fathallah-Shaykh,Arash Nazeri,Mikhail Milchenko,Daniel Marcus,Aikaterini Kotrotsou,Rivka Colen,John Freymann,Justin Kirby,Christos Davatzikos,Bjoern Menze,Spyridon Bakas,Yarin Gal,Tal Arbel 机构:Centre for Intelligent Machines (CIM), McGill University, Montreal, QC, Canada, Oxford Applied and Theoretical Machine Learning (OATML) Group, University of Oxford, Oxford, England, Center for Biomedical Image 备注:Under submission at MELBA journal 摘要:深度学习(DL)模型在各种医学成像基准挑战中提供了最先进的性能,包括脑肿瘤分割(BraTS)挑战。然而,局灶性病变的多区室分割(如肿瘤和病变亚区)任务尤其具有挑战性,潜在的错误阻碍了DL模型转化为临床工作流程。以不确定性的形式量化DL模型预测的可靠性,可以对最不确定的区域进行临床审查,从而建立信任并为临床转化铺平道路。最近,许多不确定性估计方法被引入到DL医学图像分割任务中。制定评估和比较不确定性度量性能的指标将有助于最终用户做出更明智的决策。在本研究中,我们探索和评估了BraTS 2019-2020年不确定性量化任务(QU-BraTS)期间制定的一项指标,旨在评估和排序脑肿瘤多区室分割的不确定性估计。该指标(1)奖励在正确断言中产生高置信度、在错误断言中分配低置信度的不确定性估计,以及(2)惩罚导致置信不足的正确断言比例较高的不确定性度量。我们进一步对QU-BraTS 2020的14个独立参赛团队产生的分割不确定性进行了基准测试,所有团队也参与了主要的BraTS分割任务。总的来说,我们的研究结果证实了不确定性估计对分割算法的重要性和互补价值,因此强调了医学图像分析中不确定性量化的必要性。我们的评估代码公开于 https://github.com/RagMeh11/QU-BraTS 。 摘要:Deep learning (DL) models have provided the state-of-the-art performance in a wide variety of medical imaging benchmarking challenges, including the Brain Tumor Segmentation (BraTS) challenges. However, the task of focal pathology multi-compartment segmentation (e.g., tumor and lesion sub-regions) is particularly challenging, and potential errors hinder the translation of DL models into clinical workflows. Quantifying the reliability of DL model predictions in the form of uncertainties, could enable clinical review of the most uncertain regions, thereby building trust and paving the way towards clinical translation. Recently, a number of uncertainty estimation methods have been introduced for DL medical image segmentation tasks. Developing metrics to evaluate and compare the performance of uncertainty measures will assist the end-user in making more informed decisions. In this study, we explore and evaluate a metric developed during the BraTS 2019-2020 task on uncertainty quantification (QU-BraTS), and designed to assess and rank uncertainty estimates for brain tumor multi-compartment segmentation. This metric (1) rewards uncertainty estimates that produce high confidence in correct assertions, and those that assign low confidence levels at incorrect assertions, and (2) penalizes uncertainty measures that lead to a higher percentages of under-confident correct assertions. We further benchmark the segmentation uncertainties generated by 14 independent participating teams of QU-BraTS 2020, all of which also participated in the main BraTS segmentation task. Overall, our findings confirm the importance and complementary value that uncertainty estimates provide to segmentation algorithms, and hence highlight the need for uncertainty quantification in medical image analyses. Our evaluation code is made publicly available at https://github.com/RagMeh11/QU-BraTS.

Zero/Few Shot|迁移|域适配|自适应(5篇)

【1】 Reciprocal Normalization for Domain Adaptation 标题:领域自适应的互易归一化方法 链接:https://arxiv.org/abs/2112.10474

作者:Zhiyong Huang,Kekai Sheng,Ke Li,Jian Liang,Taiping Yao,Weiming Dong,Dengwen Zhou,Xing Sun 备注:The best feature normalization module for domain adaptation 摘要:批归一化(BN)广泛应用于现代深度神经网络中,已被证明会编码与领域相关的知识,因此对于无监督领域自适应(UDA)等跨领域任务是无效的。现有的BN变体方法在归一化模块中将源域和目标域知识聚集在同一通道中。然而,跨域对应通道的特征之间的不一致通常导致次优的可迁移性。在本文中,我们利用跨域关系,提出了一种新的归一化方法:互易归一化(RN)。具体地说,RN首先提出了一个互易补偿(RC)模块,基于跨域通道相关性为两个域中的每个通道获取补偿;然后,RN开发了一个互易聚合(RA)模块,自适应地将特征与其跨域补偿成分进行聚合。作为BN的替代方案,RN更适合UDA问题,并且可以很容易地集成到流行的域自适应方法中。实验表明,所提出的RN大大优于现有的归一化方法,并能帮助最先进的自适应方法获得更好的结果。源代码可在 https://github.com/Openning07/reciprocal-normalization-for-DA 获取。 摘要:Batch normalization (BN) is widely used in modern deep neural networks, which has been shown to represent the domain-related knowledge, and thus is ineffective for cross-domain tasks like unsupervised domain adaptation (UDA). Existing BN variant methods aggregate source and target domain knowledge in the same channel in normalization module. However, the misalignment between the features of corresponding channels across domains often leads to a sub-optimal transferability. In this paper, we exploit the cross-domain relation and propose a novel normalization method, Reciprocal Normalization (RN). Specifically, RN first presents a Reciprocal Compensation (RC) module to acquire the compensatory for each channel in both domains based on the cross-domain channel-wise correlation. Then RN develops a Reciprocal Aggregation (RA) module to adaptively aggregate the feature with its cross-domain compensatory components. As an alternative to BN, RN is more suitable for UDA problems and can be easily integrated into popular domain adaptation methods. Experiments show that the proposed RN outperforms existing normalization counterparts by a large margin and helps state-of-the-art adaptation approaches achieve better results. The source code is available on https://github.com/Openning07/reciprocal-normalization-for-DA.
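下面按摘要所述"基于跨域通道相关性为每个通道获取补偿"的思路给出一段高度简化的示意(PyTorch;通道描述子的构造与加权方式均为假设,并非论文官方实现):

```python
import torch
import torch.nn.functional as F

def reciprocal_compensation(f_s: torch.Tensor, f_t: torch.Tensor):
    """互易补偿思路的最小示意:用跨域通道相关性加权另一域的通道统计量,
    得到本域每个通道的补偿项。f_s, f_t: (B, C, H, W) 的源/目标域特征。"""
    def chan_desc(f):
        # 每个通道的全局描述子并归一化: (C, B)
        return F.normalize(f.mean(dim=(2, 3)).t(), dim=1)
    corr = chan_desc(f_s) @ chan_desc(f_t).t()   # (C, C) 跨域通道相关
    w = corr.softmax(dim=1)
    mu_s = f_s.mean(dim=(0, 2, 3))               # 各通道均值统计, (C,)
    mu_t = f_t.mean(dim=(0, 2, 3))
    comp_s = w @ mu_t                            # 源域各通道的补偿 = 目标域统计的加权和
    comp_t = w.t() @ mu_s                        # 反之亦然
    return comp_s, comp_t
```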

【2】 Face Deblurring Based on Separable Normalization and Adaptive Denormalization 标题:基于可分离归一化和自适应反归一化的人脸去模糊 链接:https://arxiv.org/abs/2112.09833

作者:Xian Zhang,Hao Zhang,Jiancheng Lv,Xiaojie Li 机构: Li are with the Chengdu University of Information Technology, Lv is with the Sichuan University 摘要:人脸去模糊的目的是从模糊的输入图像中恢复结构更清晰、面部细节更丰富的人脸图像。然而,大多数传统的图像和人脸去模糊方法只关注所生成图像的整体分辨率,不考虑人脸特定部位的纹理,通常产生的细节不足。考虑到人脸和背景具有不同的分布信息,在本研究中,我们设计了一种基于可分离归一化和自适应反归一化的有效人脸去模糊网络(SNADNet)。首先,我们对人脸解析网络进行了微调,以获得准确的人脸结构。然后,将人脸解析特征分为人脸前景和人脸背景。此外,我们还构造了一种新的特征自适应反归一化来正则化人脸结构,作为生成更和谐、更不失真的人脸结构的辅助条件。另外,我们提出了一种纹理提取器和多补丁鉴别器(multi-patch discriminator)来增强生成的人脸纹理信息。在CelebA和CelebA-HQ数据集上的实验结果表明,所提出的人脸去模糊网络能够恢复具有更多面部细节的人脸结构,并在结构相似性(SSIM)、峰值信噪比(PSNR)、Frechet Inception距离(FID)和L1指标以及定性比较方面均优于最先进的方法。 摘要:Face deblurring aims to restore a clear face image from a blurred input image with more explicit structure and facial details. However, most conventional image and face deblurring methods focus on the whole generated image resolution without consideration of special face part texture and generally produce unsufficient details. Considering that faces and backgrounds have different distribution information, in this study, we designed an effective face deblurring network based on separable normalization and adaptive denormalization (SNADNet). First, we fine-tuned the face parsing network to obtain an accurate face structure. Then, we divided the face parsing feature into face foreground and background. Moreover, we constructed a new feature adaptive denormalization to regularize facial structures as a condition of the auxiliary to generate more harmonious and undistorted face structure. In addition, we proposed a texture extractor and multi-patch discriminator to enhance the generated facial texture information. Experimental results on both CelebA and CelebA-HQ datasets demonstrate that the proposed face deblurring network restores face structure with more facial details and performs favorably against state-of-the-art methods in terms of structured similarity indexing method (SSIM), peak signal-to-noise ratio (PSNR), Frechet inception distance (FID) and L1, and qualitative comparisons.

【3】 Improving Multi-Domain Generalization through Domain Re-labeling 标题:通过域重标注改进多域泛化 链接:https://arxiv.org/abs/2112.09802

作者:Kowshik Thopalli,Sameeksha Katoch,Andreas Spanias,Pavan Turaga,Jayaraman J. Thiagarajan 机构: Arizona State University 摘要:领域泛化(DG)方法旨在开发能够泛化到测试分布不同于训练数据的场景的模型。在本文中,我们重点研究具有挑战性的多源Zero-Shot DG问题,其中来自多个源域的带标签训练数据可用,但无法访问来自目标域的数据。尽管这个问题已经成为一个重要的研究课题,但令人惊讶的是,将所有源数据汇集在一起并训练单个分类器的简单解决方案在标准基准上具有很强的竞争力。更重要的是,即使是针对不同领域的不变性进行显式优化的复杂方法,也不一定能带来优于ERM的显著收益。在本文中,我们首次研究了预先指定的域标签与泛化性能之间的重要联系。通过一个激励性的案例研究和分布鲁棒优化算法的一个新变体GroupDRO++,我们首先演示了推断自定义域分组如何能够相对于数据集自带的原始域标签带来一致的改进。随后,我们介绍了一种通用的多域泛化方法MulDEns,该方法使用基于ERM的深度集成(ensembling)主干,并通过元优化算法执行隐式的域重标注。通过对多个标准基准的实证研究,我们发现MulDEns不需要针对数据集定制增强策略或训练过程,其表现始终显著优于ERM,并产生最先进的泛化性能,即使与利用域标签的现有方法相比也是如此。 摘要:Domain generalization (DG) methods aim to develop models that generalize to settings where the test distribution is different from the training data. In this paper, we focus on the challenging problem of multi-source zero-shot DG, where labeled training data from multiple source domains is available but with no access to data from the target domain. Though this problem has become an important topic of research, surprisingly, the simple solution of pooling all source data together and training a single classifier is highly competitive on standard benchmarks. More importantly, even sophisticated approaches that explicitly optimize for invariance across different domains do not necessarily provide non-trivial gains over ERM. In this paper, for the first time, we study the important link between pre-specified domain labels and the generalization performance. Using a motivating case-study and a new variant of a distributional robust optimization algorithm, GroupDRO++, we first demonstrate how inferring custom domain groups can lead to consistent improvements over the original domain labels that come with the dataset. Subsequently, we introduce a general approach for multi-domain generalization, MulDEns, that uses an ERM-based deep ensembling backbone and performs implicit domain re-labeling through a meta-optimization algorithm. Using empirical studies on multiple standard benchmarks, we show that MulDEns does not require tailoring the augmentation strategy or the training process specific to a dataset, consistently outperforms ERM by significant margins, and produces state-of-the-art generalization performance, even when compared to existing methods that exploit the domain labels.

【4】 Adaptive Subsampling for ROI-based Visual Tracking: Algorithms and FPGA Implementation 标题:基于感兴趣区域的自适应亚采样视觉跟踪算法及FPGA实现 链接:https://arxiv.org/abs/2112.09775

作者:Odrika Iqbal,Victor Isaac Torres Muro,Sameeksha Katoch,Andreas Spanias,Suren Jayasuriya 机构:Jayasuriya, Member, IEEE 摘要:通过在图像传感器设计中加入可编程感兴趣区域(ROI)读出,可以极大地提高嵌入式视觉系统的能效。在这项工作中,我们研究了如何在跟踪应用中利用ROI可编程性:预测ROI在未来帧中的位置,并关闭该区域之外的像素。我们将这种ROI预测过程和相应的传感器配置称为自适应子采样。我们的自适应子采样算法包括一个对象检测器和一个ROI预测器(卡尔曼滤波器),它们协同工作以优化视觉管道的能量效率,最终任务是对象跟踪。为了进一步促进我们的自适应算法在现实场景中的落地,我们选择了一个候选算法并将其映射到FPGA上。利用Xilinx Vitis AI工具,我们设计并加速了基于YOLO对象检测器的自适应子采样算法。为了在部署后进一步改进算法,我们在OTB100和LaSOT数据集上评估了几个相互竞争的基线。我们发现,将ECO跟踪器与卡尔曼滤波器耦合,在OTB100和LaSOT数据集上分别取得了0.4568和0.3471的有竞争力的AUC分数。此外,该算法的功耗效率与其他基线相当,在某些情况下甚至更优。基于ECO的算法在两个数据集上的平均功耗约为4W,而基于YOLO的方法需要约6W的功耗(根据我们的功耗模型)。在精度-延迟权衡方面,基于ECO的算法提供了接近实时的性能(19.23 FPS),同时能够获得具有竞争力的跟踪精度。 摘要:There is tremendous scope for improving the energy efficiency of embedded vision systems by incorporating programmable region-of-interest (ROI) readout in the image sensor design. In this work, we study how ROI programmability can be leveraged for tracking applications by anticipating where the ROI will be located in future frames and switching pixels off outside of this region. We refer to this process of ROI prediction and corresponding sensor configuration as adaptive subsampling. Our adaptive subsampling algorithms comprise an object detector and an ROI predictor (Kalman filter) which operate in conjunction to optimize the energy efficiency of the vision pipeline with the end task being object tracking. To further facilitate the implementation of our adaptive algorithms in real life, we select a candidate algorithm and map it onto an FPGA. Leveraging Xilinx Vitis AI tools, we designed and accelerated a YOLO object detector-based adaptive subsampling algorithm. In order to further improve the algorithm post-deployment, we evaluated several competing baselines on the OTB100 and LaSOT datasets. We found that coupling the ECO tracker with the Kalman filter has a competitive AUC score of 0.4568 and 0.3471 on the OTB100 and LaSOT datasets respectively. Further, the power efficiency of this algorithm is on par with, and in a couple of instances superior to, the other baselines. The ECO-based algorithm incurs a power consumption of approximately 4 W averaged across both datasets while the YOLO-based approach requires power consumption of approximately 6 W (as per our power consumption model). In terms of accuracy-latency tradeoff, the ECO-based algorithm provides near-real-time performance (19.23 FPS) while managing to attain competitive tracking precision.
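摘要中的ROI预测器是一个卡尔曼滤波器。下面给出匀速运动假设下预测下一帧ROI中心的标准卡尔曼滤波示意(状态设计与噪声参数为假设值,并非论文原始实现):

```python
import numpy as np

# 状态 x = [cx, cy, vx, vy](ROI中心及其速度),观测 z = [cx, cy]
dt = 1.0
F_mat = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2   # 过程噪声(假设值)
R = np.eye(2) * 1.0    # 观测噪声(检测器抖动,假设值)

def predict(x, P):
    """预测步:给出下一帧ROI中心的先验估计。"""
    return F_mat @ x, F_mat @ P @ F_mat.T + Q

def update(x, P, z):
    """更新步:用检测器给出的ROI中心观测修正状态。"""
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(4) - K @ H) @ P

# 以预测的中心加上固定宽高,即可得到下一帧需要保持开启的像素ROI,其余像素关闭
```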

【5】 Cross-Domain Federated Learning in Medical Imaging 标题:医学影像学中的跨域联合学习 链接:https://arxiv.org/abs/2112.10001

作者:Vishwa S Parekh,Shuhao Lai,Vladimir Braverman,Jeff Leal,Steven Rowe,Jay J Pillai,Michael A Jacobs 机构:Jay J. Pillai , Department of Computer Science, The Johns Hopkins University, Baltimore, MD , The Russell H. Morgan Department of Radiology and Radiological Sciences, The Johns Hopkins, University School of Medicine, Baltimore, MD , USA 备注:Under Review for MIDL 2022 摘要:在医学成像领域,联邦学习正日益得到探索,以在分布于不同数据中心的大规模数据集上训练深度学习模型,同时避免传输敏感患者信息,从而保护隐私。在这篇手稿中,我们探讨了多领域、多任务环境中的联合学习,其中不同的参与节点可能包含来自不同领域的数据集,并经过训练以解决不同的任务。我们评估了跨域联邦学习在两种不同实验环境下的目标检测和分割任务:多模态和多器官。我们在跨域联邦学习框架上的实验结果非常令人鼓舞,器官定位的重叠相似度为0.79,病变分割的重叠相似度为0.65。我们的结果证明了联邦学习在开发多领域、多任务深度学习模型方面的潜力,而无需共享来自不同领域的数据。 摘要:Federated learning is increasingly being explored in the field of medical imaging to train deep learning models on large scale datasets distributed across different data centers while preserving privacy by avoiding the need to transfer sensitive patient information. In this manuscript, we explore federated learning in a multi-domain, multi-task setting wherein different participating nodes may contain datasets sourced from different domains and are trained to solve different tasks. We evaluated cross-domain federated learning for the tasks of object detection and segmentation across two different experimental settings: multi-modal and multi-organ. The result from our experiments on cross-domain federated learning framework were very encouraging with an overlap similarity of 0.79 for organ localization and 0.65 for lesion segmentation. Our results demonstrate the potential of federated learning in developing multi-domain, multi-task deep learning models without sharing data from different domains.
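摘要未给出具体的聚合算法;下面以经典的FedAvg为例给出一段最小示意(PyTorch;按客户端样本量加权平均参数,全程不传输任何原始影像数据。此处仅作说明,并非论文官方实现):

```python
import copy
import torch

def fedavg(global_model: torch.nn.Module, client_states, client_sizes):
    """FedAvg聚合示意:假设各客户端模型结构相同,
    client_states 为各客户端的 state_dict 列表,client_sizes 为各自样本量。"""
    total = sum(client_sizes)
    avg = copy.deepcopy(client_states[0])
    for k in avg:
        avg[k] = sum(s[k].float() * (n / total)
                     for s, n in zip(client_states, client_sizes))
    global_model.load_state_dict(avg)  # 仅交换模型参数,保护患者隐私
    return global_model
```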

半弱无监督|主动学习|不确定性(7篇)

【1】 Are Large-scale Datasets Necessary for Self-Supervised Pre-training? 标题:自监督预训练是否需要大规模数据集? 链接:https://arxiv.org/abs/2112.10740

作者:Alaaeldin El-Nouby,Gautier Izacard,Hugo Touvron,Ivan Laptev,Hervé Jegou,Edouard Grave 机构:Hervé Jégou, Meta AI, Inria, Sorbonne University 摘要:在像ImageNet这样的大规模数据集上预训练模型是计算机视觉中的标准实践。这种范式对于训练集较小的任务尤其有效,因为高容量的模型在这些任务上往往容易过拟合。在这项工作中,我们考虑一种只利用目标任务数据的自监督预训练方案,所涉及的数据集(如Stanford Cars、Sketch或COCO)比ImageNet小一个或多个数量级。我们的研究表明,去噪自动编码器(如BEiT或本文介绍的变体)比通过比较图像嵌入训练的常用自监督方法对预训练数据的类型和大小更具鲁棒性。与ImageNet预训练相比,我们在来自不同领域的各种分类数据集上获得了具有竞争力的性能。在COCO上,仅使用COCO图像进行预训练时,检测和实例分割性能在可比设置下超过了有监督的ImageNet预训练。 摘要:Pre-training models on large scale datasets, like ImageNet, is a standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit. In this work, we consider a self-supervised pre-training scenario that only leverages the target task data. We consider datasets, like Stanford Cars, Sketch or COCO, which are order(s) of magnitude smaller than Imagenet. Our study shows that denoising autoencoders, such as BEiT or a variant that we introduce in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings. We obtain competitive performance compared to ImageNet pre-training on a variety of classification datasets, from different domains. On COCO, when pre-training solely using COCO images, the detection and instance segmentation performance surpasses the supervised ImageNet pre-training in a comparable setting.

【2】 Wiener Guided DIP for Unsupervised Blind Image Deconvolution 标题:无监督盲图像反卷积的Wiener引导倾角算法 链接:https://arxiv.org/abs/2112.10271

作者:Gustav Bredell,Ertunc Erdil,Bruno Weber,Ender Konukoglu 机构: Department of Information Technology and Electrical Engineering, ETH-Zurich, Zurich, Switzerland, Institute of Pharmacology and Toxicology, University of Zurich, Zurich, Switzerland 摘要:盲反卷积是一个不适定问题,出现在从显微镜到天文学的各个领域。问题的不适定性质要求有足够的先验知识才能得到理想的解。最近的研究表明,深度学习架构可以在无监督盲反卷积优化过程中充当图像生成先验,但即使在单幅图像上也常常表现出性能波动。我们建议在优化过程中用维纳反卷积来引导图像生成器:利用从高斯核出发的辅助核估计,为其提供模糊图像的锐化版本。我们观察到,与低频特征相比,反卷积的高频伪影被延迟再现;此外,图像生成器再现反卷积图像低频特征的速度快于再现模糊图像低频特征的速度。我们将该计算过程嵌入到一个约束优化框架中,并表明所提出的方法在多个数据集上具有更高的稳定性和性能。此外,我们还提供了代码。 摘要:Blind deconvolution is an ill-posed problem arising in various fields ranging from microscopy to astronomy. The ill-posed nature of the problem requires adequate priors to arrive to a desirable solution. Recently, it has been shown that deep learning architectures can serve as an image generation prior during unsupervised blind deconvolution optimization, however often exhibiting a performance fluctuation even on a single image. We propose to use Wiener-deconvolution to guide the image generator during optimization by providing it a sharpened version of the blurry image using an auxiliary kernel estimate starting from a Gaussian. We observe that the high-frequency artifacts of deconvolution are reproduced with a delay compared to low-frequency features. In addition, the image generator reproduces low-frequency features of the deconvolved image faster than that of a blurry image. We embed the computational process in a constrained optimization framework and show that the proposed method yields higher stability and performance across multiple datasets. In addition, we provide the code.
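下面给出标准频域维纳反卷积的一段最小示意(NumPy),即论文中用于引导图像生成器的那一类滤波;其中噪声-信号功率比 nsr 的取值为假设:

```python
import numpy as np

def wiener_deconv(blurred: np.ndarray, kernel: np.ndarray, nsr: float = 1e-2) -> np.ndarray:
    """频域维纳反卷积示意: X = conj(K) * Y / (|K|^2 + NSR)。
    blurred 为模糊图像,kernel 为(当前估计的)模糊核,nsr 为正则项。"""
    K = np.fft.fft2(kernel, s=blurred.shape)  # 零填充到图像尺寸
    Y = np.fft.fft2(blurred)
    X = np.conj(K) * Y / (np.abs(K) ** 2 + nsr)
    return np.real(np.fft.ifft2(X))
```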

【3】 Denoised Labels for Financial Time-Series Data via Self-Supervised Learning 标题:基于自监督学习的金融时间序列数据去噪标签 链接:https://arxiv.org/abs/2112.10139

作者:Yanqing Ma,Carmine Ventre,Maria Polukarov 机构:King’s College London, London, United Kingdom 摘要:电子交易平台的引入有效地改变了传统系统性交易的组织方式,使市场从报价驱动转变为订单驱动。其便利性导致金融数据量呈指数增长,但由于金融时间序列的低信噪比和非平稳性,这些数据很难用于预测未来价格。较简单的分类任务(目标是通过监督学习算法预测未来价格运动的方向)需要足够可靠的标签才能很好地泛化。然而,与其他领域相比,金融数据的标注定义不那么明确:价格上涨是因为噪声还是因为信号?现有的标注方法对噪声的抵抗能力有限,对学习算法的改进效果也有限。这项工作的灵感来自图像分类在交易中的应用以及自监督学习的成功。我们研究了将计算机视觉技术应用于金融时间序列以减少噪声暴露、从而生成正确标签的想法。我们将标签生成视为自监督学习方法的前置任务(pretext task),并在相同的下游分类任务上,将文献中常用的朴素(含噪)标签与去噪自动编码器生成的标签进行比较。我们的结果表明,无论是对小数据集还是大数据集,我们的去噪标签都能提高下游学习算法的性能。我们进一步表明,所获得的信号可以用于通过二元策略进行有效交易。我们认为,借助所提出的技术,自监督学习构成了一个强大的框架,可以生成"更好"的金融标签,有助于研究市场的底层模式。 摘要:The introduction of electronic trading platforms effectively changed the organisation of traditional systemic trading from quote-driven markets into order-driven markets. Its convenience led to an exponentially increasing amount of financial data, which is however hard to use for the prediction of future prices, due to the low signal-to-noise ratio and the non-stationarity of financial time series. Simpler classification tasks -- where the goal is to predict the directions of future price movement -- via supervised learning algorithms, need sufficiently reliable labels to generalise well. Labelling financial data is however less well defined than other domains: did the price go up because of noise or because of signal? The existing labelling methods have limited countermeasures against noise and limited effects in improving learning algorithms. This work takes inspiration from image classification in trading and success in self-supervised learning. We investigate the idea of applying computer vision techniques to financial time-series to reduce the noise exposure and hence generate correct labels. We look at the label generation as the pretext task of a self-supervised learning approach and compare the naive (and noisy) labels, commonly used in the literature, with the labels generated by a denoising autoencoder for the same downstream classification task. Our results show that our denoised labels improve the performances of the downstream learning algorithm, for both small and large datasets. We further show that the signals we obtain can be used to effectively trade with binary strategies. We suggest that with proposed techniques, self-supervised learning constitutes a powerful framework for generating "better" financial labels that are useful for studying the underlying patterns of the market.
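作为对照,下面给出摘要提到的文献中常用的"朴素(含噪)标签"的一种典型构造示意(以未来收益的符号作为方向标签;horizon 的取值为假设):

```python
import numpy as np

def naive_direction_labels(prices: np.ndarray, horizon: int = 10) -> np.ndarray:
    """朴素方向标签示意:以未来 horizon 步收益的符号作为涨/跌标签。
    这类标签会把噪声驱动的波动也标成信号,正是论文试图去噪的对象。"""
    future_ret = prices[horizon:] / prices[:-horizon] - 1.0
    return (future_ret > 0).astype(int)  # 1=涨, 0=跌
```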

【4】 Space Non-cooperative Object Active Tracking with Deep Reinforcement Learning 标题:基于深度强化学习的空间非合作目标主动跟踪 链接:https://arxiv.org/abs/2112.09854

作者:Dong Zhou,Guanghui Sun,Wenxiao Lei 机构:Harbin Institute of Technology, Harbin, China 摘要:空间非合作目标的主动视觉跟踪对于未来智能航天器实现空间碎片清除、小行星探测、自主交会对接具有重要意义。然而,现有工作通常将该任务拆分为不同的子问题(如图像预处理、特征提取与匹配、位置和姿态估计、控制律设计)并单独优化各个模块,这样做既繁琐又次优。为此,我们提出了一种基于DQN算法的端到端主动视觉跟踪方法,命名为DRLAVT。它可以仅依靠彩色或RGBD图像引导追踪航天器接近任意空间非合作目标,明显优于采用最先进二维单目跟踪器SiamRPN的基于位置的视觉伺服基线算法。在不同的网络结构、不同的扰动和多个目标下进行的大量实验证明了DRLAVT的先进性和鲁棒性。此外,通过数百次试错,我们进一步证明了我们的方法确实通过深度强化学习学到了目标的运动模式。 摘要:Active visual tracking of space non-cooperative object is significant for future intelligent spacecraft to realise space debris removal, asteroid exploration, autonomous rendezvous and docking. However, existing works often consider this task into different subproblems (e.g. image preprocessing, feature extraction and matching, position and pose estimation, control law design) and optimize each module alone, which are trivial and sub-optimal. To this end, we propose an end-to-end active visual tracking method based on DQN algorithm, named as DRLAVT. It can guide the chasing spacecraft approach to arbitrary space non-cooperative target merely relied on color or RGBD images, which significantly outperforms position-based visual servoing baseline algorithm that adopts state-of-the-art 2D monocular tracker, SiamRPN. Extensive experiments implemented with diverse network architectures, different perturbations and multiple targets demonstrate the advancement and robustness of DRLAVT. In addition, We further prove our method indeed learnt the motion patterns of target with deep reinforcement learning through hundreds of trial-and-errors.

【5】 Can uncertainty boost the reliability of AI-based diagnostic methods in digital pathology? 标题:不确定性能提高数字病理学中基于人工智能的诊断方法的可靠性吗? 链接:https://arxiv.org/abs/2112.09693

作者:Milda Pocevičiūtė,Gabriel Eilertsen,Sofia Jarkman,Claes Lundström 机构:Lundström, Department of Science and Technology, Linköping University, Sweden, Center for Medical Image Science and Visualization, Linköping University, Sweden, Department of Clinical Pathology, and Department of Biomedical and Clinical 摘要:深度学习(DL)在数字病理学应用中显示出巨大潜力。基于DL的诊断解决方案的鲁棒性对于安全的临床部署至关重要。在这项工作中,我们评估了为数字病理学中的DL预测添加不确定性估计,是否能够通过提升总体预测性能或检测错误预测而增加临床应用价值。我们比较了模型内置方法(MC dropout和深度集成)与模型无关方法(测试时增强,TTA)的有效性。此外,还比较了四种不确定性度量。我们的实验集中于两种域偏移场景:转移到不同的医疗中心,以及转移到一种代表性不足的癌症亚型。我们的结果表明,不确定性估计可以增加一定的可靠性,并降低对分类阈值选择的敏感性。虽然高级度量和深度集成在我们的比较中表现最好,但与更简单的度量和TTA相比,附加价值很小。重要的是,所有被评估的不确定性估计方法的收益都会因域偏移而减弱。 摘要:Deep learning (DL) has shown great potential in digital pathology applications. The robustness of a diagnostic DL-based solution is essential for safe clinical deployment. In this work we evaluate if adding uncertainty estimates for DL predictions in digital pathology could result in increased value for the clinical applications, by boosting the general predictive performance or by detecting mispredictions. We compare the effectiveness of model-integrated methods (MC dropout and Deep ensembles) with a model-agnostic approach (Test time augmentation, TTA). Moreover, four uncertainty metrics are compared. Our experiments focus on two domain shift scenarios: a shift to a different medical center and to an underrepresented subtype of cancer. Our results show that uncertainty estimates can add some reliability and reduce sensitivity to classification threshold selection. While advanced metrics and deep ensembles perform best in our comparison, the added value over simpler metrics and TTA is small. Importantly, the benefit of all evaluated uncertainty estimation methods is diminished by domain shift.

【6】 Incremental Cross-view Mutual Distillation for Self-supervised Medical CT Synthesis 标题:用于自监督医学CT合成的增量交叉视图互蒸馏 链接:https://arxiv.org/abs/2112.10325

作者:Chaowei Fang,Liang Wang,Dingwen Zhang,Jun Xu,Yixuan Yuan,Junwei Han 机构:Xidian University, Northwestern Polytechnical University, Nankai University, City University of Hong Kong 摘要:由于成像设备的限制和操作时间的高成本,计算机断层扫描(CT)通常以较低的切片内分辨率获得。提高切片内分辨率有利于人类专家和计算机辅助系统的疾病诊断。为此,本文构建了一种新的医学切片合成方法,以提高切片间分辨率。考虑到临床实践中往往缺少真值(ground-truth)中间切片,我们引入增量交叉视图互蒸馏策略,以自监督学习的方式完成这项任务。具体来说,我们从三个不同的视图对该问题进行建模:轴向视图下的切片级插值,以及冠状面和矢状面视图下的像素级插值。在这种情况下,从不同视图学习的模型可以相互蒸馏有价值的知识,指导彼此的学习过程。我们可以重复这一过程,使模型合成中间切片数据,不断提高切片间分辨率。为了证明该方法的有效性,我们在一个大规模CT数据集上进行了综合实验。定量和定性比较结果表明,我们的方法明显优于最先进的算法。 摘要:Due to the constraints of the imaging device and high cost in operation time, computer tomography (CT) scans are usually acquired with low intra-slice resolution. Improving the intra-slice resolution is beneficial to the disease diagnosis for both human experts and computer-aided systems. To this end, this paper builds a novel medical slice synthesis to increase the between-slice resolution. Considering that the ground-truth intermediate medical slices are always absent in clinical practice, we introduce the incremental cross-view mutual distillation strategy to accomplish this task in the self-supervised learning manner. Specifically, we model this problem from three different views: slice-wise interpolation from axial view and pixel-wise interpolation from coronal and sagittal views. Under this circumstance, the models learned from different views can distill valuable knowledge to guide the learning processes of each other. We can repeat this process to make the models synthesize intermediate slice data with increasing inter-slice resolution. To demonstrate the effectiveness of the proposed approach, we conduct comprehensive experiments on a large-scale CT dataset. Quantitative and qualitative comparison results show that our method outperforms state-of-the-art algorithms by clear margins.

【7】 Supervised laser-speckle image sampling of skin tissue to detect very early stage of diabetes by its effects on skin subcellular properties 标题:利用糖尿病对皮肤亚细胞特性的影响,通过有监督激光散斑皮肤组织图像采样检测极早期糖尿病 链接:https://arxiv.org/abs/2112.10024

作者:Ahmet Orun,Luke Vella Critien,Jennifer Carter,Martin Stacey 机构:De Montfort University, Leicester, LE,BH, UK 摘要:本文研究了基于K近邻算法的激光散斑图像采样专家系统在糖尿病早期检测中的有效性。随着人工智能引导的激光散斑成像技术的最新发展,有可能将波长、能量水平等激光参数以及图像纹理度量与合适的人工智能技术相结合进行优化,使其与皮肤组织的亚细胞特性有效地相互作用,从而检测糖尿病的早期迹象。新方法可能比传统的皮肤葡萄糖水平观察更有效,因为它对激光物理和人工智能技术的组合进行了优化;此外,它还允许非专业人员进行更频繁的皮肤组织检测,以便及早发现糖尿病。 摘要:This paper investigates the effectiveness of an expert system based on K-nearest neighbors algorithm for laser speckle image sampling applied to the early detection of diabetes. With the latest developments in artificial intelligent guided laser speckle imaging technologies, it may be possible to optimise laser parameters, such as wavelength, energy level and image texture measures in association with a suitable AI technique to interact effectively with the subcellular properties of a skin tissue to detect early signs of diabetes. The new approach is potentially more effective than the classical skin glucose level observation because of its optimised combination of laser physics and AI techniques, and additionally, it allows non-expert individuals to perform more frequent skin tissue tests for an early detection of diabetes.

时序|行为识别|姿态|视频|运动估计(8篇)

【1】 BAPose: Bottom-Up Pose Estimation with Disentangled Waterfall Representations 标题:BAPose:基于解缠瀑布表示的自底向上姿态估计 链接:https://arxiv.org/abs/2112.10716

作者:Bruno Artacho,Andreas Savakis 机构:Rochester Institute of Technology, Rochester, NY 摘要:我们提出了BAPose,一种新的自底向上方法,可以获得最先进的多人姿态估计结果。我们的端到端可训练框架利用解缠的多尺度瀑布式体系结构,并结合自适应卷积,在有遮挡的拥挤场景中更精确地推断关键点。由BAPose中的解缠瀑布模块获得的多尺度表示,既利用了级联体系结构中渐进式滤波的效率,又保持了与空间金字塔配置相当的多尺度视场。我们在具有挑战性的COCO和CrowdPose数据集上的结果表明,BAPose是一个高效、鲁棒的多人姿态估计框架,相对于最先进的精度取得了显著改进。 摘要:We propose BAPose, a novel bottom-up approach that achieves state-of-the-art results for multi-person pose estimation. Our end-to-end trainable framework leverages a disentangled multi-scale waterfall architecture and incorporates adaptive convolutions to infer keypoints more precisely in crowded scenes with occlusions. The multi-scale representations, obtained by the disentangled waterfall module in BAPose, leverage the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Our results on the challenging COCO and CrowdPose datasets demonstrate that BAPose is an efficient and robust framework for multi-person pose estimation, achieving significant improvements on state-of-the-art accuracy.

【2】 Learning Spatio-Temporal Specifications for Dynamical Systems 标题:学习动态系统的时空规范 链接:https://arxiv.org/abs/2112.10714

作者:Suhail Alsalehi,Erfan Aasi,Ron Weiss,Calin Belta 机构:Division of Systems Engineering, Boston University, Boston, MA, USA, Mechanical Engineering Department, Boston University, Boston, MA, USA, Biological Engineering Department, Massachusetts Institute of Technology, Cambridge, MA, USA 备注:12 pages, submitted to L4DC 2021 摘要:从数据中学习动力系统的属性可以提供重要的见解,帮助我们理解此类系统并减轻不期望的结果。在这项工作中,我们提出了一个从数据中学习时空(ST)属性并将其表示为形式逻辑规范的框架。我们介绍了SVM-STL,它是信号时序逻辑(STL)的一个扩展,能够刻画各种具有时变空间模式的动力系统的时空特性。我们的框架利用机器学习技术,从由空间模式序列给出的系统执行中学习SVM-STL规范。我们提出了同时处理标记和未标记数据的方法。此外,给定SVM-STL规范形式的系统需求,我们提供了一种参数综合方法,以找到最大程度满足此类规范的参数。我们的学习框架和参数综合方法在一个反应扩散系统的例子中得到了展示。 摘要:Learning dynamical systems properties from data provides important insights that help us understand such systems and mitigate undesired outcomes. In this work, we propose a framework for learning spatio-temporal (ST) properties as formal logic specifications from data. We introduce SVM-STL, an extension of Signal Temporal Logic (STL), capable of specifying spatial and temporal properties of a wide range of dynamical systems that exhibit time-varying spatial patterns. Our framework utilizes machine learning techniques to learn SVM-STL specifications from system executions given by sequences of spatial patterns. We present methods to deal with both labeled and unlabeled data. In addition, given system requirements in the form of SVM-STL specifications, we provide an approach for parameter synthesis to find parameters that maximize the satisfaction of such specifications. Our learning framework and parameter synthesis approach are showcased in an example of a reaction-diffusion system.

【3】 A Multi-user Oriented Live Free-viewpoint Video Streaming System Based On View Interpolation 标题:一种基于视图插值的面向多用户的实时自由视点视频流系统 链接:https://arxiv.org/abs/2112.10603

作者:Jingchuan Hu,Shuai Guo,Yu Dong,Kai Zhou,Jun Xu,Li Song 机构:Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China 备注:6 pages, 4 figures 摘要:自由视点视频(FVV)作为沉浸式多媒体服务的一种重要应用形式,通过强交互性为用户提供沉浸式体验。然而,虚拟视点合成算法的计算复杂度对FVV系统的实时性能提出了重大挑战。此外,用户交互的个性化使得采用传统架构的系统难以同时为多个用户提供服务。在本文中,我们创新性地引入一种基于CNN的视图插值算法,用于实时合成密集的虚拟视点。在此基础上,我们还构建了一个端到端的实时自由视点系统,该系统采用面向多用户的流媒体策略。我们的系统可以利用单个边缘服务器同时为多个用户服务,而不必给客户端带来繁重的视点合成负载。我们对整个系统进行了分析,结果表明我们的方法在视觉质量和延迟方面都能为用户提供愉悦的沉浸式体验。 摘要:As an important application form of immersive multimedia services, free-viewpoint video(FVV) enables users with great immersive experience by strong interaction. However, the computational complexity of virtual view synthesis algorithms poses a significant challenge to the real-time performance of an FVV system. Furthermore, the individuality of user interaction makes it difficult to serve multiple users simultaneously for a system with conventional architecture. In this paper, we novelly introduce a CNN-based view interpolation algorithm to synthesis dense virtual views in real time. Based on this, we also build an end-to-end live free-viewpoint system with a multi-user oriented streaming strategy. Our system can utilize a single edge server to serve multiple users at the same time without having to bring a large view synthesis load on the client side. We analysis the whole system and show that our approaches give the user a pleasant immersive experience, in terms of both visual quality and latency.

【4】 Real-Time Optical Flow for Vehicular Perception with Low- and High-Resolution Event Cameras 标题:低分辨率和高分辨率事件相机的车辆感知实时光流 链接:https://arxiv.org/abs/2112.10591

作者:Vincent Brebion,Julien Moreau,Franck Davoine 机构: Davoine are with Université de technologie de Compiègne (UTC) 备注:13 pages, journal paper 摘要:事件相机捕捉被观测场景中光照的变化,而不是通过累积光线来生成图像。因此,它们适用于高速运动和复杂光照条件下的应用,而在这些场景中,传统基于帧的传感器会因模糊和过曝或欠曝的像素而暴露其局限性。得益于这些独特的特性,它们如今已成为智能交通系统(ITS)相关应用中极具吸引力的传感器。随着这类神经形态相机的普及,基于事件的光流(EBOF)得到了研究。然而,最近出现的高分辨率神经形态传感器对现有方法提出了挑战,因为事件像素阵列的分辨率更高,吞吐量也大得多。针对这些问题,我们提出了一个优化的框架,可以用低分辨率和高分辨率事件相机实时计算光流。我们以"逆指数距离面"的形式为稀疏事件流建立了一种新的稠密表示,它作为一个中间帧,便于使用经过验证的、最先进的基于帧的光流计算方法。我们在低分辨率和高分辨率驾驶序列上对我们的方法进行了评估,结果表明它通常能取得优于当前最先进方法的结果,同时达到更高的帧率:346 x 260像素时为250Hz,1280 x 720像素时为77Hz。 摘要:Event cameras capture changes of illumination in the observed scene rather than accumulating light to create images. Thus, they allow for applications under high-speed motion and complex lighting conditions, where traditional frame-based sensors show their limits with blur and over- or underexposed pixels. Thanks to these unique properties, they represent nowadays an highly attractive sensor for ITS-related applications. Event-based optical flow (EBOF) has been studied following the rise in popularity of these neuromorphic cameras. The recent arrival of high-definition neuromorphic sensors, however, challenges the existing approaches, because of the increased resolution of the events pixel array and a much higher throughput. As an answer to these points, we propose an optimized framework for computing optical flow in real-time with both low- and high-resolution event cameras. We formulate a novel dense representation for the sparse events flow, in the form of the "inverse exponential distance surface". It serves as an interim frame, designed for the use of proven, state-of-the-art frame-based optical flow computation methods. We evaluate our approach on both low- and high-resolution driving sequences, and show that it often achieves better results than the current state of the art, while also reaching higher frame rates, 250Hz at 346 x 260 pixels and 77Hz at 1280 x 720 pixels.
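论文提出的"逆指数距离面"是一种把稀疏事件流转成稠密中间帧的表示。下面给出思路相近的经典时间面(time surface)的最小示意(并非论文原始公式;tau 为假设的衰减常数):

```python
import numpy as np

def time_surface(events: np.ndarray, shape, t_now: float, tau: float = 0.05) -> np.ndarray:
    """将事件流渲染为稠密帧的经典做法:每个像素的值随其最近一次事件
    距当前时刻的时间差指数衰减。events: (N, 3) 数组,每行为 (x, y, t)。"""
    last_t = np.full(shape, -np.inf)
    for x, y, t in events:
        last_t[int(y), int(x)] = max(last_t[int(y), int(x)], t)
    return np.exp(-(t_now - last_t) / tau)  # 从未触发事件的像素值为 0
```

得到这样的稠密帧后,即可像论文所说的那样,直接套用成熟的基于帧的光流算法。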

【5】 DMS-GCN: Dynamic Mutiscale Spatiotemporal Graph Convolutional Networks for Human Motion Prediction 标题:DMS-GCN:用于人体运动预测的动态多尺度时空图卷积网络 链接:https://arxiv.org/abs/2112.10365

作者:Zigeng Yan,Di-Hua Zhai,Yuanqing Xia 机构:School of Automation, Beijing Institute of Technology 摘要:在许多计算机视觉应用领域中,人体运动预测是一项重要而富有挑战性的任务。最近的工作集中于利用递归神经网络(RNN)的时序处理能力,以在短期预测中获得平滑和可靠的结果。然而,正如以前的工作所证明的,RNN存在误差累积问题,导致结果不可靠。在本文中,我们提出了一种用于运动预测的简单前馈深度神经网络,它同时考虑了时间平滑性和人体关节之间的空间依赖性。我们设计了一个多尺度时空图卷积网络(GCN),隐式地建立人体运动过程中的时空依赖关系,其中不同尺度在训练过程中动态融合。整个模型适用于所有动作,并遵循编码器-解码器框架。编码器由捕获帧间运动特征的时间GCN和半自主学习的空间GCN组成,后者用于提取关节轨迹间的空间结构。解码器使用时间卷积网络(TCN)来保持其扩展能力。大量实验表明,我们的方法在Human3.6M和CMU Mocap数据集上优于SOTA方法,而所需参数要少得多。代码将在 https://github.com/yzg9353/DMSGCN 发布。 摘要:Human motion prediction is an important and challenging task in many computer vision application domains. Recent work concentrates on utilizing the timing processing ability of recurrent neural networks (RNNs) to achieve smooth and reliable results in short-term prediction. However, as evidenced by previous work, RNNs suffer from errors accumulation, leading to unreliable results. In this paper, we propose a simple feed-forward deep neural network for motion prediction, which takes into account temporal smoothness and spatial dependencies between human body joints. We design a Multi-scale Spatio-temporal graph convolutional networks (GCNs) to implicitly establish the Spatio-temporal dependence in the process of human movement, where different scales fused dynamically during training. The entire model is suitable for all actions and follows a framework of encoder-decoder. The encoder consists of temporal GCNs to capture motion features between frames and semi-autonomous learned spatial GCNs to extract spatial structure among joint trajectories. The decoder uses temporal convolution networks (TCNs) to maintain its extensive ability. Extensive experiments show that our approach outperforms SOTA methods on the datasets of Human3.6M and CMU Mocap while only requiring much lesser parameters. Code will be available at https://github.com/yzg9353/DMSGCN.
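下面给出图卷积基本单元的一段最小示意(PyTorch),说明"用可学习邻接矩阵建模关节间空间依赖"的常见做法;该简化形式(H' = A·H·W 加ReLU)为假设,并非DMS-GCN的原始结构:

```python
import torch

class GraphConv(torch.nn.Module):
    """空间图卷积单元示意:可学习的关节连接矩阵 A 与线性变换 W。"""
    def __init__(self, num_joints: int, in_dim: int, out_dim: int):
        super().__init__()
        self.A = torch.nn.Parameter(torch.eye(num_joints))  # 可学习邻接(含自环初始化)
        self.W = torch.nn.Linear(in_dim, out_dim)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (B, J, in_dim) 每个关节一个特征向量
        return torch.relu(self.A @ self.W(H))               # (B, J, out_dim)
```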

【6】 End-to-End Learning of Multi-category 3D Pose and Shape Estimation 标题:多类别三维位姿和形状估计的端到端学习 链接:https://arxiv.org/abs/2112.10196

作者:Yigit Baran Can,Alexander Liniger,Danda Pani Paudel,Luc Van Gool 机构:Computer Vision Lab, ETH Zurich, VISICS, ESATPSI, KU Leuven 摘要:在本文中,我们研究利用关键点来表示物体的形状和姿态。为此,我们提出了一种端到端的方法,可以同时从图像中检测二维关键点并将其提升到三维。该方法仅从二维关键点标注中学习二维检测和三维提升。在这方面,我们首次提出了一种通过基于数据增强的循环自监督来显式解耦姿态和三维形状的新方法。除了从图像到3D的端到端学习外,我们的方法还使用单个神经网络处理来自多个类别的物体。我们使用基于Transformer的体系结构来检测关键点,并概括图像的视觉上下文。随后在将关键点提升到3D时使用该视觉上下文信息,以便通过基于上下文的推理获得更好的性能。在提升过程中,我们的方法学习一小组基形状及其稀疏的非负系数,以在规范坐标系中表示三维形状。我们的方法可以处理遮挡以及种类繁多的物体类别。我们在三个基准上的实验表明,我们的方法优于最先进的方法。我们的源代码将公开提供。 摘要:In this paper, we study the representation of the shape and pose of objects using their keypoints. Therefore, we propose an end-to-end method that simultaneously detects 2D keypoints from an image and lifts them to 3D. The proposed method learns both 2D detection and 3D lifting only from 2D keypoints annotations. In this regard, a novel method that explicitly disentangles the pose and 3D shape by means of augmentation-based cyclic self-supervision is proposed, for the first time. In addition of being end-to-end in image to 3D learning, our method also handles objects from multiple categories using a single neural network. We use a Transformer-based architecture to detect the keypoints, as well as to summarize the visual context of the image. This visual context information is then used while lifting the keypoints to 3D, so as to allow the context-based reasoning for better performance. While lifting, our method learns a small set of basis shapes and their sparse non-negative coefficients to represent the 3D shape in canonical frame. Our method can handle occlusions as well as wide variety of object classes. Our experiments on three benchmarks demonstrate that our method performs better than the state-of-the-art. Our source code will be made publicly available.

【7】 Adversarial Memory Networks for Action Prediction 标题:用于动作预测的对抗性记忆网络 链接:https://arxiv.org/abs/2112.09875

作者:Zhiqiang Tao,Yue Bai,Handong Zhao,Sheng Li,Yu Kong,Yun Fu 机构:University of Georgia 摘要:动作预测的目的是通过部分观测到的视频推断即将发生的人类动作,由于早期观测所包含的信息有限,这是一项具有挑战性的任务。现有方法主要采用重建策略来处理该任务,期望学习从部分观测到完整视频的单一映射函数,以便利预测过程。在本研究中,我们提出对抗性记忆网络(AMemNet),从两个新的方面、以部分视频查询为条件生成"完整视频"特征。首先,设计了一个键值结构的记忆生成器,将不同的部分视频记忆为键(key)记忆,并通过门控机制和查询注意力将完整视频动态写入值(value)记忆。其次,我们开发了一个类感知鉴别器,在对抗训练中引导记忆生成器提供不仅真实而且具有判别力的完整视频特征。AMemNet的最终预测结果由RGB流和光流的后期融合给出。在两个基准视频数据集UCF-101和HMDB51上的大量实验结果证明了所提出的AMemNet模型相对于最新方法的有效性。 摘要:Action prediction aims to infer the forthcoming human action with partially-observed videos, which is a challenging task due to the limited information underlying early observations. Existing methods mainly adopt a reconstruction strategy to handle this task, expecting to learn a single mapping function from partial observations to full videos to facilitate the prediction process. In this study, we propose adversarial memory networks (AMemNet) to generate the "full video" feature conditioning on a partial video query from two new aspects. Firstly, a key-value structured memory generator is designed to memorize different partial videos as key memories and dynamically write full videos in value memories with gating mechanism and querying attention. Secondly, we develop a class-aware discriminator to guide the memory generator to deliver not only realistic but also discriminative full video features upon adversarial training. The final prediction result of AMemNet is given by late fusion over RGB and optical flow streams. Extensive experimental results on two benchmark video datasets, UCF-101 and HMDB51, are provided to demonstrate the effectiveness of the proposed AMemNet model over state-of-the-art methods.
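下面按摘要所述"键值结构记忆 + 查询注意力"给出记忆读取步骤的最小示意(PyTorch;缩放点积注意力的选取为假设,并非AMemNet完整结构):

```python
import torch
import torch.nn.functional as F

def memory_read(query: torch.Tensor, keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """键值记忆的注意力读取示意。
    query: (B, D) 部分视频特征; keys/values: (M, D) 的 M 个记忆槽。"""
    attn = F.softmax(query @ keys.t() / keys.size(1) ** 0.5, dim=-1)  # (B, M)
    return attn @ values  # (B, D),由记忆聚合出的"完整视频"特征
```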

【8】 Soundify: Matching Sound Effects to Video 标题:Soundify:将音效与视频进行匹配 链接:https://arxiv.org/abs/2112.09726

作者:David Chuan-En Lin,Anastasis Germanidis,Cristóbal Valenzuela,Yining Shi,Nikolas Martelaro 机构:Carnegie Mellon University,Runway 备注:NeurIPS 2021 Workshop on Machine Learning for Creativity and Design 摘要:在视频剪辑的艺术中,声音实际上占了故事的一半。熟练的视频剪辑师会在画面上叠加声音(如音效和环境音),为物体增添个性或让观众沉浸在空间中。然而,通过对专业视频剪辑师的形成性访谈,我们发现这一过程可能极其繁琐和耗时。我们介绍了Soundify,一个将音效与视频相匹配的系统。通过利用带标签的录音棚级音效库,并将CLIP(一种具有出色Zero-Shot图像分类能力的神经网络)扩展为"Zero-Shot检测器",我们能够在无需资源密集的对应学习(correspondence learning)或音频生成的情况下产生高质量的结果。我们鼓励您访问 https://chuanenlin.com/soundify 查看(或者更好地,聆听)我们的结果。 摘要:In the art of video editing, sound is really half the story. A skilled video editor overlays sounds, such as effects and ambients, over footage to add character to an object or immerse the viewer within a space. However, through formative interviews with professional video editors, we found that this process can be extremely tedious and time-consuming. We introduce Soundify, a system that matches sound effects to video. By leveraging labeled, studio-quality sound effects libraries and extending CLIP, a neural network with impressive zero-shot image classification capabilities, into a "zero-shot detector", we are able to produce high-quality results without resource-intensive correspondence learning or audio generation. We encourage you to have a look at, or better yet, have a listen to the results at https://chuanenlin.com/soundify.
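下面给出"用CLIP为视频帧匹配音效标签"这一核心思路的最小示意(使用OpenAI的clip包;标签列表与文件名均为假设,且省略了论文中的检测与时间同步环节):

```python
# 依赖: pip install torch pillow git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["rain", "traffic", "birds", "keyboard typing"]  # 音效库标签(假设)
text = clip.tokenize([f"a photo of {l}" for l in labels]).to(device)
frame = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(frame, text)   # 帧与各标签的相似度
    probs = logits_per_image.softmax(dim=-1)

print("为该帧匹配的音效标签:", labels[int(probs.argmax())])
```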

GAN|对抗|攻击|生成相关(6篇)

【1】 3D-aware Image Synthesis via Learning Structural and Textural Representations 标题:基于结构和纹理表示学习的3D感知图像合成 链接:https://arxiv.org/abs/2112.10759

作者:Yinghao Xu,Sida Peng,Ceyuan Yang,Yujun Shen,Bolei Zhou 机构:The Chinese University of Hong Kong, Zhejiang University, Bytedance Inc. 备注:Project page: this https URL; Code: this https URL 摘要:让生成模型具备三维感知能力,可以在二维图像空间和三维物理世界之间架起桥梁,但这仍然具有挑战性。最近的尝试为生成对抗网络(GAN)配备了神经辐射场(NeRF)作为3D先验,NeRF将3D坐标映射到像素值。然而,NeRF中的隐式函数具有非常局部的感受野,使得生成器难以感知全局结构。同时,NeRF建立在体渲染的基础上,其开销过大,难以生成高分辨率结果,也增加了优化难度。为了缓解这两个问题,我们提出了一个新的框架,称为VolumeGAN,通过显式地学习结构表示和纹理表示来实现高保真的3D感知图像合成。我们首先学习一个表示底层结构的特征体,然后使用类NeRF模型将其转换为特征场。特征场进一步累积为2D特征图作为纹理表示,随后由神经渲染器进行外观合成。这种设计实现了形状和外观的独立控制。在多个数据集上的大量实验表明,与以前的方法相比,我们的方法实现了明显更高的图像质量和更好的三维控制。 摘要:Making generative models 3D-aware bridges the 2D image space and the 3D physical world yet remains challenging. Recent attempts equip a Generative Adversarial Network (GAN) with a Neural Radiance Field (NeRF), which maps 3D coordinates to pixel values, as a 3D prior. However, the implicit function in NeRF has a very local receptive field, making the generator hard to become aware of the global structure. Meanwhile, NeRF is built on volume rendering which can be too costly to produce high-resolution results, increasing the optimization difficulty. To alleviate these two problems, we propose a novel framework, termed as VolumeGAN, for high-fidelity 3D-aware image synthesis, through explicitly learning a structural representation and a textural representation. We first learn a feature volume to represent the underlying structure, which is then converted to a feature field using a NeRF-like model. The feature field is further accumulated into a 2D feature map as the textural representation, followed by a neural renderer for appearance synthesis. Such a design enables independent control of the shape and the appearance. Extensive experiments on a wide range of datasets show that our approach achieves sufficiently higher image quality and better 3D control than the previous methods.

【2】 High-Resolution Image Synthesis with Latent Diffusion Models 标题:基于潜在扩散模型的高分辨率图像合成 链接:https://arxiv.org/abs/2112.10752

作者:Robin Rombach,Andreas Blattmann,Dominik Lorenz,Patrick Esser,Björn Ommer 机构:Björn Ommer, Ludwig Maximilian University of Munich & IWR, Heidelberg University, Germany, Runway ML 摘要:通过将图像形成过程分解为去噪自动编码器的顺序应用,扩散模型(DM)在图像数据及其他领域实现了最先进的合成效果。此外,其公式允许通过引导机制来控制图像生成过程而无需重新训练。然而,由于这些模型通常直接在像素空间中运行,优化功能强大的DM往往要消耗数百个GPU天,而且由于逐步的顺序求值,推理的开销也很大。为了在有限的计算资源上训练DM,同时保持其质量和灵活性,我们将其应用于强大的预训练自动编码器的潜空间中。与以前的工作不同,在这种表示上训练扩散模型首次能够在降低复杂度和保留细节之间达到接近最优的平衡点,从而大大提升视觉保真度。通过在模型架构中引入交叉注意力层,我们将扩散模型转变为强大而灵活的生成器,可处理文本或边界框等一般条件输入,并以卷积方式实现高分辨率合成。我们的潜在扩散模型(LDM)在图像修复方面达到了新的最高水平,并在各种任务上具有很强的竞争力,包括无条件图像生成、语义场景合成和超分辨率,同时与基于像素的DM相比显著降低了计算需求。代码可在 https://github.com/CompVis/latent-diffusion 获取。 摘要:By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .
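下面给出"先用预训练自编码器把图像压到潜空间、再在潜空间上训练去噪目标"这一核心思路的训练步骤示意(PyTorch;vae/unet 的接口和余弦噪声日程均为假设,并非官方代码):

```python
import torch
import torch.nn.functional as F

def ldm_training_step(vae, unet, x0: torch.Tensor, T: int = 1000) -> torch.Tensor:
    """潜在扩散训练步骤示意:假设 vae.encode(x) 返回潜变量,
    unet(z_t, t) 预测所加噪声。"""
    with torch.no_grad():
        z0 = vae.encode(x0)                    # 在预训练自编码器的潜空间上操作
    t = torch.randint(0, T, (z0.size(0),), device=z0.device)
    alpha_bar = torch.cos(t.float() / T * torch.pi / 2) ** 2   # 示意噪声日程
    a = alpha_bar.view(-1, *([1] * (z0.dim() - 1)))
    noise = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise               # 前向加噪
    return F.mse_loss(unet(z_t, t), noise)                     # 预测噪声的MSE
```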

【3】 GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models 标题:GLIDE:使用文本引导扩散模型实现照片级真实感图像生成与编辑 链接:https://arxiv.org/abs/2112.10741

作者:Alex Nichol,Prafulla Dhariwal,Aditya Ramesh,Pranav Shyam,Pamela Mishkin,Bob McGrew,Ilya Sutskever,Mark Chen 备注:20 pages, 18 figures 摘要:最近的研究表明,扩散模型可以生成高质量的合成图像,尤其是在与引导(guidance)技术结合、以多样性换取保真度时。我们针对文本条件图像合成问题探索了扩散模型,并比较了两种不同的引导策略:CLIP引导和无分类器引导(classifier-free guidance)。我们发现,在照片真实感和标题相似度方面,人类评估者都更偏好后者,且后者通常能生成照片级真实感的样本。在使用无分类器引导的情况下,35亿参数文本条件扩散模型生成的样本比DALL-E的样本更受人类评估者青睐,即使后者使用了昂贵的CLIP重排序。此外,我们发现我们的模型可以通过微调来执行图像修复,从而实现强大的文本驱动图像编辑。我们在过滤后的数据集上训练了一个较小的模型,并在 https://github.com/openai/glide-text2im 发布了代码和权重。 摘要:Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
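摘要中的"无分类器引导"对应一个广为人知的组合公式,其最小示意如下(model 的接口为假设;训练时需随机丢弃条件以同时学到无条件预测):

```python
import torch

def cfg_noise_pred(model, x_t: torch.Tensor, t: torch.Tensor, cond, guidance_scale: float):
    """无分类器引导: eps = eps_uncond + s * (eps_cond - eps_uncond)。
    s > 1 时以牺牲多样性为代价提升与文本条件的贴合度和保真度。"""
    eps_cond = model(x_t, t, cond)      # 条件(文本)噪声预测
    eps_uncond = model(x_t, t, None)    # 无条件噪声预测
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```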

【4】 Calorie Aware Automatic Meal Kit Generation from an Image 标题:从图像中自动生成卡路里感知套餐 链接:https://arxiv.org/abs/2112.09839

作者:Ahmad Babaeian Jelodar,Yu Sun 机构: Sun are with the Department of Computer Science and Engineering, University of South Florida 摘要:近年来,热量和营养研究受到越来越多的关注。但是,由于问题的复杂性,该领域的文献主要集中在有限的配料子集或菜肴类型上,并采用简单的卷积神经网络或传统机器学习。同时,对配料份量的估计有助于改进热量估计,并根据给定图像复现膳食。在本文中,针对单张烹饪图像,提出了一个用于热量估计和不同份量膳食复现的管道。管道包括两个阶段。在第一阶段,预测与给定图像中膳食相关联的一组配料。在第二阶段,在给定图像特征和配料的情况下,使用基于Transformer的深度模型同时估计各配料的份量以及最终的总热量。模型中引入的份量估计有助于改进热量估计,也有利于按不同食用份量复现膳食。为了展示该管道的好处,该模型可用于生成餐包(meal kit)。为了评估该管道,使用了大规模数据集Recipe1M。在实验之前,对Recipe1M数据集进行了解析,并对配料份量进行了显式标注。实验表明,使用配料及其份量可以显著改进热量估计。此外,还创建了一个可视化界面,用户可以在其中与管道交互,以获得准确的热量估计,并生成用于烹饪的餐包。 摘要:Calorie and nutrition research has attained increased interest in recent years. But, due to the complexity of the problem, literature in this area focuses on a limited subset of ingredients or dish types and simple convolutional neural networks or traditional machine learning. Simultaneously, estimation of ingredient portions can help improve calorie estimation and meal re-production from a given image. In this paper, given a single cooking image, a pipeline for calorie estimation and meal re-production for different servings of the meal is proposed. The pipeline contains two stages. In the first stage, a set of ingredients associated with the meal in the given image are predicted. In the second stage, given image features and ingredients, portions of the ingredients and finally the total meal calorie are simultaneously estimated using a deep transformer-based model. Portion estimation introduced in the model helps improve calorie estimation and is also beneficial for meal re-production in different serving sizes. To demonstrate the benefits of the pipeline, the model can be used for meal kits generation. To evaluate the pipeline, the large scale dataset Recipe1M is used. Prior to experiments, the Recipe1M dataset is parsed and explicitly annotated with portions of ingredients. Experiments show that using ingredients and their portions significantly improves calorie estimation. Also, a visual interface is created in which a user can interact with the pipeline to reach accurate calorie estimations and generate a meal kit for cooking purposes.

【5】 Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs 标题:利用长期依赖关系生成动态场景图 链接:https://arxiv.org/abs/2112.09828

作者:Shengyu Feng,Subarna Tripathi,Hesham Mostafa,Marcel Nassar,Somdeb Majumdar 机构:University of Illinois at Urbana Champaign, Intel Labs 摘要:动态场景图形式的结构化视频表示是实现多个视频理解任务的有效工具。与从图像生成场景图的任务相比,由于场景的时间动态性和预测的固有时间波动性,动态场景图生成更具挑战性。我们表明,捕获长期依赖关系是有效生成动态场景图的关键。我们提出了检测-跟踪-识别范式:先从视频中构建一致的长期对象轨迹(tracklet),再用Transformer捕捉对象的动态和视觉关系。实验结果表明,我们的动态场景图检测Transformer(DSG-DETR)在基准数据集Action Genome上显著优于最先进的方法。我们还进行了消融研究,并验证了所提出方法每个组成部分的有效性。 摘要:Structured video representation in the form of dynamic scene graphs is an effective tool for several video understanding tasks. Compared to the task of scene graph generation from images, dynamic scene graph generation is more challenging due to the temporal dynamics of the scene and the inherent temporal fluctuations of predictions. We show that capturing long-term dependencies is the key to effective generation of dynamic scene graphs. We present the detect-track-recognize paradigm by constructing consistent long-term object tracklets from a video, followed by transformers to capture the dynamics of objects and visual relations. Experimental results demonstrate that our Dynamic Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art methods by a significant margin on the benchmark dataset Action Genome. We also perform ablation studies and validate the effectiveness of each component of the proposed approach.

【6】 A Streaming Volumetric Image Generation Framework for Development and Evaluation of Out-of-Core Methods 标题:一种用于开发和评估核外(Out-of-Core)方法的流式体积图像生成框架 链接:https://arxiv.org/abs/2112.09809

作者:Dominik Drees,Xiaoyi Jiang 机构: University of Münster 摘要:近年来,随着三维成像技术的发展,大样本的高分辨率体积图像越来越多。由此产生的数百GB大小的数据集要求在图像处理领域采用新的可扩展且内存高效的方法,这方面已经取得了一些进展。同时,就特定数据规模的可用性和相关地面真值数据的生成而言,对这些新方法进行定量评估是困难的。在本文中,我们提出了一个算法框架,可以用来高效地生成测试(和地面真值)体数据,甚至可以选择以流式方式生成。由于所提出的嵌套扫描(nested sweeps)算法速度快,因此可以按需生成测试数据。我们分析了该算法的渐近运行时间,并通过实验将其与其他方法以及假设的最佳情况基线方法进行了比较。在一个案例研究中,该框架被应用于流行的血管图像生成软件VascuSynth,使其能够高效地生成大于主存容量的图像,这通过生成万亿体素(1TB)图像得到了验证。该框架的实现以VascuSynth的修改版本和用于实验评估的代码的形式在线提供。此外,测试数据生成过程已集成到流行的体绘制和处理框架Voreen中。 摘要:Advances in 3D imaging technology in recent years have allowed for increasingly high resolution volumetric images of large specimen. The resulting datasets of hundreds of Gigabytes in size call for new scalable and memory efficient approaches in the field of image processing, where some progress has been made already. At the same time, quantitative evaluation of these new methods is difficult both in terms of the availability of specific data sizes and in the generation of associated ground truth data. In this paper we present an algorithmic framework that can be used to efficiently generate test (and ground truth) volume data, optionally even in a streaming fashion. As the proposed nested sweeps algorithm is fast, it can be used to generate test data on demand. We analyze the asymptotic run time of the presented algorithm and compare it experimentally to alternative approaches as well as a hypothetical best-case baseline method. In a case study, the framework is applied to the popular VascuSynth software for vascular image generation, making it capable of efficiently producing larger-than-main memory volumes which is demonstrated by generating a trillion voxel (1TB) image. Implementations of the presented framework are available online in the form of the modified version of Vascusynth and the code used for the experimental evaluation. In addition, the test data generation procedure has been integrated into the popular volume rendering and processing framework Voreen.

Attention注意力(3篇)

【1】 Contrastive Attention Network with Dense Field Estimation for Face Completion 标题:基于稠密场估计的对比注意力网络人脸补全 链接:https://arxiv.org/abs/2112.10310

作者:Xin Ma,Xiaoqiang Zhou,Huaibo Huang,Gengyun Jia,Zhenhua Chai,Xiaolin Wei 机构:School of Artificial Intelligence, University of Chinese Academy of Sciences, NLPR & CEBSIT & CRIPAC, CASIA, Vision Intelligence Department, Meituan, University of Science and Technology of China 备注:Accepted by Pattern Recognition 2021. arXiv admin note: substantial text overlap with arXiv:2010.15643 摘要:大多数现代人脸补全方法采用自动编码器或其变体来恢复人脸图像中缺失的区域。编码器通常用于学习功能强大的表示,这些表示在应对复杂学习任务的挑战中起着重要作用。具体地说,真实场景的人脸图像中经常出现各种遮挡物(掩模),形成复杂的模式,在COVID-19这一困难时期尤其如此。在这种复杂的情况下,编码器很难捕捉到如此强大的表示。为了应对这一挑战,我们提出了一种自监督孪生(Siamese)推理网络,以提高编码器的泛化性和鲁棒性。它可以从全分辨率图像中编码上下文语义,并获得更具判别性的表示。为了处理人脸图像的几何变化,将稠密对应场集成到网络中。我们进一步提出了一种多尺度解码器,该解码器具有一种新的双注意融合模块(DAF),可以自适应地将修复区域和已知区域结合起来。这种多尺度结构有利于解码器将从编码器学到的判别性表示应用于图像。大量实验表明,该方法不仅取得了比现有方法更具吸引力的结果,而且显著提高了戴口罩人脸识别的性能。 摘要:Most modern face completion approaches adopt an autoencoder or its variants to restore missing regions in face images. Encoders are often utilized to learn powerful representations that play an important role in meeting the challenges of sophisticated learning tasks. Specifically, various kinds of masks are often presented in face images in the wild, forming complex patterns, especially in this hard period of COVID-19. It's difficult for encoders to capture such powerful representations under this complex situation. To address this challenge, we propose a self-supervised Siamese inference network to improve the generalization and robustness of encoders. It can encode contextual semantics from full-resolution images and obtain more discriminative representations. To deal with geometric variations of face images, a dense correspondence field is integrated into the network. We further propose a multi-scale decoder with a novel dual attention fusion module (DAF), which can combine the restored and known regions in an adaptive manner. This multi-scale architecture is beneficial for the decoder to utilize discriminative representations learned from encoders into images. Extensive experiments clearly demonstrate that the proposed approach not only achieves more appealing results compared with state-of-the-art methods but also improves the performance of masked face recognition dramatically.
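
上文的双注意融合模块(DAF)核心思想是自适应地组合修复区域与已知区域。下面给出一个极简的示意性草图(假设性实现:真实模块为双路注意力且结构更复杂,这里仅用一张学习得到的空间注意力图演示"自适应融合"):

```python
import torch
import torch.nn as nn

class DualAttentionFusionSketch(nn.Module):
    """示意:由两路特征预测一张空间注意力图,再加权融合修复区域与已知区域。"""
    def __init__(self, ch: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(ch * 2, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, restored: torch.Tensor, known: torch.Tensor) -> torch.Tensor:
        a = self.attn(torch.cat([restored, known], dim=1))  # (B, 1, H, W) 注意力图
        return a * restored + (1 - a) * known
```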

【2】 Improving Face-Based Age Estimation with Attention-Based Dynamic Patch Fusion 标题:基于注意力的动态图像块融合改进人脸年龄估计 链接:https://arxiv.org/abs/2112.10167

作者:Haoyi Wang,Victor Sanchez,Chang-Tsun Li 备注:IEEE Transactions on Image Processing (accepted for publication) 摘要:随着卷积神经网络(CNN)的日益普及,最近的基于人脸的年龄估计工作将这些网络作为主干。然而,最先进的基于CNN的方法平等地对待每个面部区域,从而完全忽略了一些可能包含丰富年龄特定信息的面部图像块的重要性。在本文中,我们提出了一个基于人脸的年龄估计框架,称为基于注意力的动态图像块融合(ADPF)。在ADPF中,实现了两个独立的CNN,即AttentionNet和FusionNet。AttentionNet采用一种新的排序引导多头混合注意力(RMHHA)机制,动态定位年龄特定的图像块并对其排序。FusionNet使用发现的图像块和面部图像来预测受试者的年龄。由于所提出的RMHHA机制根据重要性对发现的图像块进行排序,FusionNet中每个图像块的学习路径长度与它所承载的信息量成正比(越长,越重要)。ADPF还引入了一种新的多样性损失来指导AttentionNet的训练,减少图像块之间的重叠,从而发现多样且重要的图像块。通过大量的实验,我们表明所提出的框架在多个年龄估计基准数据集上优于最先进的方法。 摘要:With the increasing popularity of convolutional neural networks (CNNs), recent works on face-based age estimation employ these networks as the backbone. However, state-of-the-art CNN-based methods treat each facial region equally, thus entirely ignoring the importance of some facial patches that may contain rich age-specific information. In this paper, we propose a face-based age estimation framework, called Attention-based Dynamic Patch Fusion (ADPF). In ADPF, two separate CNNs are implemented, namely the AttentionNet and the FusionNet. The AttentionNet dynamically locates and ranks age-specific patches by employing a novel Ranking-guided Multi-Head Hybrid Attention (RMHHA) mechanism. The FusionNet uses the discovered patches along with the facial image to predict the age of the subject. Since the proposed RMHHA mechanism ranks the discovered patches based on their importance, the length of the learning path of each patch in the FusionNet is proportional to the amount of information it carries (the longer, the more important). ADPF also introduces a novel diversity loss to guide the training of the AttentionNet and reduce the overlap among patches so that the diverse and important patches are discovered. Through extensive experiments, we show that our proposed framework outperforms state-of-the-art methods on several age estimation benchmark datasets.
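
摘要中的多样性损失旨在减少被发现图像块之间的重叠,其具体形式摘要未给出;下面是按"惩罚图像块两两IoU"这一假设写出的示意草图:

```python
import torch

def diversity_loss(boxes: torch.Tensor) -> torch.Tensor:
    """惩罚图像块两两之间的平均 IoU(假设性实现,非论文原始定义)。
    boxes: (N, 4),格式 (x1, y1, x2, y2)。"""
    n = boxes.size(0)
    x1 = torch.max(boxes[:, None, 0], boxes[None, :, 0])
    y1 = torch.max(boxes[:, None, 1], boxes[None, :, 1])
    x2 = torch.min(boxes[:, None, 2], boxes[None, :, 2])
    y2 = torch.min(boxes[:, None, 3], boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    iou = inter / union.clamp(min=1e-6)
    off_diag = iou - torch.diag_embed(torch.diagonal(iou))  # 去掉自身与自身的 IoU
    return off_diag.sum() / (n * (n - 1))
```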

【3】 A-ESRGAN: Training Real-World Blind Super-Resolution with Attention U-Net Discriminators 标题:A-ESRGAN:使用注意力U-Net鉴别器训练真实世界盲超分辨率 链接:https://arxiv.org/abs/2112.10046

作者:Zihao Wei,Yidong Huang,Yuang Chen,Chenhao Zheng,Jinnan Gao 机构: Department of Computer Science Engineering, University of Michigan, Ann Arbor, USA, UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China 备注:6 pages, 9 figures 摘要:盲图像超分辨率(SR)是计算机视觉中的一项长期任务,旨在恢复遭受未知和复杂失真的低分辨率图像。最近的工作主要集中在采用更复杂的退化模型来模拟真实世界的退化。由此产生的模型在感知损失方面取得了突破,并产生了令人信服的感知结果。然而,当前生成对抗网络结构带来的限制仍然很明显:平等地对待像素会导致忽略图像的结构特征,并导致性能缺陷,如线条扭曲以及背景过度锐化或模糊。在本文中,我们提出了一种用于盲SR任务的GAN模型A-ESRGAN,该模型具有一个基于注意力U-Net的多尺度鉴别器,可以与其他生成器无缝集成。据我们所知,这是首次引入注意力U-Net结构作为GAN的鉴别器来解决盲SR问题。本文还解释了给模型带来性能突破的多尺度注意力U-Net背后的机制。通过与先前工作的对比实验,我们的模型在无参考自然图像质量评估指标上呈现出最先进水平的性能。我们的消融研究表明,使用我们的鉴别器,基于RRDB的生成器可以在多个尺度上利用图像的结构特征,因此与以前的工作相比,可以产生感知上更逼真的高分辨率图像。 摘要:Blind image super-resolution(SR) is a long-standing task in CV that aims to restore low-resolution images suffering from unknown and complex distortions. Recent work has largely focused on adopting more complicated degradation models to emulate real-world degradations. The resulting models have made breakthroughs in perceptual loss and yield perceptually convincing results. However, the limitation brought by current generative adversarial network structures is still significant: treating pixels equally leads to the ignorance of the image's structural features, and results in performance drawbacks such as twisted lines and background over-sharpening or blurring. In this paper, we present A-ESRGAN, a GAN model for blind SR tasks featuring an attention U-Net based, multi-scale discriminator that can be seamlessly integrated with other generators. To our knowledge, this is the first work to introduce attention U-Net structure as the discriminator of GAN to solve blind SR problems. And the paper also gives an interpretation for the mechanism behind multi-scale attention U-Net that brings performance breakthrough to the model. Through comparison experiments with prior works, our model presents state-of-the-art level performance on the non-reference natural image quality evaluator metric. And our ablation studies have shown that with our discriminator, the RRDB based generator can leverage the structural features of an image in multiple scales, and consequently yields more perceptually realistic high-resolution images compared to prior works.

人脸|人群计数(6篇)

【1】 SelFSR: Self-Conditioned Face Super-Resolution in the Wild via Flow Field Degradation Network 标题:SelFSR:基于流场退化网络的野外自条件人脸超分辨率 链接:https://arxiv.org/abs/2112.10683

作者:Xianfang Zeng,Jiangning Zhang,Liang Liu,Guangzhong Tian,Yong Liu 机构: Ningbo Research Institute, Zhejiang University 摘要:尽管在基准数据集上取得了成功,但大多数先进的人脸超分辨率模型在真实场景中表现不佳,因为真实图像和合成训练对之间存在显著的领域差距。为了解决这个问题,我们提出了一种新的域自适应退化网络,用于野外(真实场景)人脸超分辨率。该退化网络同时预测一个流场和一幅中间低分辨率图像。然后,通过扭曲中间图像生成退化的对应图像。由于倾向于捕捉运动模糊,这种模型在保持原始图像和退化图像之间的身份一致性方面表现得更好。我们进一步为超分辨率网络提出了自条件块(self-conditioned block)。该块将输入图像作为条件项,有效利用人脸结构信息,消除了对显式先验(例如人脸关键点或边界)的依赖。我们的模型在CelebA和真实人脸数据集上都达到了最先进的性能。前者展示了所提架构强大的生成能力,而后者表明其在真实世界图像上具有出色的身份一致性和感知质量。 摘要:In spite of the success on benchmark datasets, most advanced face super-resolution models perform poorly in real scenarios since the remarkable domain gap between the real images and the synthesized training pairs. To tackle this problem, we propose a novel domain-adaptive degradation network for face super-resolution in the wild. This degradation network predicts a flow field along with an intermediate low resolution image. Then, the degraded counterpart is generated by warping the intermediate image. With the preference of capturing motion blur, such a model performs better at preserving identity consistency between the original images and the degraded. We further present the self-conditioned block for super-resolution network. This block takes the input image as a condition term to effectively utilize facial structure information, eliminating the reliance on explicit priors, e.g. facial landmarks or boundary. Our model achieves state-of-the-art performance on both CelebA and real-world face dataset. The former demonstrates the powerful generative ability of our proposed architecture while the latter shows great identity consistency and perceptual quality in real-world images.
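
"通过扭曲中间图像生成退化图像"这一步可以用标准的流场反向采样实现。下面是一个基于 PyTorch grid_sample 的示意草图(仅演示按流场扭曲,与论文实现细节无关):

```python
import torch
import torch.nn.functional as F

def warp_by_flow(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """按预测的流场对中间低分辨率图像进行扭曲(示意)。
    img: (B, C, H, W);flow: (B, 2, H, W),单位为像素。"""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img)   # (2, H, W) 基础坐标
    coords = grid[None] + flow                             # 加上位移
    # 归一化到 [-1, 1] 以供 grid_sample 使用
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(img, sample_grid, align_corners=True)
```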

【2】 Fusion and Orthogonal Projection for Improved Face-Voice Association 标题:改进的人脸-语音关联的融合和正交投影 链接:https://arxiv.org/abs/2112.10483

作者:Muhammad Saad Saeed,Muhammad Haris Khan,Shah Nawaz,Muhammad Haroon Yousaf,Alessio Del Bue 机构:University of Engineering and Technology Taxila,Muhammad Bin Zayed University of Artificial Intelligence, Pattern Analysis & Computer Vision (PAVIS) - Istituto Italiano di Tecnologia (IIT) 摘要:我们研究了人脸和语音之间的关联学习问题,这是最近计算机视觉界越来越感兴趣的问题。先前的工作采用成对(pairwise)或三元组(triplet)损失形式来学习适合相关匹配和验证任务的嵌入空间。尽管取得了一些进展,但由于依赖于与距离相关的间隔(margin)参数、训练的运行时复杂度高,以及依赖精心设计的负样本挖掘过程,此类损失形式仍具有局限性。在这项工作中,我们假设丰富的特征表示与有效且高效的监督相结合,是实现改进人脸-语音关联的判别性联合嵌入空间的必要条件。为此,我们提出了一种轻量级的即插即用机制,利用两种模态中的互补线索形成丰富的融合嵌入,并通过正交性约束基于身份标签对其进行聚类。我们将所提出的机制称为融合和正交投影(FOP),并在双流管道中实例化。整个框架在大规模VoxCeleb数据集上进行评估,涵盖包括跨模态验证和匹配在内的多种任务。结果表明,与目前最先进的方法相比,我们的方法表现良好,并且我们提出的监督形式比当代方法更有效、更高效。 摘要:We study the problem of learning association between face and voice, which is gaining interest in the computer vision community lately. Prior works adopt pairwise or triplet loss formulations to learn an embedding space amenable for associated matching and verification tasks. Albeit showing some progress, such loss formulations are, however, restrictive due to dependency on distance-dependent margin parameter, poor run-time training complexity, and reliance on carefully crafted negative mining procedures. In this work, we hypothesize that enriched feature representation coupled with an effective yet efficient supervision is necessary in realizing a discriminative joint embedding space for improved face-voice association. To this end, we propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them based on their identity labels via orthogonality constraints. We coin our proposed mechanism as fusion and orthogonal projection (FOP) and instantiate in a two-stream pipeline. The overall resulting framework is evaluated on a large-scale VoxCeleb dataset with a multitude of tasks, including cross-modal verification and matching. Results show that our method performs favourably against the current state-of-the-art methods and our proposed supervision formulation is more effective and efficient than the ones employed by the contemporary methods.
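
摘要中的正交性约束可以理解为:同一身份的融合嵌入相互靠近,不同身份的嵌入趋于正交。以下是按这一理解写出的假设性损失草图(非论文原始定义):

```python
import torch
import torch.nn.functional as F

def orthogonal_projection_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """示意:同一身份的嵌入余弦相似度趋近 1,不同身份趋近 0(相互正交)。
    embeddings: (N, D);labels: (N,)。"""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t()                                  # 余弦相似度矩阵
    same = (labels[:, None] == labels[None, :]).float()
    pos = (1.0 - sim) * same                         # 同类:拉近
    neg = sim.abs() * (1.0 - same)                   # 异类:推向正交
    return (pos.sum() + neg.sum()) / (labels.numel() ** 2)
```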

【3】 HVTR: Hybrid Volumetric-Textural Rendering for Human Avatars 标题:HVTR:用于人体化身的混合体积-纹理渲染 链接:https://arxiv.org/abs/2112.10203

作者:Tao Hu,Tao Yu,Zerong Zheng,He Zhang,Yebin Liu,Matthias Zwicker 机构:University of Maryland, College Park, Tsinghua University, Beihang University 备注:Project page: this https URL 摘要:我们提出了一种新的神经渲染管道——混合体积-纹理渲染(HVTR),它可以高效、高质量地从任意姿势合成虚拟人化身。首先,我们学习在人体表面的稠密UV流形上编码铰接式人体运动。为了处理复杂的运动(例如自遮挡),我们利用UV流形上的编码信息,构建基于动态姿势条件神经辐射场的3D体积表示。虽然这允许我们表示拓扑可变的三维几何,但体积渲染的计算量很大。因此,我们仅使用姿态条件下采样神经辐射场(PD-NeRF)的粗略体积表示,可以在低分辨率下高效渲染。此外,我们还学习了2D纹理特征,这些特征与图像空间中渲染出的体积特征相融合。我们方法的关键优势是,可以通过基于GAN的快速纹理渲染器将融合的特征转换为高分辨率、高质量的化身。我们证明了混合渲染使HVTR能够处理复杂的运动,在用户控制的姿势/形状下渲染高质量的化身(甚至包括宽松的衣物),而且最重要的是,推理速度很快。我们的实验结果也展示了最先进的定量结果。 摘要:We propose a novel neural rendering pipeline, Hybrid Volumetric-Textural Rendering (HVTR), which synthesizes virtual human avatars from arbitrary poses efficiently and at high quality. First, we learn to encode articulated human motions on a dense UV manifold of the human body surface. To handle complicated motions (e.g., self-occlusions), we then leverage the encoded information on the UV manifold to construct a 3D volumetric representation based on a dynamic pose-conditioned neural radiance field. While this allows us to represent 3D geometry with changing topology, volumetric rendering is computationally heavy. Hence we employ only a rough volumetric representation using a pose-conditioned downsampled neural radiance field (PD-NeRF), which we can render efficiently at low resolutions. In addition, we learn 2D textural features that are fused with rendered volumetric features in image space. The key advantage of our approach is that we can then convert the fused features into a high resolution, high-quality avatar by a fast GAN-based textural renderer. We demonstrate that hybrid rendering enables HVTR to handle complicated motions, render high-quality avatars under user-controlled poses/shapes and even loose clothing, and most importantly, be fast at inference time. Our experimental results also demonstrate state-of-the-art quantitative results.

【4】 Initiative Defense against Facial Manipulation 标题:针对人脸操纵的主动防御 链接:https://arxiv.org/abs/2112.10098

作者:Qidong Huang,Jie Zhang,Wenbo Zhou,Weiming Zhang,Nenghai Yu 机构:University of Science and Technology of China 摘要:得益于生成对抗网络(GAN)的发展,近年来人脸操纵在学术界和工业界都取得了重大进展。它激发了越来越多的娱乐应用,但同时也给个人隐私甚至政治安全带来了严重威胁。为了减轻这种风险,人们提出了许多对策。然而,绝大多数方法都是被动设计的,即检测人脸图像或视频在广泛传播后是否被篡改。这些基于检测的方法有一个致命的局限性:它们只适用于事后取证,而不能防止恶意行为的产生。针对这一局限性,本文提出了一种新的主动防御框架,以降低恶意用户控制的人脸操纵模型的性能。其基本思想是在操纵发生之前,主动向目标人脸数据中注入难以察觉的"毒液"。为此,我们首先用代理模型模仿目标操纵模型,然后设计毒液扰动生成器以获得所需的扰动。进一步利用交替训练策略训练代理模型和扰动生成器。在我们的主动防御框架中,我们考虑了两个典型的人脸操纵任务:人脸属性编辑和人脸重演(reenactment)。大量实验证明了我们的框架在不同设置下的有效性和鲁棒性。最后,我们希望这项工作能够为针对更具对抗性场景的主动对策提供一些启示。 摘要:Benefiting from the development of generative adversarial networks (GAN), facial manipulation has achieved significant progress in both academia and industry recently. It inspires an increasing number of entertainment applications but also incurs severe threats to individual privacy and even political security meanwhile. To mitigate such risks, many countermeasures have been proposed. However, the great majority methods are designed in a passive manner, which is to detect whether the facial images or videos are tampered after their wide propagation. These detection-based methods have a fatal limitation, that is, they only work for ex-post forensics but can not prevent the engendering of malicious behavior. To address the limitation, in this paper, we propose a novel framework of initiative defense to degrade the performance of facial manipulation models controlled by malicious users. The basic idea is to actively inject imperceptible venom into target facial data before manipulation. To this end, we first imitate the target manipulation model with a surrogate model, and then devise a poison perturbation generator to obtain the desired venom. An alternating training strategy are further leveraged to train both the surrogate model and the perturbation generator. Two typical facial manipulation tasks: face attribute editing and face reenactment, are considered in our initiative defense framework. Extensive experiments demonstrate the effectiveness and robustness of our framework in different settings. Finally, we hope this work can shed some light on initiative countermeasures against more adversarial scenarios.
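
摘要描述的交替训练可以概括为:一步让代理模型逼近目标操纵模型,一步让扰动生成器在不可察觉约束下破坏代理模型的输出。下面是一个示意性骨架(surrogate、generator、target_outputs 等接口均为假设):

```python
import torch

def alternating_step(surrogate, generator, imgs, target_outputs,
                     opt_s, opt_g, eps=8 / 255):
    """交替训练的一步(示意):先更新代理模型,再更新扰动生成器。"""
    # 1) 训练代理模型:让其输出逼近目标操纵模型的输出
    opt_s.zero_grad()
    loss_s = torch.nn.functional.l1_loss(surrogate(imgs), target_outputs)
    loss_s.backward()
    opt_s.step()

    # 2) 训练扰动生成器:在无穷范数约束下最大化代理模型的失真
    opt_g.zero_grad()
    delta = generator(imgs).clamp(-eps, eps)   # 难以察觉的"毒液"扰动
    loss_g = -torch.nn.functional.l1_loss(surrogate(imgs + delta), target_outputs)
    loss_g.backward()
    opt_g.step()
    return loss_s.item(), loss_g.item()
```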

【5】 Reasoning Structural Relation for Occlusion-Robust Facial Landmark Localization 标题:基于结构关系的遮挡鲁棒人脸标志点定位方法 链接:https://arxiv.org/abs/2112.10087

作者:Congcong Zhu,Xiaoqiang Li,Jide Li,Songmin Dai,Weiqin Tong 机构:School of Computer Engineering and Science, Shanghai University, Shanghai, China 备注:Accepted by Pattern recognition 摘要:在人脸地标定位任务中,由于人脸特征的部分可观测性,各种遮挡严重降低了定位精度。本文提出了一种用于遮挡鲁棒地标定位的结构关系网络(SRN)。与大多数仅利用形状约束的现有方法不同,所提出的SRN旨在捕获不同面部组件之间的结构关系。这些关系可以被视为针对遮挡的更强大的形状约束。为了实现这一点,设计了一个层次结构关系模块(HSRM),对表示长距离和短距离空间依赖关系的结构关系进行层次化推理。与现有的网络体系结构相比,HSRM利用其几何感知的网络体系结构,可以有效地建模空间关系,从而减少因遮挡而导致的语义歧义。此外,SRN通过合成遮挡人脸来增强训练数据。为了进一步将SRN扩展到遮挡视频数据,我们将遮挡人脸合成描述为一个马尔可夫决策过程(MDP)。具体地说,它基于与预训练SRN的性能退化相关联的累积奖励来规划动态遮挡的移动。该过程扩充困难样本,以实现鲁棒的人脸地标跟踪。大量的实验结果表明,所提出的方法在遮挡人脸和戴口罩人脸上都取得了出色的性能。代码可在https://github.com/zhuccly/SRN. 摘要:In facial landmark localization tasks, various occlusions heavily degrade the localization accuracy due to the partial observability of facial features. This paper proposes a structural relation network (SRN) for occlusion-robust landmark localization. Unlike most existing methods that simply exploit the shape constraint, the proposed SRN aims to capture the structural relations among different facial components. These relations can be considered a more powerful shape constraint against occlusion. To achieve this, a hierarchical structural relation module (HSRM) is designed to hierarchically reason the structural relations that represent both long- and short-distance spatial dependencies. Compared with existing network architectures, HSRM can efficiently model the spatial relations by leveraging its geometry-aware network architecture, which reduces the semantic ambiguity caused by occlusion. Moreover, the SRN augments the training data by synthesizing occluded faces. To further extend our SRN for occluded video data, we formulate the occluded face synthesis as a Markov decision process (MDP). Specifically, it plans the movement of the dynamic occlusion based on an accumulated reward associated with the performance degradation of the pre-trained SRN. This procedure augments hard samples for robust facial landmark tracking. Extensive experimental results indicate that the proposed method achieves outstanding performance on occluded and masked faces. Code is available at https://github.com/zhuccly/SRN.

【6】 A New Image Codec Paradigm for Human and Machine Uses 标题:一种面向人与机器使用的新型图像编解码范式 链接:https://arxiv.org/abs/2112.10071

作者:Sien Chen,Jian Jin,Lili Meng,Weisi Lin,Zhuo Chen,Tsui-Shan Chang,Zhengguang Li,Huaxiang Zhang 摘要:随着人工智能物联网(AIoT)的发展,我们的日常工作和生活中产生了大量的视觉数据,如图像和视频。这些视觉数据不仅用于人类的观看或理解,还用于机器分析或决策,例如智能监控、自动驾驶车辆和许多其他智慧城市应用。为此,本文提出了一种面向人类和机器使用的新型图像编解码范式。首先,利用神经网络提取高层实例分割图和低层信号特征。然后,实例分割图进一步用所提出的16位灰度表示法表示为一个轮廓(profile)。之后,16位灰度轮廓和信号特征都用无损编解码器编码。同时,设计并训练了一个图像预测器,利用16位灰度轮廓和信号特征实现一般质量的图像重建。最后,利用有损编解码器压缩原始图像与预测图像之间的残差图,用于高质量的图像重建。通过这样的设计,一方面可以实现可伸缩的图像压缩,以满足不同的人类消费需求;另一方面,利用解码后的16位灰度轮廓,可以直接在解码器端完成多个机器视觉任务,如目标分类、检测和分割。实验结果表明,所提出的编解码器在图像重建的PSNR和MS-SSIM指标上与大多数基于学习的编解码器相当,并优于传统编解码器(如BPG和JPEG2000)。同时,它在目标检测和分割的mAP指标上优于现有编解码器。 摘要:With the AI of Things (AIoT) development, a huge amount of visual data, e.g., images and videos, are produced in our daily work and life. These visual data are not only used for human viewing or understanding but also for machine analysis or decision-making, e.g., intelligent surveillance, automated vehicles, and many other smart city applications. To this end, a new image codec paradigm for both human and machine uses is proposed in this work. Firstly, the high-level instance segmentation map and the low-level signal features are extracted with neural networks. Then, the instance segmentation map is further represented as a profile with the proposed 16-bit gray-scale representation. After that, both 16-bit gray-scale profile and signal features are encoded with a lossless codec. Meanwhile, an image predictor is designed and trained to achieve the general-quality image reconstruction with the 16-bit gray-scale profile and signal features. Finally, the residual map between the original image and the predicted one is compressed with a lossy codec, used for high-quality image reconstruction. With such designs, on the one hand, we can achieve scalable image compression to meet the requirements of different human consumption; on the other hand, we can directly achieve several machine vision tasks at the decoder side with the decoded 16-bit gray-scale profile, e.g., object classification, detection, and segmentation. Experimental results show that the proposed codec achieves comparable results as most learning-based codecs and outperforms the traditional codecs (e.g., BPG and JPEG2000) in terms of PSNR and MS-SSIM for image reconstruction. At the same time, it outperforms the existing codecs in terms of the mAP for object detection and segmentation.
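
该编解码范式的分层结构(无损基础层 + 有损残差增强层)可以用如下伪流水线示意(所有模块接口均为假设,仅说明"预测 + 残差"的可伸缩设计):

```python
import numpy as np

def encode(image, seg_net, feat_net, lossless_codec, lossy_codec, predictor):
    """示意性编码流程:基础码流服务机器视觉与一般质量重建,增强码流服务高质量重建。"""
    profile = seg_net(image)             # 16 位灰度轮廓(实例分割图的紧凑表示)
    feats = feat_net(image)              # 低层信号特征
    base_stream = lossless_codec.encode(profile, feats)
    predicted = predictor(profile, feats)                 # 一般质量的预测图像
    residual = image.astype(np.int16) - predicted.astype(np.int16)
    enhance_stream = lossy_codec.encode(residual)         # 高质量重建的增强层
    return base_stream, enhance_stream
```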

跟踪(1篇)

【1】 Direct simple computation of middle surface between 3D point clouds and/or discrete surfaces by tracking sources in distance function calculation algorithms 标题:在距离函数计算算法中通过跟踪源直接计算三维点云和/或离散曲面之间的中面 链接:https://arxiv.org/abs/2112.09808

作者:Balazs Kosa,Karol Mikula 机构:the date of receipt and acceptance should be inserted later 摘要:在本文中,我们介绍了计算各种三维数据集(如点云和/或离散曲面)之间中面的新方法。传统上,中面是通过检测所计算距离函数中的奇异结构(如脊线、三联结等)来获得的。这需要计算二阶微分特性,并且必须应用一些启发式方法。与此相反,我们仅通过计算距离函数本身来确定中面,这是一种快速而简单的方法。我们给出并比较了快速扫描法、矢量距离变换法、快速行进法和Dijkstra-Pythagoras法在寻找三维数据集之间中面方面的结果。 摘要:In this paper, we introduce novel methods for computing middle surfaces between various 3D data sets such as point clouds and/or discrete surfaces. Traditionally the middle surface is obtained by detecting singularities in computed distance function such as ridges, triple junctions, etc. It requires to compute second order differential characteristics and also some kinds of heuristics must be applied. Opposite to that, we determine the middle surface just from computing the distance function itself which is a fast and simple approach. We present and compare the results of the fast sweeping method, the vector distance transform algorithm, the fast marching method, and the Dijkstra-Pythagoras method in finding the middle surface between 3D data sets.
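
"仅通过距离函数本身确定中面"的一个直观离散近似是:取到两组数据的距离场近似相等的体素。下面用 SciPy 的欧氏距离变换给出示意(与论文的源点跟踪实现不同,仅演示思想):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def middle_surface(mask_a: np.ndarray, mask_b: np.ndarray, tol: float = 0.75) -> np.ndarray:
    """在体素网格上取两组数据距离场近似相等的点作为中面的离散近似。
    mask_a / mask_b: 布尔体数据,True 表示属于该数据集的体素。"""
    d_a = distance_transform_edt(~mask_a)   # 每个体素到数据集 A 的欧氏距离
    d_b = distance_transform_edt(~mask_b)   # 每个体素到数据集 B 的欧氏距离
    return np.abs(d_a - d_b) <= tol          # 等距带即中面
```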

图像视频检索|Re-id相关(1篇)

【1】 Learning with Label Noise for Image Retrieval by Selecting Interactions 标题:通过选择交互实现带标签噪声学习的图像检索 链接:https://arxiv.org/abs/2112.10453

作者:Sarah Ibrahimi,Arnaud Sors,Rafael Sampaio de Rezende,Stéphane Clinchant 机构: University of Amsterdam, NAVER LABS Europe 摘要:带噪标签学习是图像分类领域的一个活跃研究方向。然而,噪声标签对图像检索的影响研究较少。在这项工作中,我们提出了一种抗噪声的图像检索方法,名为基于教师的交互选择(T-SINT)。它识别有噪声的交互(即距离矩阵中的元素),并借助有助于稳定训练的基于教师的训练设置,选择应计入检索损失的正确正、负交互。因此,在带有合成噪声和更接近真实噪声的基准数据集上,该方法在高噪声率下始终优于最先进的方法。 摘要:Learning with noisy labels is an active research area for image classification. However, the effect of noisy labels on image retrieval has been less studied. In this work, we propose a noise-resistant method for image retrieval named Teacher-based Selection of Interactions, T-SINT, which identifies noisy interactions, ie. elements in the distance matrix, and selects correct positive and negative interactions to be considered in the retrieval loss by using a teacher-based training setup which contributes to the stability. As a result, it consistently outperforms state-of-the-art methods on high noise rates across benchmark datasets with synthetic noise and more realistic noise.

裁剪|量化|加速|压缩相关(1篇)

【1】 Elastic-Link for Binarized Neural Network 标题:二值化神经网络的弹性连接 链接:https://arxiv.org/abs/2112.10149

作者:Jie Hu,Wu Ziheng,Vince Tan,Zhilin Lu,Mengze Zeng,Enhua Wu 机构: State Key Lab of Computer Science, ISCAS & University of Chinese Academy of Sciences, Department of Electronic Engineering, Tsinghua University, Alibaba Group, ByteDance Inc., University of Macau 备注:AAAI2022 摘要:最近的研究表明,二值化神经网络(BNN)能够极大地降低计算成本和内存占用,便于在资源受限的设备上部署模型。然而,与全精度网络相比,BNN的精度严重下降。到目前为止,旨在缩小这种精度差距的研究主要集中在具有很少或没有1x1卷积层的特定网络架构上,对于这些架构,标准的二值化方法无法很好地工作。由于1x1卷积在现代体系结构(如GoogleNet、ResNet、DenseNet)的设计中很常见,因此开发一种有效地对其进行二值化的方法对于BNN得到更广泛的采用至关重要。在这项工作中,我们提出了一个"弹性链接"(EL)模块,通过自适应地将实值输入特征添加到后续卷积输出特征来丰富BNN中的信息流。所提出的EL模块易于实现,可与BNN的其他方法结合使用。我们证明,在BNN中添加EL可以在具有挑战性的大规模ImageNet数据集上带来显著提升。例如,我们将二值化ResNet26的top-1精度从57.9%提高到64.0%。EL也有助于二值化MobileNet训练的收敛,其top-1精度达到56.4%。最后,与ReActNet集成后,它取得了71.9% top-1精度这一新的最先进结果。 摘要:Recent work has shown that Binarized Neural Networks (BNNs) are able to greatly reduce computational costs and memory footprints, facilitating model deployment on resource-constrained devices. However, in comparison to their full-precision counterparts, BNNs suffer from severe accuracy degradation. Research aiming to reduce this accuracy gap has thus far largely focused on specific network architectures with few or no 1x1 convolutional layers, for which standard binarization methods do not work well. Because 1x1 convolutions are common in the design of modern architectures (e.g. GoogleNet, ResNet, DenseNet), it is crucial to develop a method to binarize them effectively for BNNs to be more widely adopted. In this work, we propose an "Elastic-Link" (EL) module to enrich information flow within a BNN by adaptively adding real-valued input features to the subsequent convolutional output features. The proposed EL module is easily implemented and can be used in conjunction with other methods for BNNs. We demonstrate that adding EL to BNNs produces a significant improvement on the challenging large-scale ImageNet dataset. For example, we raise the top-1 accuracy of binarized ResNet26 from 57.9% to 64.0%. EL also aids convergence in the training of binarized MobileNet, for which a top-1 accuracy of 56.4% is achieved. Finally, with the integration of ReActNet, it yields a new state-of-the-art result of 71.9% top-1 accuracy.
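
按摘要的描述,EL 模块"自适应地将实值输入特征添加到后续卷积输出特征"。下面是按这一描述写出的示意实现(融合系数与捷径设计均为假设;binary_conv 代表任意二值化卷积层):

```python
import torch
import torch.nn as nn

class ElasticLink(nn.Module):
    """EL 模块的示意实现:把实值输入特征自适应地加回二值化卷积的输出。"""
    def __init__(self, binary_conv: nn.Module, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.binary_conv = binary_conv
        # 实值捷径:必要时用 1x1 实值卷积匹配通道数与分辨率
        self.shortcut = (nn.Identity() if in_ch == out_ch and stride == 1
                         else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))
        self.alpha = nn.Parameter(torch.zeros(1))   # 可学习的自适应融合系数

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.binary_conv(x) + self.alpha * self.shortcut(x)
```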

表征学习(1篇)

【1】 Implicit Neural Representation Learning for Hyperspectral Image Super-Resolution 标题:高光谱图像超分辨率的隐式神经表示学习 链接:https://arxiv.org/abs/2112.10541

作者:Kaiwei Zhang 机构: Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University 摘要:高光谱图像(HSI)由于其高维光谱模式,在没有附加辅助图像的情况下实现超分辨率仍然是一个持续的挑战,其中学习有效的空间和光谱表示是一个基本问题。近年来,隐式神经表征(INRs)作为一种新颖而有效的表征方法正在取得长足的进展,尤其是在重建任务中。因此,在这项工作中,我们提出了一种新的基于INR的HSI重建模型,该模型用一个将空间坐标映射到相应光谱辐射值的连续函数来表示HSI。特别是,作为INR的具体实现,参数模型的参数由一个超网络预测,该超网络使用卷积网络进行特征提取。这使得连续函数能以内容感知的方式将空间坐标映射到像素值。此外,周期性空间编码与重建过程深度结合,使我们的模型能够恢复更多的高频细节。为了验证模型的有效性,我们在三个HSI数据集(CAVE、NUS和NTIRE2018)上进行了实验。实验结果表明,与最先进的方法相比,所提出的模型可以获得有竞争力的重建性能。此外,我们还提供了模型各组成部分影响的消融研究。希望本文能为今后的研究提供有力的参考。 摘要:Hyperspectral image (HSI) super-resolution without additional auxiliary image remains a constant challenge due to its high-dimensional spectral patterns, where learning an effective spatial and spectral representation is a fundamental issue. Recently, Implicit Neural Representations (INRs) are making strides as a novel and effective representation, especially in the reconstruction task. Therefore, in this work, we propose a novel HSI reconstruction model based on INR which represents HSI by a continuous function mapping a spatial coordinate to its corresponding spectral radiance values. In particular, as a specific implementation of INR, the parameters of parametric model are predicted by a hypernetwork that operates on feature extraction using convolution network. It makes the continuous functions map the spatial coordinates to pixel values in a content-aware manner. Moreover, periodic spatial encoding are deeply integrated with the reconstruction procedure, which makes our model capable of recovering more high frequency details. To verify the efficacy of our model, we conduct experiments on three HSI datasets (CAVE, NUS, and NTIRE2018). Experimental results show that the proposed model can achieve competitive reconstruction performance in comparison with the state-of-the-art methods. In addition, we provide an ablation study on the effect of individual components of our model. We hope this paper could serve as a potent reference for future research.
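
坐标到光谱的映射加上周期性空间编码,可以浓缩为下面的示意草图(真实模型的 MLP 权重由超网络按图像内容预测,此处简化为普通可学习参数;频率数、通道数均为假设):

```python
import torch
import torch.nn as nn

def periodic_encoding(coords: torch.Tensor, num_freqs: int = 6) -> torch.Tensor:
    """周期性空间编码:coords (N, 2) 归一化空间坐标 -> 多频 sin/cos 编码。"""
    freqs = 2.0 ** torch.arange(num_freqs, device=coords.device) * torch.pi
    x = coords[..., None] * freqs                       # (N, 2, F)
    return torch.cat([torch.sin(x), torch.cos(x)], dim=-1).flatten(1)  # (N, 4F)

class SpectralINR(nn.Module):
    """把空间坐标映射为该像素的整条光谱辐射值(示意)。"""
    def __init__(self, num_bands: int = 31, num_freqs: int = 6, hidden: int = 256):
        super().__init__()
        in_dim = 2 * 2 * num_freqs
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_bands))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.mlp(periodic_encoding(coords))
```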

蒸馏|知识提取(2篇)

【1】 MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding 标题:MuMuQA:基于跨媒体知识抽取和接地的多媒体多跳新闻问答 链接:https://arxiv.org/abs/2112.10728

作者:Revanth Gangi Reddy,Xilin Rui,Manling Li,Xudong Lin,Haoyang Wen,Jaemin Cho,Lifu Huang,Mohit Bansal,Avirup Sil,Shih-Fu Chang,Alexander Schwing,Heng Ji 机构: University of Illinois Urbana-Champaign, Tsinghua University, Columbia University, University of North Carolina at Chapel Hill, Virginia Tech, IBM Research AI 备注:To be presented at AAAI 2022 摘要:最近,人们对构建可以跨多种模态(例如文本和图像)进行推理的问答(QA)模型越来越感兴趣。然而,使用图像的QA通常仅限于从预定义的选项集中选择答案。此外,现实世界中的图像,特别是新闻中的图像,包含与文本互指(co-referential)的对象,两种模态的信息互为补充。在本文中,我们提出了一个新的QA评估基准,针对新闻文章提出1384个问题,这些问题需要将图像中的对象跨媒体定位(ground)到文本中。具体来说,该任务涉及多跳问题,需要对图像-标题对进行推理,以识别所指的、在图像中定位的视觉对象,然后从新闻正文中预测回答问题的文本跨度。此外,我们还介绍了一种基于跨媒体知识提取和合成问答对生成的新型多媒体数据扩充框架,以自动扩充能够为该任务提供弱监督的数据。我们在我们的基准上评估了基于管道和基于端到端预训练的多媒体QA模型,并表明它们实现了有希望的性能,同时大大落后于人类水平,因此为这项具有挑战性的新任务的未来工作留有很大的空间。 摘要:Recently, there has been an increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to just picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, have objects that are co-referential to the text, with complementary information from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark, and show that they achieve promising performance, while considerably lagging behind human performance hence leaving large room for future work on this challenging new task.

【2】 Controlling the Quality of Distillation in Response-Based Network Compression 标题:基于响应的网络压缩中的蒸馏质量控制 链接:https://arxiv.org/abs/2112.10047

作者:Vibhas Vats,David Crandall 机构:Indiana University Bloomington 备注:AAAI22-Workshop: 1st International Workshop on Practical Deep Learning in the Wild 摘要:基于蒸馏的压缩网络的性能取决于蒸馏的质量。大网络(教师)向小网络(学生)的蒸馏效果欠佳,主要归因于给定师生对学习能力的差距。虽然很难蒸馏出教师的全部知识,但蒸馏的质量可以在很大程度上得到控制,以获得更好的性能。我们的实验表明,蒸馏的质量在很大程度上取决于教师响应的质量,而教师响应的质量又受到其响应中相似性信息的严重影响。一个训练有素的大容量教师在学习细粒度判别属性进行分类的过程中会丢失类间的相似性信息。相似性信息的缺失,使蒸馏过程从"一个样本对多个类"的学习退化为"一个样本对一个类"的学习,从而限制了来自教师的多样化知识的流动。基于"只有灌输进去的知识才能被蒸馏出来"这一隐含假设,我们不再只关注知识蒸馏过程,而是审视知识灌输过程。我们认为,对于给定的师生对,可以通过在训练教师的同时找到批大小(batch size)与训练轮数(epoch)之间的最佳平衡点来提高蒸馏质量。我们讨论了找到这一最佳蒸馏点的步骤。我们还提出了蒸馏假设,以区分蒸馏过程在知识蒸馏和正则化效应之间的行为差异。我们在三个不同的数据集上进行所有实验。 摘要:The performance of a distillation-based compressed network is governed by the quality of distillation. The reason for the suboptimal distillation of a large network (teacher) to a smaller network (student) is largely attributed to the gap in the learning capacities of given teacher-student pair. While it is hard to distill all the knowledge of a teacher, the quality of distillation can be controlled to a large extent to achieve better performance. Our experiments show that the quality of distillation is largely governed by the quality of teacher's response, which in turn is heavily affected by the presence of similarity information in its response. A well-trained large capacity teacher loses similarity information between classes in the process of learning fine-grained discriminative properties for classification. The absence of similarity information causes the distillation process to be reduced from one example-many class learning to one example-one class learning, thereby throttling the flow of diverse knowledge from the teacher. With the implicit assumption that only the instilled knowledge can be distilled, instead of focusing only on the knowledge distilling process, we scrutinize the knowledge inculcation process. We argue that for a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher. We discuss the steps to find this sweet spot for better distillation. We also propose the distillation hypothesis to differentiate the behavior of the distillation process between knowledge distillation and regularization effect. We conduct all our experiments on three different datasets.
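
作为背景,基于响应的蒸馏通常用带温度的软目标传递教师的类间相似性信息——这正是文中"教师响应质量"所指的关键量。标准的 KD 损失如下(这是通用公式,非该论文特有):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 4.0):
    """基于响应的知识蒸馏损失:温度 T 软化教师分布,
    温度越高,教师响应中保留的类间相似性信息越多。"""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
```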

点云|SLAM|雷达|激光|深度RGBD相关(2篇)

【1】 DeepUME: Learning the Universal Manifold Embedding for Robust Point Cloud Registration 标题:DeepUME:学习用于鲁棒点云配准的通用流形嵌入 链接:https://arxiv.org/abs/2112.09938

作者:Natalie Lang,Joseph M. Francos 机构:Ben-Gurion University, Beer-Sheva, Israel 备注:BMVC 2021 摘要:刚性变换相关点云的配准是计算机视觉的基本问题之一。然而,对于在存在噪声的情况下对齐稀疏且采样方式不同的观测值这一实际场景,仍然缺乏解决方案。在这种情况下,我们将闭式通用流形嵌入(UME)方法与深度神经网络相融合来完成配准。这两者被组合成一个统一的框架,名为DeepUME,以无监督的方式进行端到端训练。为了在存在大变换的情况下成功地提供全局解,我们使用SO(3)不变坐标系来同时学习点云的联合重采样策略和SO(3)不变特征。然后,几何UME方法利用这些特征进行变换估计。DeepUME的参数使用一个专门设计的度量进行优化,以克服在噪声场景下对称形状配准中出现的歧义问题。我们证明了我们的混合方法在各种情况下都优于最先进的配准方法,并且可以很好地推广到未见过的数据集。我们的代码是公开的。 摘要:Registration of point clouds related by rigid transformations is one of the fundamental problems in computer vision. However, a solution to the practical scenario of aligning sparsely and differently sampled observations in the presence of noise is still lacking. We approach registration in this scenario with a fusion of the closed-form Universal Manifold Embedding (UME) method and a deep neural network. The two are combined into a single unified framework, named DeepUME, trained end-to-end and in an unsupervised manner. To successfully provide a global solution in the presence of large transformations, we employ an SO(3)-invariant coordinate system to learn both a joint-resampling strategy of the point clouds and SO(3)-invariant features. These features are then utilized by the geometric UME method for transformation estimation. The parameters of DeepUME are optimized using a metric designed to overcome an ambiguity problem emerging in the registration of symmetric shapes, when noisy scenarios are considered. We show that our hybrid method outperforms state-of-the-art registration methods in various scenarios, and generalizes well to unseen data sets. Our code is publicly available.

【2】 Fast and Robust Registration of Partially Overlapping Point Clouds 标题:快速稳健的部分重叠点云配准 链接:https://arxiv.org/abs/2112.09922

作者:Eduardo Arnold,Sajjad Mozaffari,Mehrdad Dianati 机构: University of Warwick 备注:Accepted at IEEE Robotics and Automation Letters (RA-L). 8 pages, 6 figures, 3 tables 摘要:部分重叠点云的实时配准在自动驾驶车辆的协作感知和多智能体SLAM中有着新兴的应用。在这些应用中,点云之间的相对平移比传统SLAM和里程计应用中的更大,这给对应关系的识别和成功配准带来了挑战。在本文中,我们提出了一种新的部分重叠点云配准方法:对应关系使用高效的逐点特征编码器学习,并使用基于图的注意力网络进行细化。该注意力网络利用关键点之间的几何关系来改善低重叠点云中的匹配。推理时,通过样本一致性对对应关系进行稳健拟合,得到相对位姿变换。评估在KITTI数据集和一个新的合成数据集上进行,后者包括位移高达30m的低重叠点云。所提出的方法在KITTI数据集上实现了与最先进方法相当的性能,并且在低重叠点云上优于现有方法。此外,所提出的方法实现了显著更快的推理时间,低至410ms,比竞争方法快5到35倍。我们的代码和数据集可在https://github.com/eduardohenriquearnold/fastreg. 摘要:Real-time registration of partially overlapping point clouds has emerging applications in cooperative perception for autonomous vehicles and multi-agent SLAM. The relative translation between point clouds in these applications is higher than in traditional SLAM and odometry applications, which challenges the identification of correspondences and a successful registration. In this paper, we propose a novel registration method for partially overlapping point clouds where correspondences are learned using an efficient point-wise feature encoder, and refined using a graph-based attention network. This attention network exploits geometrical relationships between key points to improve the matching in point clouds with low overlap. At inference time, the relative pose transformation is obtained by robustly fitting the correspondences through sample consensus. The evaluation is performed on the KITTI dataset and a novel synthetic dataset including low-overlapping point clouds with displacements of up to 30m. The proposed method achieves on-par performance with state-of-the-art methods on the KITTI dataset, and outperforms existing methods for low overlapping point clouds. Additionally, the proposed method achieves significantly faster inference times, as low as 410ms, between 5 and 35 times faster than competing methods. Our code and dataset are available at https://github.com/eduardohenriquearnold/fastreg.
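
"推理时通过样本一致性对对应关系进行稳健拟合"通常指 RANSAC 式采样加闭式刚体求解。下面给出一个通用示意(Kabsch/SVD 求解 + 随机采样一致性;与论文具体实现无关):

```python
import numpy as np

def kabsch(src: np.ndarray, dst: np.ndarray):
    """由对应点闭式求解刚体变换 (R, t),使 R @ src_i + t ≈ dst_i。"""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # 处理反射
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def ransac_pose(src, dst, iters=1000, thresh=0.1, seed=0):
    """对(可能含外点的)对应关系做样本一致性拟合,返回内点最多的位姿。"""
    rng = np.random.default_rng(seed)
    best_R, best_t, best_inl = np.eye(3), np.zeros(3), -1
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)  # 刚体变换的最小样本
        R, t = kabsch(src[idx], dst[idx])
        err = np.linalg.norm((src @ R.T + t) - dst, axis=1)
        inl = int((err < thresh).sum())
        if inl > best_inl:
            best_inl, best_R, best_t = inl, R, t
    mask = np.linalg.norm((src @ best_R.T + best_t) - dst, axis=1) < thresh
    if mask.sum() >= 3:
        best_R, best_t = kabsch(src[mask], dst[mask])  # 用全部内点再精化一次
    return best_R, best_t
```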

多模态(1篇)

【1】 Multimodal Adversarially Learned Inference with Factorized Discriminators 标题:基于因子分解鉴别器的多模态对抗性学习推理 链接:https://arxiv.org/abs/2112.10384

作者:Wenxue Chen,Jianke Zhu 机构: Zhejiang University, Alibaba-Zhejiang University Joint Institute of Frontier Technologies 备注:9 pages, 6 figures 摘要:从多模态数据中学习是机器学习的一个重要研究课题,它有可能获得更好的表示。在这项工作中,我们提出了一种基于生成对抗网络的多模态数据生成建模新方法。为了学习一个一致的多模态生成模型,我们证明有必要同时将不同的编码器分布与联合解码器分布对齐。为此,我们构造了一种特定形式的鉴别器,使我们的模型能够高效地利用数据,并且可以以对比方式训练。通过对鉴别器进行因子分解并利用对比学习,我们在单模态数据上训练了模型。我们在基准数据集上进行了实验,结果表明我们提出的方法在多种指标上都优于最先进的方法。源代码将公开提供。 摘要:Learning from multimodal data is an important research topic in machine learning, which has the potential to obtain better representations. In this work, we propose a novel approach to generative modeling of multimodal data based on generative adversarial networks. To learn a coherent multimodal generative model, we show that it is necessary to align different encoder distributions with the joint decoder distribution simultaneously. To this end, we construct a specific form of the discriminator to enable our model to utilize data efficiently, which can be trained contrastively. By taking advantage of contrastive learning through factorizing the discriminator, we train our model on unimodal data. We have conducted experiments on the benchmark datasets, whose promising results show that our proposed approach outperforms the state-of-the-art methods on a variety of metrics. The source code will be made publicly available.

3D|3D重建等相关(3篇)

【1】 ScanQA: 3D Question Answering for Spatial Scene Understanding 标题:ScanQA:面向空间场景理解的3D问答 链接:https://arxiv.org/abs/2112.10482

作者:Daichi Azuma,Taiki Miyanishi,Shuhei Kurita,Motoki Kawanabe 机构:Kyoto University, ATR, RIKEN AIP, RIKEN, JST PRESTO 备注:11 pages for the main paper and 9 pages for the supplementary material. 10 figures 摘要:我们提出了一种新的三维空间理解任务——三维问答(3D-QA)。在3D-QA任务中,模型从内容丰富的RGB-D室内扫描的整个3D场景接收视觉信息,并回答关于该3D场景的给定文本问题。与VQA的2D问答不同,传统的2D-QA模型在物体对齐和方向的空间理解上存在不足,且无法根据3D-QA中的文本问题进行对象定位。我们提出了一个3D-QA的基线模型,名为ScanQA模型,该模型从3D对象候选和编码的句子嵌入中学习融合描述符。该学习到的描述符将语言表达与三维扫描的底层几何特征相关联,并有助于回归3D边界框,以确定文本问题中描述的对象。我们收集了带有自由形式答案的人工编辑问答对,这些答案与每个3D场景中的3D对象相关联(grounded)。我们新的ScanQA数据集包含来自ScanNet数据集800个室内场景的超过41K个问答对。据我们所知,ScanQA是第一个在3D环境中执行基于对象定位的问答的大规模尝试。 摘要:We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA). In the 3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D indoor scan and answer the given textual questions about the 3D scene. Unlike the 2D-question answering of VQA, the conventional 2D-QA models suffer from problems with spatial understanding of object alignment and directions and fail the object localization from the textual questions in 3D-QA. We propose a baseline model for 3D-QA, named ScanQA model, where the model learns a fused descriptor from 3D object proposals and encoded sentence embeddings. This learned descriptor correlates the language expressions with the underlying geometric features of the 3D scan and facilitates the regression of 3D bounding boxes to determine described objects in textual questions. We collected human-edited question-answer pairs with free-form answers that are grounded to 3D objects in each 3D scene. Our new ScanQA dataset contains over 41K question-answer pairs from the 800 indoor scenes drawn from the ScanNet dataset. To the best of our knowledge, ScanQA is the first large-scale effort to perform object-grounded question-answering in 3D environments.

【2】 GPU optimization of the 3D Scale-invariant Feature Transform Algorithm and a Novel BRIEF-inspired 3D Fast Descriptor 标题:三维尺度不变特征变换算法的GPU优化及一种受BRIEF启发的新型三维快速描述子 链接:https://arxiv.org/abs/2112.10258

作者:Jean-Baptiste Carluer,Laurent Chauvin,Jie Luo,William M. Wells III,Ines Machado,Rola Harmouche,Matthew Toews 机构:École de technologie supérieure ÉTS, Montreal, Canada, Université de Nantes, Nantes, France, Brigham and Women's Hospital, Harvard Medical School, Boston, U.S.A., Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal 摘要:本文详细介绍了三维尺度不变特征变换(SIFT)算法的一种高效实现,用于从大量医学图像数据中进行机器学习。3D SIFT代码的主要操作在图形处理单元(GPU)上实现,包括卷积、子采样和尺度空间金字塔的4D峰值检测。使用不同人群的3D MRI人脑体数据,在关键点检测和图像间匹配实验中量化了性能改进。基于二进制鲁棒独立基本特征(BRIEF)编码,提出了计算高效的3D关键点描述符,包括我们称之为排序鲁棒独立基本特征(Ranked Robust Independent Elementary Features, RRIEF)的新描述符,并与原始3D SIFT-Rank方法(Toews等人,2013)进行了比较。GPU实现比优化的CPU实现加速约7倍,对于大小为(145, 174, 145)体素的3D体数据(约3000个关键点),计算时间从1.4秒减少到0.2秒。显著的加速包括卷积运算(20X)、4D峰值检测(3X)、子采样(3X)和高斯差分金字塔构建(2X)。与标准SIFT-Rank描述符相比,高效描述符提供了2倍的加速和6倍的内存节省,但代价是关键点对应数量减少,从而揭示了计算效率和算法性能之间的权衡。通过我们的实现获得的加速将允许对更大的数据集进行更高效的分析。我们优化的3D SIFT-Rank提取器的GPU实现可在https://github.com/CarluerJB/3D_SIFT_CUDA. 摘要:This work details a highly efficient implementation of the 3D scale-invariant feature transform (SIFT) algorithm, for the purpose of machine learning from large sets of volumetric medical image data. The primary operations of the 3D SIFT code are implemented on a graphics processing unit (GPU), including convolution, sub-sampling, and 4D peak detection from scale-space pyramids. The performance improvements are quantified in keypoint detection and image-to-image matching experiments, using 3D MRI human brain volumes of different people. Computationally efficient 3D keypoint descriptors are proposed based on the Binary Robust Independent Elementary Feature (BRIEF) code, including a novel descriptor we call Ranked Robust Independent Elementary Features (RRIEF), and compared to the original 3D SIFT-Rank method (Toews et al., 2013). The GPU implementation affords a speedup of approximately 7X beyond an optimised CPU implementation, where computation time is reduced from 1.4 seconds to 0.2 seconds for 3D volumes of size (145, 174, 145) voxels with approximately 3000 keypoints. Notable speedups include the convolution operation (20X), 4D peak detection (3X), sub-sampling (3X), and difference-of-Gaussian pyramid construction (2X). Efficient descriptors offer a speedup of 2X and a memory savings of 6X compared to standard SIFT-Rank descriptors, at a cost of reduced numbers of keypoint correspondences, revealing a trade-off between computational efficiency and algorithmic performance. The speedups gained by our implementation will allow for a more efficient analysis on larger data sets. Our optimized GPU implementation of the 3D SIFT-Rank extractor is available at https://github.com/CarluerJB/3D_SIFT_CUDA.
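
BRIEF 类描述子的核心只是邻域内的成对强度比较,推广到三维体数据的一个最小示意如下(偏移对的采样方式、坐标顺序等均为假设,且假定关键点离体数据边界足够远):

```python
import numpy as np

def brief3d(volume: np.ndarray, keypoint, pairs: np.ndarray) -> np.ndarray:
    """受 BRIEF 启发的三维二值描述子(示意):在关键点邻域内做成对强度比较。
    volume: 3D 数组;keypoint: (z, y, x);pairs: (K, 2, 3) 预先随机采样的整型偏移对。
    返回 K 位二值串(0/1 数组)。"""
    z, y, x = np.round(keypoint).astype(int)
    a = volume[z + pairs[:, 0, 0], y + pairs[:, 0, 1], x + pairs[:, 0, 2]]
    b = volume[z + pairs[:, 1, 0], y + pairs[:, 1, 1], x + pairs[:, 1, 2]]
    return (a < b).astype(np.uint8)

# 用法示意:pairs 只需生成一次并在所有关键点间共享
# rng = np.random.default_rng(0)
# pairs = rng.integers(-8, 9, size=(256, 2, 3))
# desc = brief3d(volume, keypoint=(72, 87, 72), pairs=pairs)
```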

【3】 3D Structural Analysis of the Optic Nerve Head to Robustly Discriminate Between Papilledema and Optic Disc Drusen 标题:视神经头三维结构分析对视乳头水肿和视盘玻璃疣的有力鉴别 链接:https://arxiv.org/abs/2112.09970

作者:Michaël J. A. Girard,Satish K. Panda,Tin Aung Tun,Elisabeth A. Wibroe,Raymond P. Najjar,Aung Tin,Alexandre H. Thiéry,Steffen Hamann,Clare Fraser,Dan Milea 机构:Ophthalmic Engineering & Innovation Laboratory, Singapore Eye Research Institute, Singapore, Duke-NUS Graduate Medical School, Singapore, Institute for Molecular and Clinical Ophthalmology, Basel, Switzerland 摘要:目的:(1)开发一种在三维光学相干断层扫描(OCT)中识别视神经头(ONH)主要组织结构的深度学习算法;(2)利用这些信息有力地区分健康人、视盘玻璃疣(ODD)和视乳头水肿。这是一项横断面对比研究,包括确诊的ODD(105只眼)、高颅内压引起的视乳头水肿(51只眼)和健康对照组(100只眼)。使用OCT获取ONH的3D扫描,然后进行处理以提高深部组织的可见度。首先,使用984张B扫描(来自130只眼睛)开发了一种深度学习算法,以识别主要的神经/结缔组织和ODD区域。我们使用骰子系数(DC)评估算法的性能。在第二步中,设计了一种分类算法(随机森林),使用150个OCT体数据,仅根据由分割导出的drusen分数和筛板前(prelamina)肿胀分数进行三分类(1:ODD,2:视乳头水肿,3:健康)。为了评估性能,我们报告了每个类别的受试者工作特征曲线下面积(AUC)。我们的分割算法能够分离神经组织和结缔组织,以及出现的ODD区域。测试集上0.93±0.03的平均DC证实了这一点,对应于良好的性能。分类取得了很高的AUC,即ODD检测为0.99±0.01,视乳头水肿检测为0.99±0.01,健康ONH检测为0.98±0.02。我们的人工智能方法通过单次OCT扫描准确地区分了ODD和视乳头水肿。我们的分类性能非常出色,但需要注意的是,仍需在更大的人群中进行验证。我们的方法有可能使OCT成为神经眼科诊断成像的主要手段。 摘要:Purpose: (1) To develop a deep learning algorithm to identify major tissue structures of the optic nerve head (ONH) in 3D optical coherence tomography (OCT) scans; (2) to exploit such information to robustly differentiate among healthy, optic disc drusen (ODD), and papilledema ONHs. It was a cross-sectional comparative study with confirmed ODD (105 eyes), papilledema due to high intracranial pressure (51 eyes), and healthy controls (100 eyes). 3D scans of the ONHs were acquired using OCT, then processed to improve deep-tissue visibility. At first, a deep learning algorithm was developed using 984 B-scans (from 130 eyes) in order to identify: major neural/connective tissues, and ODD regions. The performance of our algorithm was assessed using the Dice coefficient (DC). In a 2nd step, a classification algorithm (random forest) was designed using 150 OCT volumes to perform 3-class classifications (1: ODD, 2: papilledema, 3: healthy) strictly from their drusen and prelamina swelling scores (derived from the segmentations). To assess performance, we reported the area under the receiver operating characteristic curves (AUCs) for each class. Our segmentation algorithm was able to isolate neural and connective tissues, and ODD regions whenever present. This was confirmed by an average DC of 0.93±0.03 on the test set, corresponding to good performance. Classification was achieved with high AUCs, i.e. 0.99±0.01 for the detection of ODD, 0.99±0.01 for the detection of papilledema, and 0.98±0.02 for the detection of healthy ONHs. Our AI approach accurately discriminated ODD from papilledema, using a single OCT scan. Our classification performance was excellent, with the caveat that validation in a much larger population is warranted. Our approach may have the potential to establish OCT as the mainstay of diagnostic imaging in neuro-ophthalmology.
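
第二阶段的分类器仅以两个由分割导出的标量特征(drusen 分数、筛板前肿胀分数)做三分类。下面用 scikit-learn 给出接口层面的示意(X、y 为随机占位数据,非论文数据):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 占位数据:150 个 OCT 体数据,每个只有两个特征
X = np.random.rand(150, 2)             # 列:drusen 分数、prelamina 肿胀分数
y = np.random.randint(0, 3, size=150)  # 0: ODD, 1: 视乳头水肿, 2: 健康

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# 按一对多方式报告 ROC 曲线下面积(AUC)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc_ovr").mean())
```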

其他神经网络|深度学习|模型|建模(13篇)

【1】 Learning Physics Properties of Fabrics and Garments with a Physics Similarity Neural Network 标题:用物理相似神经网络学习面料和服装的物理性质 链接:https://arxiv.org/abs/2112.10727

作者:Li Duan,Lewis Boyd,Gerardo Aragon-Camarasa 机构:School of Computing Science, University of Glasgow, Glasgow, United Kingdom, National Manufacturing Institute Scotland, University of Strathclyde, Glasgow, United Kingdom 摘要:在本文中,我们提出通过物理相似网络(PhySNet)学习真实织物与模拟织物之间的物理相似性,从而预测真实织物和服装的物理参数。为此,我们估计电风扇产生的风速以及面积重量,以预测模拟和真实织物与服装的弯曲刚度。我们发现,PhySNet与贝叶斯优化器相结合可以预测物理参数,并将真实织物上的最新水平提高了34%,真实服装上提高了68%。 摘要:In this paper, we propose to predict the physics parameters of real fabrics and garments by learning their physics similarities between simulated fabrics via a Physics Similarity Network (PhySNet). For this, we estimate wind speeds generated by an electric fan and the area weight to predict bending stiffness of simulated and real fabrics and garments. We found that PhySNet coupled with a Bayesian optimiser can predict physics parameters and improve the state of the art by 34% for real fabrics and 68% for real garments.

【2】 Raw High-Definition Radar for Multi-Task Learning 标题:用于多任务学习的原始高清雷达 链接:https://arxiv.org/abs/2112.10646

作者:Julien Rebut,Arthur Ouaknine,Waqas Malik,Patrick Pérez 机构:valeo.ai, Paris, France, LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France 备注:9 pages, 6 figures, 5 tables 摘要:由于雷达传感器对恶劣天气条件的鲁棒性和测量速度的能力,二十多年来,雷达传感器一直是汽车领域的一部分。高清晰度(HD)成像雷达的最新进展使角分辨率降到1度以下,从而接近激光扫描的性能。然而,高清雷达产生的数据量和估计角度位置的计算成本仍然是一个挑战。在本文中,我们提出了一种新的高清雷达感知模型FFT-RadNet,该模型消除了计算距离-方位-多普勒三维张量的开销,转而学习从距离-多普勒频谱恢复角度。FFT-RadNet同时训练用于检测车辆和分割可行驶空间。在这两项任务上,它与最新的基于雷达的模型相比具有竞争力,同时需要更少的计算和内存。此外,我们还从各种环境(城市街道、公路、乡村道路)中的同步车规级传感器(摄像头、激光雷达、高清雷达)收集并标注了2小时的原始数据。这个独特的数据集被昵称为RADIal(取自"Radar, Lidar et al."),可在https://github.com/valeoai/RADIal获取。 摘要:With their robustness to adverse weather conditions and ability to measure speeds, radar sensors have been part of the automotive landscape for more than two decades. Recent progress toward High Definition (HD) Imaging radar has driven the angular resolution below the degree, thus approaching laser scanning performance. However, the amount of data a HD radar delivers and the computational cost to estimate the angular positions remain a challenge. In this paper, we propose a novel HD radar sensing model, FFT-RadNet, that eliminates the overhead of computing the Range-Azimuth-Doppler 3D tensor, learning instead to recover angles from a Range-Doppler spectrum. FFT-RadNet is trained both to detect vehicles and to segment free driving space. On both tasks, it competes with the most recent radar-based models while requiring less compute and memory. Also, we collected and annotated 2-hour worth of raw data from synchronized automotive-grade sensors (camera, laser, HD radar) in various environments (city street, highway, countryside road). This unique dataset, nick-named RADIal for "Radar, Lidar et al.", is available at https://github.com/valeoai/RADIal.

【3】 Learning to integrate vision data into road network data 标题:学习将视觉数据集成到道路网数据中 链接:https://arxiv.org/abs/2112.10624

作者:Oliver Stromann,Alireza Razavi,Michael Felsberg 机构:⋆ Computer Vision Laboratory, Link¨oping University, Sweden, †Scania CV AB, Autonomous Transport Solution Research, Sweden 备注:Submitted to ICASSP'22 摘要:道路网络是互联和自动驾驶车辆的核心基础设施,但为机器学习应用创建有意义的表示是一项具有挑战性的任务。在这项工作中,我们建议将遥感视觉数据集成到道路网络数据中,以改进图神经网络的嵌入。我们提出了一种基于时空道路和交通特征的道路边分段方法,从而可以利用卫星图像和数字表面模型的视觉特征丰富道路网络的属性集。我们证明,道路边分段和视觉数据的集成都可以提高道路类型分类任务的性能,并且我们在中国成都的OSM+DiDi Chuxing数据集上实现了最先进的性能。 摘要:Road networks are the core infrastructure for connected and autonomous vehicles, but creating meaningful representations for machine learning applications is a challenging task. In this work, we propose to integrate remote sensing vision data into road network data for improved embeddings with graph neural networks. We present a segmentation of road edges based on spatio-temporal road and traffic characteristics, which allows to enrich the attribute set of road networks with visual features of satellite imagery and digital surface models. We show that both, the segmentation and the integration of vision data can increase performance on a road type classification task, and we achieve state-of-the-art performance on the OSM+DiDi Chuxing dataset on Chengdu, China.

【4】 General Greedy De-bias Learning 标题:通用贪婪去偏学习 链接:https://arxiv.org/abs/2112.10572

作者:Xinzhe Han,Shuhui Wang,Chi Su,Qingming Huang,Qi Tian 备注:under-review 摘要:神经网络通常依靠数据集的虚假相关性而不是感兴趣任务的内在属性进行预测,从而在分布外(OOD)测试数据上面临急剧的性能退化。现有的去偏学习框架试图通过偏差标注来捕获特定的数据集偏差,但它们无法处理复杂的OOD场景。另一些方法通过对低能力有偏模型或损失函数的特殊设计来隐式地识别数据集偏差,但当训练和测试数据来自同一分布时,这些方法的性能会下降。在本文中,我们提出了一个通用贪婪去偏学习框架(GGD),它以类似函数空间中梯度下降的方式,贪婪地训练有偏模型和基础模型。它鼓励基础模型将重点放在有偏模型难以解决的样本上,从而在测试阶段保持对虚假相关性的鲁棒性。GGD在很大程度上提高了模型在各种任务上的OOD泛化能力,但有时会高估偏差水平,在分布内测试中退化。我们进一步重新分析了GGD的集成过程,并受课程学习启发,将课程正则化引入GGD,实现了分布内和分布外性能之间的良好平衡。大量的图像分类、对抗性问答和视觉问答实验证明了该方法的有效性。无论是在具有先验知识的任务特定有偏模型设置下,还是在不具有先验知识的自集成有偏模型设置下,GGD都可以学到一个更鲁棒的基础模型。 摘要:Neural networks often make predictions relying on the spurious correlations from the datasets rather than the intrinsic properties of the task of interest, facing sharp degradation on out-of-distribution (OOD) test data. Existing de-bias learning frameworks try to capture specific dataset bias by bias annotations, they fail to handle complicated OOD scenarios. Others implicitly identify the dataset bias by the special design on the low capability biased model or the loss, but they degrade when the training and testing data are from the same distribution. In this paper, we propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space. It encourages the base model to focus on examples that are hard to solve with biased models, thus remaining robust against spurious correlations in the test stage. GGD largely improves models' OOD generalization ability on various tasks, but sometimes over-estimates the bias level and degrades on the in-distribution test. We further re-analyze the ensemble process of GGD and introduce the Curriculum Regularization into GGD inspired by curriculum learning, which achieves a good trade-off between in-distribution and out-of-distribution performance. Extensive experiments on image classification, adversarial question answering, and visual question answering demonstrate the effectiveness of our method. GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
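
"鼓励基础模型关注有偏模型难以解决的样本"的一种常见实现是按有偏模型的剩余难度对基础模型的损失加权。下面是按这一思路写出的假设性草图(GGD 的具体加权形式以论文为准):

```python
import torch
import torch.nn.functional as F

def debiased_base_loss(base_logits, biased_logits, labels):
    """示意:有偏模型对正确类别的置信度越高,该样本的权重越低,
    从而把基础模型的学习推向有偏模型"答不对"的样本。"""
    with torch.no_grad():
        p_biased = F.softmax(biased_logits, dim=1)
        w = 1.0 - p_biased.gather(1, labels[:, None]).squeeze(1)  # 剩余难度
    ce = F.cross_entropy(base_logits, labels, reduction="none")
    return (w * ce).mean()
```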

【5】 Scale-Net: Learning to Reduce Scale Differences for Large-Scale Invariant Image Matching 标题:Scale-Net:学习减小尺度差异以实现大尺度不变的图像匹配 链接:https://arxiv.org/abs/2112.10485

作者:Yujie Fu,Yihong Wu 摘要:大多数图像匹配方法在遇到图像间的大尺度变化时表现不佳。为了解决这个问题,首先,我们提出了一种尺度差异感知的图像匹配方法(SDAIM),它在局部特征提取之前,根据估计的尺度比调整图像对中两幅图像的大小,从而减小图像间的尺度差异。其次,为了准确估计尺度比,我们提出了一种共视性注意力增强匹配模块(CVARM),并基于CVARM设计了一种新的神经网络,称为Scale-Net。所提出的CVARM可以对图像对中的共视区域施加更多关注,并抑制来自仅在一幅图像中可见区域的干扰。定量和定性实验证实,与现有的所有尺度比估计方法相比,所提出的Scale-Net具有更高的尺度比估计精度和更好的泛化能力。在图像匹配和相对位姿估计任务上的进一步实验表明,我们的SDAIM和Scale-Net能够极大地提升代表性局部特征以及最先进的局部特征匹配方法的性能。 摘要:Most image matching methods perform poorly when encountering large scale changes in images. To solve this problem, firstly, we propose a scale-difference-aware image matching method (SDAIM) that reduces image scale differences before local feature extraction, via resizing both images of an image pair according to an estimated scale ratio. Secondly, in order to accurately estimate the scale ratio, we propose a covisibility-attention-reinforced matching module (CVARM) and then design a novel neural network, termed as Scale-Net, based on CVARM. The proposed CVARM can lay more stress on covisible areas within the image pair and suppress the distraction from those areas visible in only one image. Quantitative and qualitative experiments confirm that the proposed Scale-Net has higher scale ratio estimation accuracy and much better generalization ability compared with all the existing scale ratio estimation methods. Further experiments on image matching and relative pose estimation tasks demonstrate that our SDAIM and Scale-Net are able to greatly boost the performance of representative local features and state-of-the-art local feature matching methods.
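
SDAIM 的核心动作很简单:按估计的尺度比在特征提取前缩放图像对。下面的示意让两幅图各承担 sqrt(r) 的缩放(这一对称分配方式是假设;scale_ratio 假定由 Scale-Net 给出):

```python
import cv2
import numpy as np

def equalize_scale(img_a: np.ndarray, img_b: np.ndarray, scale_ratio: float):
    """按估计的尺度比 r(A 相对 B)缩放两幅图,以减小图像对之间的尺度差异(示意)。"""
    s = np.sqrt(scale_ratio)
    a = cv2.resize(img_a, None, fx=1.0 / s, fy=1.0 / s, interpolation=cv2.INTER_AREA)
    b = cv2.resize(img_b, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
    return a, b   # 随后再对 a、b 做局部特征提取与匹配
```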

【6】 Topology Preserving Local Road Network Estimation from Single Onboard Camera Image 标题:基于单幅车载摄像机图像的局部路网拓扑保持估计 链接:https://arxiv.org/abs/2112.10155

作者:Yigit Baran Can,Alexander Liniger,Danda Pani Paudel,Luc Van Gool 机构:Computer Vision Lab, ETH Zurich, VISICS, ESAT/PSI, KU Leuven 摘要:道路网络拓扑知识对于自主规划和导航至关重要。然而,从单幅图像中恢复这样的拓扑还只得到部分探索;此外,这种拓扑需要参考地平面来表达,而驾驶动作也正是在地平面上执行的。本文旨在复杂城市环境中直接在鸟瞰图(BEV)下提取局部道路网络拓扑,唯一的输入是单幅车载前视相机图像。我们用一组有向车道曲线及其相互作用来表示道路拓扑,后者通过曲线的交点来刻画。为了更好地捕获拓扑,我们引入了最小环(minimal cycles)及其覆盖的概念:最小环是由(两个交点之间的)有向曲线段构成的最小的环;覆盖则是一组曲线,其曲线段参与构成某个最小环。我们首先证明覆盖足以唯一地表示道路拓扑;随后,将覆盖与车道曲线监督一起用于监督深度神经网络,使其学会从单幅输入图像预测道路拓扑。在NuScenes和Argoverse基准上的结果明显优于基线。我们的源代码将公开提供。 摘要:Knowledge of the road network topology is crucial for autonomous planning and navigation. Yet, recovering such topology from a single image has only been explored in part. Furthermore, it needs to refer to the ground plane, where also the driving actions are taken. This paper aims at extracting the local road network topology, directly in the bird's-eye-view (BEV), all in a complex urban setting. The only input consists of a single onboard, forward looking camera image. We represent the road topology using a set of directed lane curves and their interactions, which are captured using their intersection points. To better capture topology, we introduce the concept of \emph{minimal cycles} and their covers. A minimal cycle is the smallest cycle formed by the directed curve segments (between two intersections). The cover is a set of curves whose segments are involved in forming a minimal cycle. We first show that the covers suffice to uniquely represent the road topology. The covers are then used to supervise deep neural networks, along with the lane curve supervision. These learn to predict the road topology from a single input image. The results on the NuScenes and Argoverse benchmarks are significantly better than those obtained with baselines. Our source code will be made publicly available.
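"最小环及其覆盖"可以用一个小图示意:把交点作为节点、交点间的有向车道曲线段作为边,枚举有向环。这里用"长度最短的环"来近似论文中的最小环定义,属于我们的简化假设;示例图的数据也是虚构的:

```python
import networkx as nx

# 节点为交点,边为交点之间的有向车道曲线段(虚构的小例子)
G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("B", "A")])

cycles = list(nx.simple_cycles(G))                     # 枚举所有有向环
min_len = min(len(c) for c in cycles)
minimal_cycles = [c for c in cycles if len(c) == min_len]
cover = {node for c in minimal_cycles for node in c}   # 参与最小环的元素集合
print(minimal_cycles, cover)
```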

【7】 RoboAssembly: Learning Generalizable Furniture Assembly Policy in a Novel Multi-robot Contact-rich Simulation Environment 标题:RoboAssembly:在新型多机器人富接触仿真环境中学习可泛化的家具装配策略 链接:https://arxiv.org/abs/2112.10143

作者:Mingxin Yu,Lin Shao,Zhehuan Chen,Tianhao Wu,Qingnan Fan,Kaichun Mo,Hao Dong 机构:Peking University, Stanford University 备注:Submitted to IEEE International Conference on Robotics and Automation (ICRA) 2022 摘要:零件装配是机器人技术中一项典型但具有挑战性的任务:机器人将一组独立零件装配成一个完整的形状。本文为家具装配开发了一个机器人装配仿真环境。我们将零件装配任务形式化为一个具体的强化学习问题,并提出了一条让机器人学习装配多种椅子的流水线。实验表明,在未见过的椅子上测试时,我们的方法在以物体为中心的设置下成功率为74.5%,在完整设置下为50.0%。我们采用RRT-Connect算法作为基线,它在显著更长的计算时间之后成功率仅为18.8%。补充材料和视频可在我们的项目网页上获取。 摘要:Part assembly is a typical but challenging task in robotics, where robots assemble a set of individual parts into a complete shape. In this paper, we develop a robotic assembly simulation environment for furniture assembly. We formulate the part assembly task as a concrete reinforcement learning problem and propose a pipeline for robots to learn to assemble a diverse set of chairs. Experiments show that when testing with unseen chairs, our approach achieves a success rate of 74.5% under the object-centric setting and 50.0% under the full setting. We adopt an RRT-Connect algorithm as the baseline, which only achieves a success rate of 18.8% after a significantly longer computation time. Supplemental materials and videos are available on our project webpage.
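"把零件装配形式化为强化学习问题"可用一个玩具环境示意:状态为零件位姿、动作为位姿增量、奖励为与目标装配位姿距离的改善。环境接口、奖励函数与成功阈值都是我们的假设,策略此处用比例控制器代替:

```python
import numpy as np

class ToyAssemblyEnv:
    def __init__(self, goal):
        self.goal = np.asarray(goal, dtype=float)
        self.pose = np.zeros_like(self.goal)

    def step(self, action):
        prev = np.linalg.norm(self.pose - self.goal)
        self.pose = self.pose + np.asarray(action, dtype=float)
        dist = np.linalg.norm(self.pose - self.goal)
        reward = prev - dist          # 距离改善作为稠密奖励(假设)
        done = dist < 1e-2            # 足够接近目标位姿即视为装配成功(假设)
        return self.pose.copy(), reward, done

env = ToyAssemblyEnv(goal=[1.0, 0.5, 0.2])
obs = env.pose
for _ in range(100):
    obs, r, done = env.step(0.1 * (env.goal - obs))   # 比例控制器代替学习到的策略
    if done:
        break
print("final pose:", obs)
```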

【8】 MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks 标题:MoCaNet:基于规范化网络的野外运动重定向 链接:https://arxiv.org/abs/2112.10082

作者:Wentao Zhu,Zhuoqian Yang,Ziang Di,Wayne Wu,Yizhou Wang,Chen Change Loy 机构:School of Computer Science, Peking University, Shanghai AI Laboratory, SenseTime Research, Southeast University, S-Lab, Nanyang Technological University 备注:Accepted by AAAI 2022. The first two authors contributed equally. Project page: this https URL 摘要:我们提出了一个新颖的框架,将三维运动重定向任务从受控环境带入野外场景。特别地,我们的方法能够在不使用任何动作捕捉系统或三维重建过程的情况下,把二维单目视频中角色的身体运动重定向到三维角色上。该方法旨在利用海量在线视频进行无监督训练,无需3D标注或运动-身体配对信息。所提方法建立在两种新的规范化操作之上:结构规范化与视角规范化。通过这些规范化操作及由此导出的正则项进行训练,我们的方法学会将骨架序列分解为三个相互独立的语义子空间,即运动、结构和视角。这种解耦表示使得从2D到3D的高精度运动重定向成为可能。我们的方法在具有较大体型差异和高难度动作的运动迁移基准上取得了优异的性能。值得注意的是,规范化后的骨架序列可以作为人体运动的一种解耦且可解释的表示,有利于动作分析和运动检索。 摘要:We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios. In particular, our method is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure. It is designed to leverage massive online videos for unsupervised training, needless of 3D annotations or motion-body pairing information. The proposed method is built upon two novel canonicalization operations, structure canonicalization and view canonicalization. Trained with the canonicalization operations and the derived regularizations, our method learns to factorize a skeleton sequence into three independent semantic subspaces, i.e., motion, structure, and view angle. The disentangled representation enables motion retargeting from 2D to 3D with high precision. Our method achieves superior performance on motion transfer benchmarks with large body variations and challenging actions. Notably, the canonicalized skeleton sequence could serve as a disentangled and interpretable representation of human motion that benefits action analysis and motion retrieval.
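"将骨架序列分解为运动/结构/视角三个子空间"的接口形式可以用一个极简的PyTorch骨架示意;网络结构、隐层维度与时间池化方式都是为演示而作的假设,远比论文的规范化网络简单:

```python
import torch
import torch.nn as nn

class ToyFactorizer(nn.Module):
    def __init__(self, j=17, d=32):
        super().__init__()
        self.enc_motion = nn.GRU(input_size=2 * j, hidden_size=d, batch_first=True)
        self.enc_structure = nn.Linear(2 * j, d)   # 结构码:时不变
        self.enc_view = nn.Linear(2 * j, d)        # 视角码:时不变
        self.dec = nn.Linear(3 * d, 2 * j)

    def forward(self, seq):                        # seq: (B, T, 2J) 的二维骨架序列
        m, _ = self.enc_motion(seq)                # 随时间变化的运动码
        pooled = seq.mean(1, keepdim=True)         # 用时间平均近似时不变量(假设)
        s = self.enc_structure(pooled).expand_as(m)
        v = self.enc_view(pooled).expand_as(m)
        return self.dec(torch.cat([m, s, v], dim=-1))

print(ToyFactorizer()(torch.randn(4, 64, 34)).shape)   # torch.Size([4, 64, 34])
```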

【9】 Continual Learning of a Mixed Sequence of Similar and Dissimilar Tasks 标题:相似和不相似任务混合序列的持续学习 链接:https://arxiv.org/abs/2112.10017

作者:Zixuan Ke,Bing Liu,Xingchang Huang 机构:Department of Computer Science, University of Illinois at Chicago, ETH Zurich 备注:None 摘要:关于任务序列持续学习的现有研究集中于处理灾难性遗忘,其中各任务被假定彼此不同、几乎没有共享知识;也有一些工作在任务相似且共享知识时,将先前学到的知识迁移到新任务。就我们所知,目前还没有一种技术能够在学习相似与不相似任务混合序列的同时,既处理遗忘,又能前后双向迁移知识。本文提出了这样一种技术,在同一网络中学习这两类任务:对于不相似的任务,算法侧重于处理遗忘;对于相似的任务,算法侧重于有选择地迁移从先前相似任务中学到的知识,以改进新任务的学习。此外,该算法还能自动检测新任务是否与先前的某些任务相似。使用混合任务序列的实证评估证明了所提模型的有效性。 摘要:Existing research on continual learning of a sequence of tasks focused on dealing with catastrophic forgetting, where the tasks are assumed to be dissimilar and have little shared knowledge. Some work has also been done to transfer previously learned knowledge to the new task when the tasks are similar and have shared knowledge. To the best of our knowledge, no technique has been proposed to learn a sequence of mixed similar and dissimilar tasks that can deal with forgetting and also transfer knowledge forward and backward. This paper proposes such a technique to learn both types of tasks in the same network. For dissimilar tasks, the algorithm focuses on dealing with forgetting, and for similar tasks, the algorithm focuses on selectively transferring the knowledge learned from some similar previous tasks to improve the new task learning. Additionally, the algorithm automatically detects whether a new task is similar to any previous tasks. Empirical evaluation using sequences of mixed tasks demonstrates the effectiveness of the proposed model.

【10】 Does Explainable Machine Learning Uncover the Black Box in Vision Applications? 标题:可解释机器学习能揭开视觉应用中的黑匣子吗? 链接:https://arxiv.org/abs/2112.09898

作者:Manish Narwaria 机构:Department of Electrical Engineering, Indian Institute of Technology Jodhpur 备注:None 摘要:机器学习(ML),尤其是深度学习(DL),已成为许多视觉应用(如目标检测、超分辨率、分割、目标跟踪等)中非常流行的工具。几乎与此同时,视觉中ML的可解释性问题(即解释/阐述一个训练好的ML模型如何做出决策的能力)也受到了各方的广泛关注。然而,我们认为当前可解释ML背后的理念存在某些局限,由此产生的解释可能无法真正揭开黑盒ML模型。为阐述这一主张,我们首先提出了一些在相关文献中未被充分讨论的基本问题,并就ML中的可解释性如何通过依赖相关领域中更严格的原则而受益给出了我们的观点。 摘要:Machine learning (ML) in general and deep learning (DL) in particular has become an extremely popular tool in several vision applications (like object detection, super resolution, segmentation, object tracking etc.). Almost in parallel, the issue of explainability in ML (i.e. the ability to explain/elaborate the way a trained ML model arrived at its decision) in vision has also received fairly significant attention from various quarters. However, we argue that the current philosophy behind explainable ML suffers from certain limitations, and the resulting explanations may not meaningfully uncover black box ML models. To elaborate our assertion, we first raise a few fundamental questions which have not been adequately discussed in the corresponding literature. We also provide perspectives on how explainability in ML can benefit by relying on more rigorous principles in the related areas.

【11】 LegoDNN: Block-grained Scaling of Deep Neural Networks for Mobile Vision 标题:LegoDNN:面向移动视觉的深度神经网络块粒度伸缩 链接:https://arxiv.org/abs/2112.09852

作者:Rui Han,Qinglong Zhang,Chi Harold Liu,Guoren Wang,Jian Tang,Lydia Y. Chen 机构:Beijing Institute of Technology, Beijing, P.R. China, Midea Group, Shanghai, P.R. China, TU Delft, Delft, Netherlands 备注:None 摘要:深度神经网络(DNN)已成为移动与嵌入式系统中无处不在的技术,用于图像/物体识别和分类等应用。同时执行多个DNN的趋势,加剧了在资源受限的移动设备上满足严格延迟/精度要求的既有困难。已有工作根据资源动态伸缩模型尺寸,以探索精度与资源之间的权衡;然而,这类模型伸缩方法面临两大挑战:(i)模型尺寸的搜索空间巨大;(ii)不同模型组合的训练时间长得令人望而却步。本文提出LegoDNN,一种轻量级、块粒度的伸缩方案,用于在移动视觉系统中运行多DNN工作负载。LegoDNN只提取并训练DNN中少量公共块(如VGG中5个、ResNet中8个),从而保证较短的模型训练时间;在运行时,LegoDNN以最优方式组合这些块的后代模型,在给定资源和延迟约束下最大化精度,同时通过智能的块级伸缩降低切换开销。我们在TensorFlow Lite中实现了LegoDNN,并用12个流行的DNN模型将其与最先进技术(FLOP缩放、知识蒸馏和模型压缩)进行了广泛对比。评估结果表明,LegoDNN在不增加训练时间的情况下,在模型尺寸上提供1,296倍到279,936倍的更多选项,推理精度最多提升31.74%,伸缩能耗降低71.07%。 摘要:Deep neural networks (DNNs) have become ubiquitous techniques in mobile and embedded systems for applications such as image/object recognition and classification. The trend of executing multiple DNNs simultaneously exacerbate the existing limitations of meeting stringent latency/accuracy requirements on resource constrained mobile devices. The prior art sheds light on exploring the accuracy-resource tradeoff by scaling the model sizes in accordance to resource dynamics. However, such model scaling approaches face to imminent challenges: (i) large space exploration of model sizes, and (ii) prohibitively long training time for different model combinations. In this paper, we present LegoDNN, a lightweight, block-grained scaling solution for running multi-DNN workloads in mobile vision systems. LegoDNN guarantees short model training times by only extracting and training a small number of common blocks (e.g. 5 in VGG and 8 in ResNet) in a DNN. At run-time, LegoDNN optimally combines the descendant models of these blocks to maximize accuracy under specific resources and latency constraints, while reducing switching overhead via smart block-level scaling of the DNN. We implement LegoDNN in TensorFlow Lite and extensively evaluate it against state-of-the-art techniques (FLOP scaling, knowledge distillation and model compression) using a set of 12 popular DNN models. Evaluation results show that LegoDNN provides 1,296x to 279,936x more options in model sizes without increasing training time, thus achieving as much as 31.74% improvement in inference accuracy and 71.07% reduction in scaling energy consumptions.
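运行时"在资源与延迟约束下最优组合各块的后代模型"本质上是一个组合选择问题。下面用穷举在一个虚构的小例子上示意这一选择逻辑(每个块的候选延迟与精度贡献均为编造的数字):

```python
import itertools

blocks = [
    [(5, 0.10), (3, 0.06), (1, 0.02)],   # 块1的候选变体:(延迟ms, 精度贡献)
    [(4, 0.08), (2, 0.05)],              # 块2的候选变体
    [(6, 0.12), (2, 0.04)],              # 块3的候选变体
]
budget = 9                                # 延迟预算(ms)
best = max(
    (combo for combo in itertools.product(*blocks)
     if sum(c[0] for c in combo) <= budget),
    key=lambda combo: sum(c[1] for c in combo),
)
print(best, "latency:", sum(c[0] for c in best),
      "acc gain:", round(sum(c[1] for c in best), 3))
```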

【12】 Neurashed: A Phenomenological Model for Imitating Deep Learning Training 标题:Neurashed:一种模仿深度学习训练的现象学模型 链接:https://arxiv.org/abs/2112.09741

作者:Weijie J. Su 备注:8 pages 摘要:为了在未来十年推进深度学习方法,需要一个用于推理现代神经网络的理论框架。尽管揭示深度学习为何如此有效的努力日益增多,但全面的图景仍然缺失,这表明更好的理论是可能的。我们认为,未来的深度学习理论应当继承三个特征:层次化结构的网络架构、用基于随机梯度的方法迭代优化的参数、以及从数据中以压缩方式演化而来的信息。作为一个实例,我们将这些特征整合进一个名为neurashed的图形模型。该模型有效地解释了深度学习中一些常见的经验模式;特别是,neurashed有助于理解隐式正则化、信息瓶颈和局部弹性。最后,我们讨论了neurashed如何指导深度学习理论的发展。 摘要:To advance deep learning methodologies in the next decade, a theoretical framework for reasoning about modern neural networks is needed. While efforts are increasing toward demystifying why deep learning is so effective, a comprehensive picture remains lacking, suggesting that a better theory is possible. We argue that a future deep learning theory should inherit three characteristics: a \textit{hierarchically} structured network architecture, parameters \textit{iteratively} optimized using stochastic gradient-based methods, and information from the data that evolves \textit{compressively}. As an instantiation, we integrate these characteristics into a graphical model called \textit{neurashed}. This model effectively explains some common empirical patterns in deep learning. In particular, neurashed enables insights into implicit regularization, information bottleneck, and local elasticity. Finally, we discuss how neurashed can guide the development of deep learning theories.

【13】 Learned Half-Quadratic Splitting Network for Magnetic Resonance Image Reconstruction 标题:用于磁共振图像重建的学习型半二次分裂网络 链接:https://arxiv.org/abs/2112.09760

作者:Bingyu Xin,Timothy S. Phan,Leon Axel,Dimitris N. Metaxas 机构:Department of Computer Science, Rutgers University, Piscataway, NJ, USA; Department of Radiology, New York University, New York, NY, USA 摘要:从高度欠采样的k空间数据重建磁共振(MR)图像是加速磁共振成像(MRI)技术的关键。近年来,基于深度学习的方法在这一任务上展现出巨大潜力。本文提出了一种用于MR图像重建的学习型半二次分裂算法,并将其实现为一个展开的深度学习网络结构。我们在一个公开的心脏MR数据集上与DC-CNN和LPDNet进行了性能比较:我们的方法以更少的模型参数和更快的重建速度,在定量与定性结果上均优于其他方法。最后,我们扩大模型规模以获得更高的重建质量,在5倍和10倍加速下,峰值信噪比分别比LPDNet提高1.76 dB和2.74 dB。我们的代码公开于 https://github.com/hellopipu/HQS-Net. 摘要:Magnetic Resonance (MR) image reconstruction from highly undersampled $k$-space data is critical in accelerated MR imaging (MRI) techniques. In recent years, deep learning-based methods have shown great potential in this task. This paper proposes a learned half-quadratic splitting algorithm for MR image reconstruction and implements the algorithm in an unrolled deep learning network architecture. We compare the performance of our proposed method on a public cardiac MR dataset against DC-CNN and LPDNet, and our method outperforms other methods in both quantitative results and qualitative results with fewer model parameters and faster reconstruction speed. Finally, we enlarge our model to achieve superior reconstruction quality, and the improvement is $1.76$ dB and $2.74$ dB over LPDNet in peak signal-to-noise ratio on $5\times$ and $10\times$ acceleration, respectively. Code for our method is publicly available at https://github.com/hellopipu/HQS-Net.
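半二次分裂(HQS)把重建拆成交替的两步:先验/去噪步(z子问题)和数据一致性步(x子问题,在k空间有逐点闭式解)。下面是一个可运行的NumPy示意;展开网络中的去噪步由可学习模块担任,这里用外部传入的denoiser函数代替,整个草图只为说明迭代结构,并非论文实现:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hqs_mri(y, mask, denoiser, mu=0.1, iters=10):
    # y: 欠采样k空间(未采样处为0);mask: 二值采样掩码;denoiser: 去噪步(假设由外部提供)
    x = np.fft.ifft2(y)                                     # 零填充初始化
    for _ in range(iters):
        z = denoiser(x.real) + 1j * denoiser(x.imag)        # z子问题:先验/去噪
        k = (y * mask + mu * np.fft.fft2(z)) / (mask + mu)  # x子问题:k空间闭式解
        x = np.fft.ifft2(k)
    return np.abs(x)

img = np.zeros((64, 64)); img[20:44, 20:44] = 1.0           # 虚构的测试图像
mask = (np.random.rand(64, 64) < 0.3).astype(float)         # 随机30%采样
rec = hqs_mri(np.fft.fft2(img) * mask, mask, lambda im: gaussian_filter(im, 1.0))
print(rec.shape)
```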

其他(10篇)

【1】 Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs 标题:Mega-NeRF:面向虚拟漫游飞行的大规模NeRF可扩展构建 链接:https://arxiv.org/abs/2112.10703

作者:Haithem Turki,Deva Ramanan,Mahadev Satyanarayanan 机构:Carnegie Mellon University, Argo AI 备注:Project page: this https URL GitHub: this https URL 摘要:我们探索如何利用神经辐射场(NeRF),从主要由无人机采集、跨越建筑物乃至多个城市街区的大规模视觉数据中构建可交互的三维环境。与传统上评估NeRF所用的单物体场景相比,这一设定带来多重挑战,包括:(1)需要合并数千张光照条件各异、且每张仅覆盖场景一小部分的图像;(2)模型容量和光线采样需求高得令人望而却步,超出了单个GPU所能直接训练的范围;(3)可能的视点数量任意多,使得无法像实时NeRF渲染器通常做的那样预先计算所有相关信息。为应对这些挑战,我们首先分析大规模场景的可见性统计,由此引出一种稀疏网络结构,其中参数专用于场景的不同区域。我们提出了一种简单的几何聚类算法,将训练图像(更准确地说是像素)划分到可并行训练的不同NeRF子模块中。我们在取自Quad 6k和UrbanScene3D数据集的场景以及我们自己拍摄的无人机镜头上评估了该方法,结果显示训练速度提高3倍,平均PSNR提升超过11%。随后,我们在Mega-NeRF之上对近期的NeRF快速渲染器进行了实证评估,并提出了一种利用时间一致性的新方法。我们的技术比传统NeRF渲染快40倍,同时PSNR质量保持在0.5 dB以内,保真度超过现有的快速渲染器。 摘要:We explore how to leverage neural radiance fields (NeRFs) to build interactive 3D environments from large-scale visual captures spanning buildings or even multiple city blocks collected primarily from drone data. In contrast to the single object scenes against which NeRFs have been traditionally evaluated, this setting poses multiple challenges including (1) the need to incorporate thousands of images with varying lighting conditions, all of which capture only a small subset of the scene, (2) prohibitively high model capacity and ray sampling requirements beyond what can be naively trained on a single GPU, and (3) an arbitrarily large number of possible viewpoints that make it unfeasible to precompute all relevant information beforehand (as real-time NeRF renderers typically do). To address these challenges, we begin by analyzing visibility statistics for large-scale scenes, motivating a sparse network structure where parameters are specialized to different regions of the scene. We introduce a simple geometric clustering algorithm that partitions training images (or rather pixels) into different NeRF submodules that can be trained in parallel. We evaluate our approach across scenes taken from the Quad 6k and UrbanScene3D datasets as well as against our own drone footage and show a 3x training speedup while improving PSNR by over 11% on average. We subsequently perform an empirical evaluation of recent NeRF fast renderers on top of Mega-NeRF and introduce a novel method that exploits temporal coherence. Our technique achieves a 40x speedup over conventional NeRF rendering while remaining within 0.5 db in PSNR quality, exceeding the fidelity of existing fast renderers.
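摘要中的几何聚类可以用"最近簇中心分配"来示意:按场景平面坐标把相机(及其像素)划分给不同的NeRF子模块,以便并行训练。簇中心与相机坐标都是虚构示例,真实实现还需处理像素级归属与区域重叠:

```python
import numpy as np

centroids = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])  # 各子模块的平面中心(假设)
cam_xy = np.random.rand(10, 2) * 100                            # 10台相机的平面位置(虚构)
assign = np.argmin(np.linalg.norm(cam_xy[:, None] - centroids[None], axis=-1), axis=1)
print(assign)   # 每台相机(及其像素)归属的NeRF子模块编号
```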

【2】 DeePaste -- Inpainting for Pasting 标题:DeePaste--用于粘贴的修复 链接:https://arxiv.org/abs/2112.10600

作者:Levi Kassel,Michael Werman 机构:The Hebrew University of Jerusalem, Jerusalem, Israel 摘要:监督学习训练的挑战之一是需要获取大量带标注的数据。解决这一问题的一个著名方法是以复制-粘贴的方式使用合成数据:剪切出物体并将其粘贴到相关背景上。然而,简单粗暴地粘贴物体会产生伪影,导致模型在真实数据上效果不佳。我们提出了一种在不同背景上干净地粘贴物体的新方法,使得由此构建的数据集在真实数据上能提供有竞争力的性能。方法的重点在于利用图像修复(inpainting)处理被粘贴物体的边界。我们在实例检测和前景分割上均展示了最先进的结果。 摘要:One of the challenges of supervised learning training is the need to procure a substantial amount of tagged data. A well-known method of solving this problem is to use synthetic data in a copy-paste fashion, so that we cut objects and paste them onto relevant backgrounds. Pasting the objects naively results in artifacts that cause models to give poor results on real data. We present a new method for cleanly pasting objects on different backgrounds so that the dataset created gives competitive performance on real data. The main emphasis is on the treatment of the border of the pasted object using inpainting. We show state-of-the-art results both on instance detection and foreground segmentation.
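"剪切-粘贴后用修复处理边界"的流程可用OpenCV示意:先把物体按掩码贴到背景上,再对掩码的边界带做图像修复以消除接缝。膨胀/腐蚀核大小与修复半径均为我们随意选取的参数,并非论文设置:

```python
import cv2
import numpy as np

def paste_with_inpainting(bg, obj, mask, x, y):
    h, w = mask.shape
    out = bg.copy()
    roi = out[y:y + h, x:x + w]
    roi[mask > 0] = obj[mask > 0]                                # 朴素粘贴
    kernel = np.ones((5, 5), np.uint8)
    border = cv2.dilate(mask, kernel) - cv2.erode(mask, kernel)  # 需要修复的接缝带
    full_border = np.zeros(bg.shape[:2], np.uint8)
    full_border[y:y + h, x:x + w] = border
    return cv2.inpaint(out, full_border, 3, cv2.INPAINT_TELEA)

bg = np.full((240, 320, 3), 128, np.uint8)
obj = np.zeros((64, 64, 3), np.uint8); obj[:] = (0, 0, 255)
mask = np.zeros((64, 64), np.uint8); cv2.circle(mask, (32, 32), 24, 255, -1)
result = paste_with_inpainting(bg, obj, mask, 100, 80)
print(result.shape)
```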

【3】 Image Animation with Keypoint Mask 标题:使用关键点蒙版制作图像动画 链接:https://arxiv.org/abs/2112.10457

作者:Or Toledano,Yanir Marmor,Dov Gertz 机构:Tel Aviv University 摘要:运动传输是根据给定驱动视频的运动合成单个源图像的未来视频帧的任务。由于运动表示的复杂性以及驱动视频和源图像之间的未知关系,该任务具有挑战性。尽管存在这一困难,但近年来这一问题引起了研究者的极大兴趣,并逐步得到改善。该问题可以看作是运动和外观的解耦,通常通过从关键点运动中提取运动来解决。我们选择处理一般的、无监督的设置,在这里我们需要将动画应用于任何任意对象,而不需要任何特定于域的输入结构模型。在这项工作中,我们从关键点热图中提取结构,而无需显式的运动表示。然后,通过深度生成器从图像和视频中提取结构,根据视频扭曲图像。 摘要:Motion transfer is the task of synthesizing future video frames of a single source image according to the motion from a given driving video. This task is challenging due to the complexity of motion representation and the unknown relations between the driving video and the source image. Despite this difficulty, this problem attracted great interests from researches at the recent years, with gradual improvements. The problem can be thought as decoupling of motion and appearance, which is often solved by extracting the motion from keypoint movement. We chose to tackle the generic, unsupervised setting, where we need to apply animation to any arbitrary object, without any domain specific model for the structure of the input. In this work, we extract the structure from a keypoint heatmap, without an explicit motion representation. Then, the structures from the image and the video are extracted to warp the image according to the video, by a deep generator.
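"从关键点热图中提取结构"的一种常见做法是soft-argmax:对每张热图做softmax,再取坐标期望,得到可微的关键点位置。下面是通用示意,与论文的具体网络无关:

```python
import torch

def soft_argmax(heatmaps):
    # heatmaps: (B, K, H, W) -> 返回 (B, K, 2) 的归一化关键点坐标
    b, k, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(b, k, -1), dim=-1).reshape(b, k, h, w)
    ys = torch.linspace(0, 1, h).view(1, 1, h, 1)
    xs = torch.linspace(0, 1, w).view(1, 1, 1, w)
    y = (probs * ys).sum(dim=(2, 3))          # 行坐标的期望
    x = (probs * xs).sum(dim=(2, 3))          # 列坐标的期望
    return torch.stack([x, y], dim=-1)

print(soft_argmax(torch.randn(2, 10, 64, 64)).shape)   # torch.Size([2, 10, 2])
```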

【4】 UnweaveNet: Unweaving Activity Stories 标题:UnweaveNet:解织活动故事 链接:https://arxiv.org/abs/2112.10194

作者:Will Price,Carl Vondrick,Dima Damen 机构:University of Bristol, Columbia University 摘要:我们的生活可以被看作各种活动的复杂交织:我们从一项活动切换到另一项活动,以最大化成果,或响应对我们提出的要求。观察一段无脚本日常活动的视频,我们通过一个称为"解织"(unweaving)的过程,将视频解析为构成它的各条活动线索。为此,我们引入了一种显式刻画活动线索的视频表示,称为线索库(thread bank),以及一个能够检测目标变化并恢复过去活动的神经控制器,二者共同构成UnweaveNet。我们在无脚本第一人称视角数据集EPIC-KITCHENS的序列上训练并评估UnweaveNet,并提出以自监督方式对UnweaveNet进行预训练,展示了其有效性。 摘要:Our lives can be seen as a complex weaving of activities; we switch from one activity to another, to maximise our achievements or in reaction to demands placed upon us. Observing a video of unscripted daily activities, we parse the video into its constituent activity threads through a process we call unweaving. To accomplish this, we introduce a video representation explicitly capturing activity threads called a thread bank, along with a neural controller capable of detecting goal changes and resuming of past activities, together forming UnweaveNet. We train and evaluate UnweaveNet on sequences from the unscripted egocentric dataset EPIC-KITCHENS. We propose and showcase the efficacy of pretraining UnweaveNet in a self-supervised manner.

【5】 SAGA: Stochastic Whole-Body Grasping with Contact 标题:SAGA:带接触的随机全身抓取 链接:https://arxiv.org/abs/2112.10103

作者:Yan Wu,Jiahao Wang,Yan Zhang,Siwei Zhang,Otmar Hilliges,Fisher Yu,Siyu Tang 机构:ETH Zürich, Switzerland 摘要:人类抓取合成有许多应用,包括AR/VR、电子游戏和机器人。虽然已有一些方法为物体抓取与操作生成逼真的手-物交互,但它们通常只考虑手与物体之间的交互。在这项工作中,我们的目标是合成全身抓取运动:给定一个3D物体,生成接近并抓取该物体的多样且自然的全身人体运动。这项任务具有挑战性,因为它需要同时建模全身动力学和灵巧的手指运动。为此,我们提出了SAGA(StochAstic whole-body Grasping with contAct),它由两个关键部分组成:(a)静态全身抓取姿态生成。具体而言,我们提出一个多任务生成模型,联合学习静态全身抓取姿态与人-物接触。(b)抓取运动填充。给定初始姿态与生成的全身抓取姿态分别作为运动的起始和结束姿态,我们设计了一个新颖的接触感知生成式运动填充模块,以生成一组多样的、面向抓取的运动。我们证明了方法的有效性:它是第一个能合成接近并抓取随机放置的未见物体、且逼真而富有表现力的全身运动的生成框架。代码和视频见:https://jiahaoplus.github.io/SAGA/saga.html. 摘要:Human grasping synthesis has numerous applications including AR/VR, video games, and robotics. While some methods have been proposed to generate realistic hand-object interaction for object grasping and manipulation, they typically only consider the hand interacting with objects. In this work, our goal is to synthesize whole-body grasping motion. Given a 3D object, we aim to generate diverse and natural whole-body human motions that approach and grasp the object. This task is challenging as it requires modeling both whole-body dynamics and dexterous finger movements. To this end, we propose SAGA (StochAstic whole-body Grasping with contAct) which consists of two key components: (a) Static whole-body grasping pose generation. Specifically, we propose a multi-task generative model, to jointly learn static whole-body grasping poses and human-object contacts. (b) Grasping motion infilling. Given an initial pose and the generated whole-body grasping pose as the starting and ending poses of the motion respectively, we design a novel contact-aware generative motion infilling module to generate a diverse set of grasp-oriented motions. We demonstrate the effectiveness of our method being the first generative framework to synthesize realistic and expressive whole-body motions that approach and grasp randomly placed unseen objects. The code and videos are available at: https://jiahaoplus.github.io/SAGA/saga.html.

【6】 ArcFace Knows the Gender, Too! 标题:ArcFace也知道性别! 链接:https://arxiv.org/abs/2112.10101

作者:Majid Farzaneh 备注:9 pages, 4 images, 2 tables 摘要:本文的主要思想是,如果一个模型能够识别一个人,当然,它也必须能够知道这个人的性别。因此,本文不是定义一个新的性别分类模型,而是基于面部特征使用ArcFace特征来确定性别。将人脸图像提供给ArcFace,并为该人脸获取512个特征。然后,借助传统的机器学习模型,确定性别。诸如支持向量机(SVM)、线性判别法和逻辑回归等判别方法很好地证明了从弧面提取的特征在性别类别之间产生了显著的区别。在性别分类数据集上的实验表明,基于高斯核的支持向量机能够利用ArcFace特征对性别进行分类,准确率为96.4%。 摘要:The main idea of this paper is that if a model can recognize a person, of course, it must be able to know the gender of that person, too. Therefore, instead of defining a new model for gender classification, this paper uses ArcFace features to determine gender, based on the facial features. A face image is given to ArcFace and 512 features are obtained for the face. Then, with the help of traditional machine learning models, gender is determined. Discriminative methods such as Support Vector Machine (SVM), Linear Discriminant, and Logistic Regression well demonstrate that the features extracted from the ArcFace create a remarkable distinction between the gender classes. Experiments on the Gender Classification Dataset show that SVM with Gaussian kernel is able to classify gender with an accuracy of 96.4% using ArcFace features.
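摘要的流水线非常直接:ArcFace输出512维人脸特征,再用高斯核SVM分类性别。下面用随机向量代替真实的ArcFace特征,仅示意这条流水线的骨架(数据与划分均为虚构):

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(200, 512)      # 假设:每张人脸由ArcFace产生的512维特征
y = np.random.randint(0, 2, 200)   # 0/1 性别标签(虚构)
clf = SVC(kernel="rbf").fit(X[:150], y[:150])
print("val acc:", clf.score(X[150:], y[150:]))   # 随机数据上约为50%;论文在真实特征上报告96.4%
```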

【7】 Efficient Strong Scaling Through Burst Parallel Training 标题:通过突发并行训练实现高效的强扩展 链接:https://arxiv.org/abs/2112.10065

作者:Seo Jin Park,Joshua Fried,Sunghyun Kim,Mohammad Alizadeh,Adam Belay 摘要:随着新兴的深度神经网络(DNN)模型规模不断增长,使用大型GPU集群训练DNN已成为实现可接受训练时间的基本要求。在本文中,我们考虑这样一种情况:未来集群规模的增长将使可用于训练模型的全局批大小触及一个基本极限:超过某一点后,更大的全局批大小会降低样本效率,反而增加达到目标精度所需的总时间。因此,为了进一步提升训练性能,我们必须转而考虑"强扩展"策略:保持全局批大小不变,并给每个GPU分配更小的批。不幸的是,这使得高效利用集群资源变得困难得多。我们提出了DeepPool,一个通过两个关键思想应对这一效率挑战的系统。第一,突发并行以突发方式将大量GPU分配给前台作业,以利用各层之间并行度的不均匀性。第二,GPU复用优先保证前台训练作业的吞吐量,同时填入后台训练作业以回收未充分利用的GPU资源,从而提高集群整体利用率。这两个思想结合起来,使DeepPool在集群规模较大时,单个任务的总集群吞吐量比标准数据并行提高2.2到2.4倍。 摘要:As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size that can be used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 2.2 - 2.4x improvement in total cluster throughput over standard data parallelism with a single task when the cluster scale is large.
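"强扩展"的核心约束很简单:全局批大小固定,每卡批大小随GPU数线性缩小。下面几行演示这一算术关系(数字为示例):

```python
def per_gpu_batch(global_batch, num_gpus):
    assert global_batch % num_gpus == 0, "示意起见要求整除"
    return global_batch // num_gpus

for n in (8, 64, 512):
    print(n, "GPUs ->", per_gpu_batch(4096, n), "samples/GPU")
```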

【8】 An effective coaxiality error measurement for twist drill based on line structured light sensor 标题:一种基于线结构光传感器的麻花钻同轴度误差有效测量方法 链接:https://arxiv.org/abs/2112.09873

作者:Ailing Cheng,Jiaojiao Ye,Fei Yang,Shufang Lu,Fei Gao 机构:College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China; Research Center of Intelligent Computing Software, Zhejiang Lab, Hangzhou, China 备注:17 pages, 24 figures 摘要:由于麻花钻结构复杂,其同轴度误差测量困难且具有挑战性。本文提出了一种麻花钻同轴度误差测量的新机构、新框架和新方法。该机构包括编码器、PLC控制器、线结构光传感器和高精度转台。首先,在PLC的控制下,通过线结构光传感器采集麻花钻回转时的轮廓点云数据。其次,研究了一种基于局部深度特征的GMM点云分割算法来提取刃背数据;为提高测量精度,在目标区域提取过程中设计了统计滤波器以去除离群点。然后,根据同轴度误差的两个特点,提出了一种基于轴对称轮廓差正交综合的轴线重构方法,便于预定位钻轴的最大偏差截面。最后,通过拟合基准轴线与预定位最大偏差位置处的轴线来测量同轴度误差。大量实验表明,该方法准确且鲁棒。 摘要:Since the structure of twist drill is complex, it is hard and challenging for its coaxiality error measurement. In this paper, a novel mechanism, framework and method of coaxiality error measurement for twist drill is proposed. The mechanism includes encoder, PLC controller, line structured sensor and high precision turntable. First, profile point cloud data of the twist drill is collected through the line structured light sensor when the drill turns around in the controlling of PLC. Second, a GMM-based point cloud segmentation algorithm based on local depth features is investigated to extract blade back data. To improve the measurement accuracy, a statistical filter is designed to remove outliers during the target region extraction. Then, according to two characteristics of coaxiality error, an axis reconstruction method based on orthogonal synthesis of axisymmetric contour differences is presented, which is facilitated to pre-position the maximum deviation cross sections of the drill axis. Finally, the coaxiality error is measured through fitting the benchmark axis and the axis at the pre-positioned maximum deviation position. At the end, a large number of experiments are carried out, and the results show that our method is accurate and robust.
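轴线重构中最基础的一步是对回转体点云拟合一条空间直线,可以用SVD主方向实现。这是通用的最小二乘直线拟合,并非论文的"轴对称轮廓差正交综合"方法本身;点云数据为虚构:

```python
import numpy as np

pts = np.random.randn(1000, 3) * [0.01, 0.01, 1.0]   # 虚构点云:沿z向分布的细长钻体
c = pts.mean(axis=0)
_, _, vt = np.linalg.svd(pts - c)                     # 最大奇异值方向即轴线方向
print("axis point:", c, "axis direction:", vt[0])
```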

【9】 Discovering State Variables Hidden in Experimental Data 标题:发现实验数据中隐藏的状态变量 链接:https://arxiv.org/abs/2112.10755

作者:Boyuan Chen,Kuang Huang,Sunand Raghupathi,Ishaan Chandratreya,Qiang Du,Hod Lipson 机构:Columbia University 备注:Project website with code, data, and overview video is at: this https URL 摘要:所有物理定律都被描述为状态变量之间的关系,这些状态变量给出了相关系统动力学的完整且无冗余的描述。然而,尽管计算能力与人工智能已十分普及,识别隐藏状态变量本身的过程仍难以自动化。大多数对物理现象进行数据驱动建模的方法仍然假设观测到的数据流已经对应于相关的状态变量。一个关键挑战是,在仅有高维观测数据的情况下,从零开始确定可能的状态变量集合。在这里,我们提出了一个新原则,可直接从视频流中确定被观测系统可能有多少个状态变量,以及这些变量可能是什么。我们使用从弹性双摆到火焰等各种物理动力系统的视频记录证明了该方法的有效性。在没有任何底层物理先验知识的情况下,我们的算法发现了观测动力学的内在维度,并识别出候选的状态变量集合。我们认为,这种方法有助于促进对日益复杂系统的理解、预测和控制。项目网站:https://www.cs.columbia.edu/~bchen/neural-state-variables 摘要:All physical laws are described as relationships between state variables that give a complete and non-redundant description of the relevant system dynamics. However, despite the prevalence of computing power and AI, the process of identifying the hidden state variables themselves has resisted automation. Most data-driven methods for modeling physical phenomena still assume that observed data streams already correspond to relevant state variables. A key challenge is to identify the possible sets of state variables from scratch, given only high-dimensional observational data. Here we propose a new principle for determining how many state variables an observed system is likely to have, and what these variables might be, directly from video streams. We demonstrate the effectiveness of this approach using video recordings of a variety of physical dynamical systems, ranging from elastic double pendulums to fire flames. Without any prior knowledge of the underlying physics, our algorithm discovers the intrinsic dimension of the observed dynamics and identifies candidate sets of state variables. We suggest that this approach could help catalyze the understanding, prediction and control of increasingly complex systems. Project website is at: https://www.cs.columbia.edu/~bchen/neural-state-variables
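"从高维观测中发现内在维度"可以先用线性方法感受一下:对观测做PCA,数一数解释95%方差所需的主成分个数。论文的方法基于神经网络且能处理强非线性,这里只是可运行的粗略示意(对非线性嵌入,PCA通常会高估维度):

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))                   # 真实内在维度为2
obs = np.tanh(latent @ rng.normal(size=(2, 50)))     # 非线性嵌入到50维观测空间
obs = obs - obs.mean(0)
s = np.linalg.svd(obs, compute_uv=False)
ratio = np.cumsum(s**2) / np.sum(s**2)               # 累计解释方差比
print("estimated dim:", int(np.searchsorted(ratio, 0.95)) + 1)
```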

【10】 Deformable Registration of Brain MR Images via a Hybrid Loss 标题:基于混合损失的脑MR图像可变形配准 链接:https://arxiv.org/abs/2110.15027

作者:Luyi Han,Haoran Dou,Yunzhi Huang,Pew-Thian Yap 机构:Department of Radiology and Nuclear Medicine, Radboud University Medical Center, Geert Grooteplein, GA, Nijmegen, The Netherlands; Centre for Computational Imaging and Simulation Technologies in Biomedicine (CISTIB), University of Leeds, UK 备注:Ranked fifth on the brain T1w deformable registration task organized by the MICCAI 2021 Learn2Reg challenge 摘要:由于缺乏形变场的真值,无监督学习策略被广泛用于可变形配准模型。这些模型通常依赖基于强度的相似性损失来获得学习收敛。尽管取得了成功,这种依赖仍是不够的:对于单模态图像的可变形配准,配准良好的两幅图像不仅强度差异难以分辨,而且在统计分布和边界区域上也非常接近。考虑到精心设计的损失函数能促使学习模型收敛到理想状态,我们通过一个融合多种图像特征的混合损失,学习T1加权MR图像的可变形配准模型。我们的方法在保持形变平滑的同时,以高精度配准了OASIS数据集。 摘要:Unsupervised learning strategy is widely adopted by the deformable registration models due to the lack of ground truth of deformation fields. These models typically depend on the intensity-based similarity loss to obtain the learning convergence. Despite the success, such dependence is insufficient. For the deformable registration of mono-modality image, well-aligned two images not only have indistinguishable intensity differences, but also are close in the statistical distribution and the boundary areas. Considering that well-designed loss functions can facilitate a learning model into a desirable convergence, we learn a deformable registration model for T1-weighted MR images by integrating multiple image characteristics via a hybrid loss. Our method registers the OASIS dataset with high accuracy while preserving deformation smoothness.
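"通过混合损失融合多种图像特征"可示意为:强度相似项、边界(梯度)一致项与形变平滑正则的加权和。各项的具体形式与权重均为我们的假设,并非论文原始定义:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(warped, fixed, flow, w_sim=1.0, w_edge=0.1, w_smooth=0.01):
    def grad2d(t):   # 沿H、W方向的一阶差分
        return t[..., 1:, :] - t[..., :-1, :], t[..., :, 1:] - t[..., :, :-1]
    sim = F.mse_loss(warped, fixed)                              # 强度相似性
    gw, gf = grad2d(warped), grad2d(fixed)
    edge = F.mse_loss(gw[0], gf[0]) + F.mse_loss(gw[1], gf[1])   # 边界区域一致性
    smooth = sum(g.abs().mean() for g in grad2d(flow))           # 形变场平滑
    return w_sim * sim + w_edge * edge + w_smooth * smooth

loss = hybrid_loss(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64), torch.rand(1, 2, 64, 64))
print(loss.item())
```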

机器翻译,仅供参考
