
Computer Vision arXiv Daily Digest [10.19]

Author: 公众号 arXiv每日学术速递 (WeChat official account)
Published: 2021-10-21 16:24:48

Update! The H5 page now supports collapsible abstracts for a better reading experience. Click "Read the original article" to visit arxivdaily.com, which covers CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, with search, bookmarking, and more.

cs.CV: 151 papers today.

Transformer (7 papers)

【1】 HRFormer: High-Resolution Transformer for Dense Prediction
Link: https://arxiv.org/abs/2110.09408

Authors: Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, Jingdong Wang
Affiliations: University of Chinese Academy of Sciences; Institute of Computing Technology, CAS; Peking University; Microsoft Research Asia; Baidu
Note: Accepted at NeurIPS 2021
Abstract: We present a High-Resolution Transformer (HRT) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer, which produces low-resolution representations and has high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), along with local-window self-attention that performs self-attention over small non-overlapping image windows, to improve memory and computation efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks; e.g., HRT outperforms the Swin Transformer by 1.3 AP on COCO pose estimation with 50% fewer parameters and 30% fewer FLOPs. Code is available at: https://github.com/HRNet/HRFormer.
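The two ingredients named above (self-attention restricted to non-overlapping windows, plus a depthwise 3x3 convolution inside the FFN to bridge the windows) can be illustrated with a minimal PyTorch sketch. This is not the authors' code; the window size, head count, and expansion ratio are assumptions:

```python
import torch
import torch.nn as nn

class LocalWindowBlock(nn.Module):
    # Sketch: self-attention inside non-overlapping windows, plus a 3x3
    # depthwise convolution in the FFN that exchanges information across
    # the otherwise disconnected windows (hyperparameters are assumptions).
    def __init__(self, dim, window=7, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * 4, 1), nn.GELU(),
            nn.Conv2d(dim * 4, dim * 4, 3, padding=1, groups=dim * 4),  # depthwise 3x3
            nn.GELU(), nn.Conv2d(dim * 4, dim, 1))

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by the window size
        B, C, H, W = x.shape
        w = self.window
        # partition the feature map into (B * num_windows, w*w, C) token groups
        t = (x.view(B, C, H // w, w, W // w, w)
              .permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C))
        t = t + self.attn(t, t, t, need_weights=False)[0]
        # merge the windows back into a feature map
        x = (t.view(B, H // w, W // w, w, w, C)
              .permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W))
        return x + self.ffn(x)
```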

【2】 CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification
Link: https://arxiv.org/abs/2110.08994

Authors: Tengfei Liang, Yi Jin, Yajun Gao, Wu Liu, Songhe Feng, Tao Wang, Yidong Li
Affiliations: School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China; JD AI Research, Beijing, China
Note: 11 pages, 7 figures, 7 tables
Abstract: Visible-infrared cross-modality person re-identification is a challenging ReID task that aims to retrieve and match images of the same identity across the heterogeneous visible and infrared modalities. The core of this task is therefore to bridge the huge gap between the two modalities. Existing convolutional neural network-based methods mainly suffer from insufficient perception of the modalities' information and cannot learn good discriminative modality-invariant embeddings for identities, which limits their performance. To solve these problems, we propose a cross-modality transformer-based method (CMTR) for the visible-infrared person re-identification task, which can explicitly mine the information of each modality and generate better discriminative features based on it. Specifically, to capture the modalities' characteristics, we design novel modality embeddings, which are fused with token embeddings to encode the modalities' information. Furthermore, to enhance the representation of the modality embeddings and adjust the distribution of the matching embeddings, we propose a modality-aware enhancement loss based on the learned modality information, which reduces intra-class distance and enlarges inter-class distance. To our knowledge, this is the first work applying a transformer network to the cross-modality re-identification task. We conduct extensive experiments on the public SYSU-MM01 and RegDB datasets, and our proposed CMTR model significantly surpasses existing state-of-the-art CNN-based methods.
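The modality embedding described above plays a role analogous to a position embedding: one learnable vector per modality, fused with every token (fusion by addition is an assumption here). A minimal sketch:

```python
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    # Sketch: one learnable vector per modality (0 = visible, 1 = infrared),
    # added to every token embedding to encode which modality a sample is from.
    def __init__(self, dim=768, n_modalities=2):
        super().__init__()
        self.emb = nn.Parameter(torch.zeros(n_modalities, dim))

    def forward(self, tokens, modality):
        # tokens: (B, N, dim) patch/token embeddings; modality: (B,) int labels
        return tokens + self.emb[modality].unsqueeze(1)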

【3】 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers
Link: https://arxiv.org/abs/2110.08861

Authors: Zai Shi, Zhao Meng, Yiran Xing, Yunpu Ma, Roger Wattenhofer
Affiliations: ETH Zurich; RWTH Aachen; LMU Munich
Abstract: 3D reconstruction aims to reconstruct 3D objects from 2D views. Previous works on 3D reconstruction mainly focus on feature matching between views or use CNNs as backbones. Recently, Transformers have been shown to be effective in multiple applications of computer vision. However, whether Transformers can be used for 3D reconstruction has remained unclear. In this paper, we fill this gap by proposing 3D-RETR, which performs end-to-end 3D REconstruction with TRansformers. 3D-RETR first uses a pretrained Transformer to extract visual features from 2D input images. It then uses another Transformer decoder to obtain the voxel features. A CNN decoder then takes the voxel features as input to obtain the reconstructed objects. 3D-RETR is capable of 3D reconstruction from a single view or multiple views. Experimental results on two datasets show that 3D-RETR reaches state-of-the-art performance on 3D reconstruction. An additional ablation study also demonstrates that 3D-RETR benefits from using Transformers.
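The three-stage pipeline (Transformer encoder over image patch tokens, Transformer decoder over learnable voxel queries, then a 3-D CNN decoder) reads as a straightforward composition; a rough PyTorch sketch follows. All sizes (256-d tokens, 512 voxel queries forming an 8^3 grid upsampled to 32^3) are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ThreeDRETRSketch(nn.Module):
    # Rough pipeline sketch: encoder -> voxel-query decoder -> 3-D CNN decoder.
    def __init__(self, dim=256, n_query=512):  # 512 queries = an 8^3 voxel grid
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True), 6)
        self.queries = nn.Parameter(torch.randn(n_query, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, 8, batch_first=True), 6)
        self.cnn = nn.Sequential(                # upsample 8^3 -> 32^3 occupancy
            nn.ConvTranspose3d(dim, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, 1, 4, 2, 1), nn.Sigmoid())

    def forward(self, patch_tokens):             # (B, N, dim) from the 2D view(s)
        mem = self.encoder(patch_tokens)
        vox = self.decoder(self.queries.expand(len(mem), -1, -1), mem)
        vox = vox.transpose(1, 2).reshape(len(mem), -1, 8, 8, 8)
        return self.cnn(vox)                     # (B, 1, 32, 32, 32) voxel grid
```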

【4】 Siamese Transformer Pyramid Networks for Real-Time UAV Tracking
Link: https://arxiv.org/abs/2110.08822

Authors: Daitao Xing, Nikolaos Evangeliou, Athanasios Tsoukalas, Anthony Tzes
Affiliations: New York University, USA; New York University Abu Dhabi, UAE
Note: 10 pages, 8 figures, accepted by WACV 2022
Abstract: Recent object tracking methods depend upon deep networks or convoluted architectures. Most of those trackers can hardly meet real-time processing requirements on mobile platforms with limited computing resources. In this work, we introduce the Siamese Transformer Pyramid Network (SiamTPN), which inherits the advantages of both CNN and Transformer architectures. Specifically, we exploit the inherent feature pyramid of a lightweight network (ShuffleNetV2) and reinforce it with a Transformer to construct a robust target-specific appearance model. A centralized architecture with lateral cross-attention is developed for building augmented high-level feature maps. To avoid the computation and memory intensity of fusing pyramid representations with the Transformer, we further introduce a pooling attention module, which significantly reduces memory and time complexity while improving robustness. Comprehensive experiments on both aerial and prevalent tracking benchmarks achieve competitive results while operating at high speed, demonstrating the effectiveness of SiamTPN. Moreover, our fastest variant operates at over 30 Hz on a single CPU core and obtains an AUC score of 58.1% on the LaSOT dataset. Source code is available at https://github.com/RISCNYUAD/SiamTPNTracker
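The pooling attention module can be read as ordinary cross-attention whose keys and values are first average-pooled to a fixed small resolution, so the cost grows only with the query length. A hedged sketch (pool size and head count are assumptions):

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    # Sketch: multi-head attention whose keys/values are average-pooled to a
    # fixed spatial size, keeping complexity linear in the query length.
    def __init__(self, dim, heads=4, pool=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool2d(pool)

    def forward(self, q_feat, kv_feat):            # both: (B, C, H, W) maps
        B, C, H, W = q_feat.shape
        q = q_feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
        kv = self.pool(kv_feat).flatten(2).transpose(1, 2) # (B, pool*pool, C)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).view(B, C, H, W)
```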

【5】 ASFormer: Transformer for Action Segmentation
Link: https://arxiv.org/abs/2110.08568

Authors: Fangqiu Yi, Hongyu Wen, Tingting Jiang
Affiliations: NELVT, Department of Computer Science, Peking University, China
Note: Accepted by BMVC 2021
Abstract: Algorithms for the action segmentation task typically use temporal models to predict what action is occurring at each frame of a minute-long daily activity. Recent studies have shown the potential of the Transformer in modeling relations among elements of sequential data. However, several major concerns arise when directly applying the Transformer to the action segmentation task, such as the lack of inductive biases with small training sets, the deficit in processing long input sequences, and the limitation of the decoder architecture in utilizing temporal relations among multiple action segments to refine initial predictions. To address these concerns, we design an efficient Transformer-based model for the action segmentation task, named ASFormer, with three distinctive characteristics: (i) We explicitly bring in local connectivity inductive priors because of the high locality of features. This constrains the hypothesis space within a reliable scope and helps the action segmentation task learn a proper target function with small training sets. (ii) We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences. (iii) We carefully design the decoder to refine the initial predictions from the encoder. Extensive experiments on three public datasets demonstrate the effectiveness of our method. Code is available at https://github.com/ChinaYi/ASFormer.
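One way to realize the local-connectivity prior in (i) is to pair a dilated temporal convolution with attention masked to a local window; the sketch below is one plausible reading of the idea, with the window size a pure assumption:

```python
import torch
import torch.nn as nn

class LocalDilatedBlock(nn.Module):
    # Sketch of the local-connectivity prior: a dilated temporal convolution
    # combined with self-attention restricted to a local temporal window.
    def __init__(self, dim, dilation):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, 3, padding=dilation, dilation=dilation)
        self.attn = nn.MultiheadAttention(dim, 1, batch_first=True)
        self.window = 64                       # hypothetical local window size

    def forward(self, x):                      # x: (B, C, T) frame features
        h = torch.relu(self.conv(x))
        t = h.transpose(1, 2)                  # (B, T, C)
        T = t.size(1)
        idx = torch.arange(T, device=x.device)
        # True entries are disallowed: forbid attention beyond the local window
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        a, _ = self.attn(t, t, t, attn_mask=mask)
        return x + h + a.transpose(1, 2)
```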

【6】 CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans
Link: https://arxiv.org/abs/2110.08721

Authors: Shahin Heidarian, Parnian Afshar, Anastasia Oikonomou, Konstantinos N. Plataniotis, Arash Mohammadi
Affiliations: Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada; Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Canada
Abstract: Lung cancer is the leading cause of mortality from cancer worldwide and has various histologic types, among which Lung Adenocarcinoma (LAUC) has recently been the most prevalent. Lung adenocarcinomas are classified as pre-invasive, minimally invasive, and invasive adenocarcinomas. Timely and accurate knowledge of the invasiveness of lung nodules leads to a proper treatment plan and reduces the risk of unnecessary or late surgeries. Currently, the primary imaging modality to assess and predict the invasiveness of LAUCs is chest CT. Results based on CT images, however, are subjective and suffer from low accuracy compared to the ground-truth pathological reviews provided after surgical resections. In this paper, a predictive transformer-based framework, referred to as the "CAE-Transformer", is developed to classify LAUCs. The CAE-Transformer utilizes a Convolutional Auto-Encoder (CAE) to automatically extract informative features from CT slices, which are then fed to a modified transformer model to capture global inter-slice relations. Experimental results on our in-house dataset of 114 pathologically proven Sub-Solid Nodules (SSNs) demonstrate the superiority of the CAE-Transformer over histogram/radiomics-based models and its deep learning-based counterparts, achieving an accuracy of 87.73%, sensitivity of 88.67%, specificity of 86.33%, and AUC of 0.913 using 10-fold cross-validation.

【7】 COVID-19 Detection in Chest X-ray Images Using Swin-Transformer and Transformer in Transformer
Link: https://arxiv.org/abs/2110.08427

Authors: Juntao Jiang, Shuyi Lin
Affiliations: Hangzhou, China; Khoury College of Computer Sciences, Northeastern University, Boston, United States
Note: Keywords: COVID-19, Chest X-ray Images, Swin-Transformer, Transformer in Transformer, Model Ensembling, Image Classification
Abstract: The Coronavirus Disease 2019 (COVID-19) has spread globally and caused serious damage. Chest X-ray images are widely used for COVID-19 diagnosis, and Artificial Intelligence methods can help increase its efficiency and accuracy. In the Challenge of Chest XR COVID-19 detection at the Ethics and Explainability for Responsible Data Science (EE-RDS) conference 2021, we proposed a method that combines the Swin Transformer and Transformer in Transformer to classify chest X-ray images into three classes: COVID-19, Pneumonia, and Normal (healthy), and achieved 0.9475 accuracy on the test set.
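The described combination amounts to ensembling two classifiers, for example by averaging their softmax outputs. A sketch using timm; the specific model variants chosen here are assumptions (the abstract does not name them), and in practice both models would first be fine-tuned on the challenge data:

```python
import timm
import torch

# Sketch: average the class probabilities of a Swin and a TNT model
# (variant names are assumptions; weights here are only ImageNet-pretrained).
models = [timm.create_model(n, pretrained=True, num_classes=3).eval()
          for n in ("swin_base_patch4_window7_224", "tnt_s_patch16_224")]

@torch.no_grad()
def predict(x):                       # x: (B, 3, 224, 224) chest X-ray batch
    probs = [m(x).softmax(dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)   # COVID-19 / Pneumonia / Normal
```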

Detection (8 papers)

【1】 Asymmetric Modality Translation For Face Presentation Attack Detection
Link: https://arxiv.org/abs/2110.09108

Authors: Zhi Li, Haoliang Li, Xin Luo, Yongjian Hu, Kwok-Yan Lam, Alex C. Kot
Abstract: Face presentation attack detection (PAD) is an essential measure to protect face recognition systems from being spoofed by malicious users and has attracted great attention from both academia and industry. Although most existing methods achieve the desired performance to some extent, the generalization issue of face presentation attack detection under cross-domain settings (e.g., unseen attacks and varying illumination) remains to be solved. In this paper, we propose a novel framework based on asymmetric modality translation for face presentation attack detection in bi-modality scenarios. Under this framework, we establish a connection between the two modality images of genuine faces. Specifically, a novel modality fusion scheme is presented in which the image of one modality is translated to the other through an asymmetric modality translator and then fused with its corresponding paired image. The fusion result is fed as input to a discriminator for inference. The training of the translator is supervised by an asymmetric modality translation loss. Besides, an illumination normalization module based on the Pattern of Local Gravitational Force (PLGF) representation is used to reduce the impact of illumination variation. We conduct extensive experiments on three public datasets, which validate that our method is effective in detecting various types of attacks and achieves state-of-the-art performance under different evaluation protocols.

【2】 Unsupervised Shot Boundary Detection for Temporal Segmentation of Long Capsule Endoscopy Videos
Link: https://arxiv.org/abs/2110.09067

Authors: Sodiq Adewole, Philip Fernandes, James Jablonski, Andrew Copland, Michael Porter, Sana Syed, Donald Brown
Affiliations: Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA, USA; Department of Pediatrics, School of Medicine, University of Virginia, Charlottesville, VA, USA
Abstract: Physicians use Capsule Endoscopy (CE) as a non-invasive and non-surgical procedure to examine the entire gastrointestinal (GI) tract for diseases and abnormalities. A single CE examination can last between 8 and 11 hours, generating up to 80,000 frames, which are compiled into a video. Physicians have to review and analyze the entire video to identify abnormalities or diseases before making a diagnosis. This review task can be very tedious, time-consuming, and prone to error. While as little as a single frame may capture useful content relevant to the physician's final diagnosis, the frames covering the small bowel region alone can number as many as 50,000. To minimize physicians' review time and effort, this paper proposes a novel unsupervised and computationally efficient temporal segmentation method to automatically partition long CE videos into homogeneous and identifiable video segments. However, searching for temporal boundaries in a long video using a high-dimensional frame-feature matrix is computationally prohibitive and impracticable for real clinical applications. Therefore, leveraging both spatial and temporal information in the video, we first extract high-level frame features using a pretrained CNN model and then project the high-dimensional frame-feature matrix to a lower 1-dimensional embedding. Using this 1-dimensional sequence embedding, we apply the Pruned Exact Linear Time (PELT) algorithm to search for temporal boundaries that indicate the transition points from normal to abnormal frames and vice versa. We experimented with multiple real patients' CE videos, and our model achieved an AUC of 66% on multiple test videos against expert-provided labels.
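The pipeline (pretrained-CNN frame features, a 1-D projection, then a PELT change-point search) can be sketched with the ruptures library. Projecting onto the first principal component is one plausible choice of 1-D embedding, not necessarily the paper's:

```python
import numpy as np
import ruptures as rpt

def temporal_segments(features, penalty=10):
    # features: (n_frames, d) frame embeddings from a pretrained CNN.
    # Project the high-dimensional features to a 1-D sequence via the
    # first principal component (an illustrative choice of embedding).
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    signal = centered @ vt[0]
    # PELT searches for change points in the 1-D embedding
    algo = rpt.Pelt(model="rbf").fit(signal.reshape(-1, 1))
    return algo.predict(pen=penalty)   # indices of segment boundaries
```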

【3】 Discovery-and-Selection: Towards Optimal Multiple Instance Learning for Weakly Supervised Object Detection
Link: https://arxiv.org/abs/2110.09060

Authors: Shiwei Zhang, Wei Ke, Lin Yang, Qixiang Ye, Xiaopeng Hong, Yihong Gong, Tong Zhang
Abstract: Weakly supervised object detection (WSOD) is a challenging task that requires simultaneously learning object classifiers and estimating object locations under the supervision of image category labels. A major line of WSOD methods is rooted in multiple instance learning, which regards images as bags of instances and selects positive instances from each bag to learn the detector. However, a grand challenge emerges when the detector inclines to converge to discriminative parts of objects rather than whole objects. In this paper, under the hypothesis that optimal solutions are included in the local minima, we propose a discovery-and-selection approach fused with multiple instance learning (DS-MIL), which finds rich local minima and selects optimal solutions from among them. To implement DS-MIL, an attention module is designed so that more context information can be captured by feature maps and more valuable proposals can be collected during training. Given the proposal candidates, a re-ranking module is designed to select informative instances for object detector training. Experimental results on commonly used benchmarks show that our proposed DS-MIL approach consistently improves the baselines, reporting state-of-the-art performance.

【4】 On the Effect of Selfie Beautification Filters on Face Detection and Recognition
Link: https://arxiv.org/abs/2110.08934

Authors: Pontus Hedman, Vasilios Skepetzis, Kevin Hernandez-Diaz, Josef Bigun, Fernando Alonso-Fernandez
Affiliations: School of Information Science, Computer and Electrical Engineering, Halmstad University, Sweden
Abstract: Beautification and augmented reality filters are very popular in applications that use selfie images captured with smartphones or personal devices. However, they can distort or modify biometric features, severely affecting the capability of recognizing an individual's identity or even detecting the face. Accordingly, we address the effect of such filters on the accuracy of automated face detection and recognition. The social media image filters studied either modify the image contrast or illumination, or occlude parts of the face with, for example, artificial glasses or animal noses. We observe that the effect of some of these filters is harmful to both face detection and identity recognition, especially if they obfuscate the eyes or (to a lesser extent) the nose. To counteract such effects, we develop a method to reconstruct the applied manipulation with a modified version of the U-NET segmentation network. This is observed to contribute to better face detection and recognition accuracy. From a recognition perspective, we employ distance measures and trained machine learning algorithms applied to features extracted with a ResNet-34 network trained to recognize faces. We also evaluate whether incorporating filtered images into the training set of the machine learning approaches is beneficial for identity recognition. Our results show good recognition when filters do not occlude important landmarks, especially the eyes (identification accuracy >99%, EER <2%). The combined effect of the proposed approaches also mitigates the effects produced by filters that occlude parts of the face, achieving an identification accuracy of >92% with the majority of perturbations evaluated and an EER of <8%. Although there is room for improvement, when neither U-NET reconstruction nor training with filtered images is applied, the accuracy with filters that severely occlude the eyes is <72% (identification) and >12% (EER).

【5】 A Deep Learning-based Approach for Real-time Facemask Detection
Link: https://arxiv.org/abs/2110.08732

Authors: Wadii Boulila, Ayyub Alzahem, Aseel Almoudi, Muhanad Afifi, Ibrahim Alturki, Maha Driss
Affiliations: Robotics and Internet-of-Things Lab, Prince Sultan University, Riyadh, Saudi Arabia; National School of Computer Sciences, University of Manouba, Manouba, Tunisia; College of Computer Science, Taibah University, Medina, Saudi Arabia; Security Engineering Lab
Abstract: The COVID-19 pandemic is causing a global health crisis. Public spaces need to be safeguarded from the adverse effects of this pandemic. Wearing a facemask has become one of the effective protection solutions adopted by many governments. Manual real-time monitoring of facemask wearing for a large group of people is becoming a difficult task. The goal of this paper is to use deep learning (DL), which has shown excellent results in many real-life applications, to ensure efficient real-time facemask detection. The proposed approach is based on two steps: an offline step aiming to create a DL model that is able to detect and locate facemasks and determine whether they are worn appropriately, and an online step that deploys the DL model at the edge in order to detect masks in real time. In this study, we propose to use MobileNetV2 to detect facemasks in real time. Several experiments were conducted and show good performance of the proposed approach (99% training and testing accuracy). In addition, several comparisons with state-of-the-art models, namely ResNet50, DenseNet, and VGG16, show good performance of MobileNetV2 in terms of training time and accuracy.
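The offline step, fine-tuning MobileNetV2 for mask classification, follows the standard transfer-learning recipe. A minimal Keras sketch, assuming a two-class head (mask / no-mask) and 224x224 inputs:

```python
import tensorflow as tf

# Sketch of the offline step: fine-tune MobileNetV2 on face crops.
base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                         input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                  # freeze the ImageNet backbone first
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(2, activation="softmax"),  # mask / no-mask (assumed)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=20)
```

After the head converges, the top backbone layers can be unfrozen at a lower learning rate, which is the usual second stage of this recipe.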

【6】 Trigger Hunting with a Topological Prior for Trojan Detection
Link: https://arxiv.org/abs/2110.08335

Authors: Xiaoling Hu, Xiao Lin, Michael Cogswell, Yi Yao, Susmit Jha, Chao Chen
Affiliations: Stony Brook University; SRI International
Note: 16 pages, 9 figures
Abstract: Despite their success and popularity, deep neural networks (DNNs) are vulnerable to backdoor attacks. This impedes their wider adoption, especially in mission-critical applications. This paper tackles the problem of Trojan detection, namely, identifying Trojaned models -- models trained with poisoned data. One popular approach is reverse engineering, i.e., recovering the triggers on a clean image by manipulating the model's prediction. One major challenge of the reverse engineering approach is the enormous search space of triggers. To this end, we propose innovative priors such as diversity and topological simplicity to not only increase the chances of finding the appropriate triggers but also improve the quality of the found triggers. Moreover, by encouraging a diverse set of trigger candidates, our method can perform effectively in cases with unknown target labels. We demonstrate that these priors can significantly improve the quality of the recovered triggers, resulting in substantially improved Trojan detection accuracy, as validated on both synthetic and publicly available TrojAI benchmarks.
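Reverse engineering with a diversity prior can be sketched as jointly optimizing several trigger/mask candidates while penalizing their pairwise similarity. The loss form below is a hypothetical illustration of the idea, not the paper's exact objective; images are assumed normalized to [0, 1]:

```python
import torch
import torch.nn.functional as F

def recover_triggers(model, images, target, num=4, steps=200, lam=0.1):
    # Sketch: optimize `num` trigger/mask candidates so each flips the model
    # to `target`, with a diversity prior pushing the candidates apart.
    triggers = torch.rand(num, *images.shape[1:], requires_grad=True)
    masks = torch.full((num, 1, *images.shape[2:]), -2.0, requires_grad=True)
    opt = torch.optim.Adam([triggers, masks], lr=0.05)
    tgt = torch.full((len(images),), target, dtype=torch.long)
    for _ in range(steps):
        m = masks.sigmoid()
        loss = 1e-3 * m.sum()                     # keep the triggers small
        for i in range(num):
            stamped = (1 - m[i]) * images + m[i] * triggers[i].sigmoid()
            loss = loss + F.cross_entropy(model(stamped), tgt)
        flat = triggers.sigmoid().flatten(1)
        sim = (flat @ flat.t()).triu(diagonal=1)  # pairwise candidate similarity
        loss = loss + lam * sim.sum()             # diversity prior (hypothetical form)
        opt.zero_grad(); loss.backward(); opt.step()
    return triggers.sigmoid().detach(), masks.sigmoid().detach()
```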

【7】 Automatic Detection of COVID-19 and Pneumonia from Chest X-Ray using Deep Learning
Link: https://arxiv.org/abs/2110.09384

Authors: Sarath Pathari
Abstract: In this study, a dataset of X-ray images from patients with common viral pneumonia, bacterial pneumonia, and confirmed COVID-19 disease was utilized for the automatic detection of the coronavirus disease. The aim of the study is to assess the performance of state-of-the-art convolutional neural network architectures proposed in recent years for medical image classification; in particular, the technique known as transfer learning was adopted. With transfer learning, detecting various abnormalities in small medical image datasets is an achievable goal, often yielding remarkable results. The dataset used in this experiment is a collection of 24,000 X-ray images, including 6,000 images of confirmed COVID-19 disease, 6,000 of confirmed common bacterial pneumonia, and 6,000 of normal conditions. The data was gathered and augmented from the X-ray images available in open medical repositories. The outcomes suggest that deep learning with X-ray imaging may extract noteworthy biomarkers associated with COVID-19 disease, with the best accuracy, sensitivity, and specificity obtained being 97.83%, 96.81%, and 98.56%, respectively.

【8】 Deep learning-based detection of intravenous contrast in computed tomography scans
Link: https://arxiv.org/abs/2110.08424

Authors: Zezhong Ye, Jack M. Qian, Ahmed Hosny, Roman Zeleznik, Deborah Plana, Jirapat Likitlersuang, Zhongyi Zhang, Raymond H. Mak, Hugo J. W. L. Aerts, Benjamin H. Kann
Affiliations: Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School; Department of Radiation Oncology, Dana-Farber Cancer Institute and Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
Abstract: Purpose: Identifying intravenous (IV) contrast use within CT scans is a key component of data curation for model development and testing. Currently, IV contrast is poorly documented in imaging metadata and necessitates manual correction and annotation by clinician experts, presenting a major barrier to imaging analyses and algorithm deployment. We sought to develop and validate a convolutional neural network (CNN)-based deep learning (DL) platform to identify IV contrast within CT scans. Methods: For model development and evaluation, we used independent datasets of CT scans of head-and-neck (HN) and lung cancer patients, totaling 133,480 axial 2D scan slices from 1,979 CT scans manually annotated for contrast presence by clinical experts. Five different DL models were adopted and trained on the HN training datasets for slice-level contrast detection. Model performance was evaluated on a hold-out set and on an independent validation set from another institution. The DL model was then fine-tuned on chest CT data and externally validated on a separate chest CT dataset. Results: Initial DICOM metadata tags for IV contrast were missing or erroneous in 1,496 scans (75.6%). The EfficientNetB4-based model showed the best overall detection performance. For HN scans, the AUC was 0.996 on the internal validation set (n = 216) and 1.0 on the external validation set (n = 595). The fine-tuned model on chest CTs yielded an AUC of 1.0 on the internal validation set (n = 53) and 0.980 on the external validation set (n = 402). Conclusion: The DL model could accurately detect IV contrast in both HN and chest CT scans with near-perfect performance.

Classification & Recognition (16 papers)

【1】 Deep CNNs for Peripheral Blood Cell Classification
Link: https://arxiv.org/abs/2110.09508

Authors: Ekta Gavas, Kaustubh Olpadkar
Affiliations: IIIT Hyderabad, India; Stony Brook University, NY, USA
Note: 20 pages, 14 figures, submitted to MIDL 2021
Abstract: The application of machine learning techniques to the medical domain is especially challenging due to the required level of precision and the huge risks incurred by minute errors. Employing these techniques in the more complex subdomain of hematological diagnosis seems quite promising, with automatic identification of blood cell types, which can help in the detection of hematologic disorders. In this paper, we benchmark 27 popular deep convolutional neural network architectures on a microscopic peripheral blood cell image dataset. The dataset is publicly available, with a large number of normal peripheral blood cells acquired using the CellaVision DM96 analyzer and identified by expert pathologists into eight different cell types. We fine-tune state-of-the-art image classification models pre-trained on the ImageNet dataset for blood cell classification. We exploit data augmentation techniques during training to avoid overfitting and achieve generalization. An ensemble of the top-performing models obtains significant improvements over past published work, achieving state-of-the-art results with a classification accuracy of 99.51%. Our work provides empirical baselines and benchmarks on standard deep learning architectures for the microscopic peripheral blood cell recognition task.

【2】 Gait-based Human Identification through Minimum Gait-phases and Sensors
Link: https://arxiv.org/abs/2110.09286

Authors: Muhammad Zeeshan Arshad, Dawoon Jung, Mina Park, Kyung-Ryoul Mun, Jinwook Kim
Note: Accepted at the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2021)
Abstract: Human identification is one of the most common and critical tasks for condition monitoring, human-machine interaction, and providing assistive services in smart environments. Recently, human gait has gained new attention as a biometric for identification, enabling contactless identification from a distance that is robust to physical appearance. However, an important aspect of gait identification through wearables and image-based systems alike is accurate identification when limited information is available, for example, when only a fraction of the whole gait cycle or only a part of the subject's body is visible. In this paper, we present a gait identification technique based on temporal and descriptive statistical parameters of different gait phases as features, and we investigate the performance of using only a single gait phase for the identification task with a minimum number of sensors. We show that it is possible to achieve a high accuracy of over 95.5% by monitoring a single phase of the whole gait cycle through only a single sensor. We also show that the proposed methodology can achieve 100% identification accuracy when the whole gait cycle is monitored through pelvis and foot sensors combined. The ANN was found to be more robust to fewer data features than the SVM and was concluded to be the best machine learning algorithm for this purpose.
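The feature recipe, temporal plus descriptive statistics of a single gait phase from one sensor, maps naturally onto a small scikit-learn experiment. The specific statistics below are illustrative assumptions, not the paper's exact feature set:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def phase_features(signal, duration):
    # Sketch: temporal (phase duration) and descriptive statistics of one
    # sensor channel over a single gait phase (illustrative choices).
    return np.array([duration, signal.mean(), signal.std(),
                     signal.min(), signal.max(),
                     np.percentile(signal, 25), np.percentile(signal, 75)])

def evaluate(X, y):
    # X: (n_samples, n_features) stacked phase features; y: subject identities.
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000)
    return cross_val_score(clf, X, y, cv=5).mean()
```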

【3】 A Lightweight and Accurate Recognition Framework for Signs of X-ray Weld Images
Link: https://arxiv.org/abs/2110.09278

Authors: Moyun Liu, Jingming Xie, Jing Hao, Yang Zhang, Xuzhan Chen, Youping Chen
Affiliations: School of Mechanical Science and Engineering, Huazhong University of Science and Technology; School of Mechanical Engineering, Hubei University of Technology, Wuhan, China
Abstract: X-ray images are commonly used to ensure the security of devices in the quality inspection industry. The recognition of signs printed on X-ray weld images plays an essential role in the digital traceability systems of the manufacturing industry. However, the scales of objects vary greatly in weld images, which hinders satisfactory recognition. In this paper, we propose a sign recognition framework based on convolutional neural networks (CNNs) for weld images. The proposed framework first contains a shallow classification network for correcting the pose of images. Moreover, we present a novel spatial and channel enhancement (SCE) module to address the above scale problem. This module can integrate multi-scale features and adaptively assign weights to each feature source. Based on the SCE module, a narrow network is designed for final weld information recognition. To enhance the practicability of our framework, we carefully design its architecture with few parameters and computations. Experimental results show that our framework achieves 99.7% accuracy with 1.1 giga floating-point operations (GFLOPs) in the classification stage, and 90.0 mean average precision (mAP) at 176.1 frames per second (FPS) in the recognition stage.

【4】 Learning Optimal Conformal Classifiers
Link: https://arxiv.org/abs/2110.09192

Authors: David Stutz, Krishnamurthy Dvijotham, Ali Taylan Cemgil, Arnaud Doucet
Affiliations: DeepMind; Max Planck Institute for Informatics, Saarland Informatics Campus
Abstract: Modern deep learning-based classifiers show very high accuracy on test data, but this does not provide sufficient guarantees for safe deployment, especially in high-stakes AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's probability estimates to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies for differentiating through CP during training, with the goal of training the model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. We show that ConfTr outperforms state-of-the-art CP methods for classification by reducing the average confidence set size (inefficiency). Moreover, it allows us to "shape" the confidence sets predicted at test time, which is difficult for standard CP. In experiments on several datasets, we show that ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.
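The "simulate conformalization on mini-batches" idea can be sketched by splitting each batch into a calibration half and a prediction half, computing a coverage threshold on the first, and penalizing a smooth (sigmoid-relaxed) confidence-set size on the second. This is a heavily simplified sketch; the paper's smooth quantile and loss terms differ in detail:

```python
import torch
import torch.nn.functional as F

def conftr_loss(logits, labels, alpha=0.1, temp=0.1):
    # Sketch of conformal training: calibrate a threshold on half the batch,
    # then penalize the expected (smooth) confidence-set size on the rest.
    B = logits.size(0) // 2
    probs = F.softmax(logits, dim=1)
    cal, pred = probs[:B], probs[B:]
    # conformity score: probability of the true class on the calibration half
    scores = cal.gather(1, labels[:B, None]).squeeze(1)
    tau = torch.quantile(scores, alpha)            # (1 - alpha) coverage threshold
    # smooth set membership: sigmoid in place of a hard indicator
    membership = torch.sigmoid((pred - tau) / temp)
    size_loss = membership.sum(dim=1).mean()       # expected confidence-set size
    return size_loss + F.cross_entropy(logits[B:], labels[B:])
```

Because every step (including the quantile) is differentiable, the size penalty back-propagates into the classifier, which is the point of training with the conformal wrapper end-to-end.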

【5】 Domain Generalisation for Apparent Emotional Facial Expression Recognition across Age-Groups
Link: https://arxiv.org/abs/2110.09168

Authors: Rafael Poyiadzi, Jie Shen, Stavros Petridis, Yujiang Wang, Maja Pantic
Affiliations: University of Bristol, UK; Facebook AI Applied Research, UK; Imperial College London, UK
Abstract: Apparent emotional facial expression recognition has attracted a lot of research attention recently. However, the majority of approaches ignore age differences and train a generic model for all ages. In this work, we study the effect of using different age groups when training apparent emotional facial expression recognition models. To this end, we study domain generalisation in the context of apparent emotional facial expression recognition from facial imagery across different age groups. We first compare several domain generalisation algorithms on the basis of out-of-domain generalisation and observe that the Class-Conditional Domain-Adversarial Neural Networks (CDANN) algorithm has the best performance. We then study the effect of the variety and number of age groups used during training on generalisation to unseen age groups, and observe that an increase in the number of training age groups tends to improve apparent emotional facial expression recognition performance on unseen age groups. We also show that excluding an age group during training tends to affect the performance of the neighbouring age groups more.

【6】 Abnormal Occupancy Grid Map Recognition using Attention Network
Link: https://arxiv.org/abs/2110.09047

Authors: Fuqin Deng, Hua Feng, Mingjian Liang, Qi Feng, Ningbo Yi, Yong Yang, Yuan Gao, Junfeng Chen, Tin Lun Lam
Affiliations: School of Intelligent Manufacturing, Wuyi University; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen (the first two authors contributed equally)
Abstract: The occupancy grid map is a critical component of autonomous positioning and navigation in mobile robotic systems, as the performance of many other systems depends heavily on it. To guarantee the quality of occupancy grid maps, researchers previously had to perform tedious manual inspection for a long time. This work focuses on automatic abnormal occupancy grid map recognition using residual neural networks and a novel attention mechanism module. We propose an effective channel and spatial Residual SE (csRSE) attention module, which contains a residual block for producing hierarchical features, followed by both a channel SE (cSE) block and a spatial SE (sSE) block for sufficient information extraction along the channel and spatial pathways. To further summarize the characteristics of occupancy grid maps and experiment with our csRSE attention modules, we constructed a dataset called the occupancy grid map dataset (OGMD). On this OGMD test dataset, we tested a few variants of our proposed structure and compared them with other attention mechanisms. Our experimental results show that the proposed attention network can infer abnormal maps with a state-of-the-art (SOTA) accuracy of 96.23% for abnormal occupancy grid map recognition.
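The csRSE module composes three well-known pieces: a residual block, a channel squeeze-and-excitation (cSE) branch, and a spatial squeeze-and-excitation (sSE) branch. A sketch follows; the reduction ratio and the way the two branches are combined (summation) are assumptions:

```python
import torch
import torch.nn as nn

class csRSE(nn.Module):
    # Sketch: residual block, then channel-SE and spatial-SE recalibration.
    def __init__(self, c, r=16):
        super().__init__()
        self.res = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))
        self.cse = nn.Sequential(nn.AdaptiveAvgPool2d(1),        # channel squeeze
                                 nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c // r, c, 1), nn.Sigmoid())
        self.sse = nn.Sequential(nn.Conv2d(c, 1, 1), nn.Sigmoid())  # spatial squeeze

    def forward(self, x):
        h = torch.relu(self.res(x) + x)            # residual features
        return h * self.cse(h) + h * self.sse(h)   # channel + spatial excitation
```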

【7】 NYU-VPR: Long-Term Visual Place Recognition Benchmark with View Direction and Data Anonymization Influences
Link: https://arxiv.org/abs/2110.09004

Authors: Diwei Sheng, Yuxiang Chai, Xinru Li, Chen Feng, Jianzhe Lin, Claudio Silva, John-Ross Rizzo
Note: 7 pages, 10 figures, published in the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021)
Abstract: Visual place recognition (VPR) is critical not only for localization and mapping in autonomous driving vehicles, but also for assistive navigation for the visually impaired. To enable a long-term VPR system on a large scale, several challenges need to be addressed. First, different applications may require different image view directions, such as front views for self-driving cars and side views for people with low vision. Second, VPR in metropolitan scenes can often raise privacy concerns due to the imaging of pedestrian and vehicle identity information, calling for data anonymization before VPR queries and database construction. Both factors can lead to VPR performance variations that are not yet well understood. To study their influence, we present the NYU-VPR dataset, which contains more than 200,000 images over a 2 km by 2 km area near the New York University campus, taken throughout 2016. We present benchmark results for several popular VPR algorithms, showing that side views are significantly more challenging for current VPR methods while the influence of data anonymization is almost negligible, together with our hypothetical explanations and in-depth analysis.

【8】 Alleviating Noisy-label Effects in Image Classification via Probability Transition Matrix
Link: https://arxiv.org/abs/2110.08866

Authors: Ziqi Zhang, Yuexiang Li, Hongxin Wei, Kai Ma, Tao Xu, Yefeng Zheng
Affiliations: Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, Shenzhen, China; Tencent Jarvis Lab; School of Computer Science and Engineering, Nanyang Technological University, Singapore
Abstract: Deep-learning-based image classification frameworks often suffer from the noisy-label problem caused by inter-observer variation. Recent studies employed learning-to-learn paradigms (e.g., Co-teaching and JoCoR) to filter samples with noisy labels from the training set. However, most of them use a simple cross-entropy loss as the criterion for noisy-label identification. Hard samples, which are beneficial for classifier learning, are often mistakenly treated as noise in such a setting, since both hard samples and those with noisy labels lead to relatively larger loss values than easy cases. In this paper, we propose a plug-in module, namely the noise ignoring block (NIB), consisting of a probability transition matrix and an inter-class correlation (IC) loss, to separate hard samples from mislabeled ones and further boost the accuracy of image classification networks trained with noisy labels. Concretely, our IC loss is calculated as the Kullback-Leibler divergence between the network prediction and the accumulative soft label generated by the probability transition matrix. Thus, hard cases, which have lower IC loss values, can be easily distinguished from mislabeled ones. Extensive experiments are conducted on natural and medical image datasets (CIFAR-10 and ISIC 2019). The experimental results show that our NIB module consistently improves the performance of state-of-the-art robust training methods.
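The IC loss is a KL divergence between the network prediction and an accumulative soft label produced via the probability transition matrix. A sketch of one plausible formulation; the accumulation rule (momentum update through the transition matrix) and the KL direction are assumptions, not the paper's stated equations:

```python
import torch
import torch.nn.functional as F

def ic_loss(logits, soft_labels, transition, momentum=0.9):
    # logits: (B, C) network outputs for a batch
    # soft_labels: (B, C) running (accumulative) soft labels for these samples
    # transition: (C, C) row-stochastic inter-class transition matrix
    pred = F.softmax(logits, dim=1)
    # accumulate the prediction propagated through the transition matrix
    # (hypothetical update rule for the "accumulative soft label")
    soft_labels = momentum * soft_labels + (1 - momentum) * (pred @ transition)
    soft_labels = soft_labels / soft_labels.sum(dim=1, keepdim=True)
    loss = F.kl_div(pred.log(), soft_labels.detach(), reduction="batchmean")
    return loss, soft_labels
```

Under this reading, a sample with high cross-entropy but low IC loss is a hard example worth keeping, while one with high values for both is likely mislabeled.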

【9】 Self-Supervised Learning for Binary Networks by Joint Classifier Training
Link: https://arxiv.org/abs/2110.08851

Authors: Dahyun Kim, Jonghyun Choi
Affiliations: Gwangju Institute of Science and Technology (GIST), South Korea; NAVER AI Lab
Abstract: Despite the great success of self-supervised learning with large floating-point networks, such networks are not readily deployable to edge devices. To accelerate the deployment of models to edge devices for various downstream tasks via unsupervised representation learning, we propose a self-supervised learning method for binary networks. In particular, we propose to use a randomly initialized classifier attached to a pretrained floating-point feature extractor as the target, and to train it jointly with a binary network. For better training of the binary network, we propose a feature similarity loss, a dynamic balancing scheme for the loss terms, and modified multi-stage training. We call our method BSSL. Our empirical validation shows that BSSL outperforms self-supervised learning baselines for binary networks on various downstream tasks and outperforms supervised pretraining on certain tasks.

【10】 TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial Decoding
Link: https://arxiv.org/abs/2110.08814

Authors: Zhengwei Wang, Qi She, Aljosa Smolic
Affiliations: V-SENSE, School of Computer Science and Statistics, Trinity College Dublin, Ireland; ByteDance, China
Note: To appear in BMVC 2021
Abstract: Most existing video action recognition models ingest raw RGB frames. However, the raw video stream requires enormous storage and contains significant temporal redundancy. Video compression (e.g., H.264, MPEG-4) reduces superfluous information by representing the raw video stream using the concept of a Group of Pictures (GOP). Each GOP is composed of a first I-frame (aka an RGB image) followed by a number of P-frames, represented by motion vectors and residuals, which can be regarded and used as pre-extracted features. In this work, we 1) introduce sampling the network input from partially decoded videos at the GOP level, and 2) propose a plug-and-play mulTi-modal lEArning Module (TEAM) for training the network using information from I-frames and P-frames in an end-to-end manner. We demonstrate the superior performance of TEAM-Net compared to the baseline using RGB only. TEAM-Net also achieves state-of-the-art performance in video action recognition with partial decoding. Code is provided at https://github.com/villawang/TEAM-Net.

【11】 Towards Language-guided Visual Recognition via Dynamic Convolutions
Link: https://arxiv.org/abs/2110.08797

Authors: Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Xinghao Ding, Yongjian Wu, Feiyue Huang, Yue Gao, Rongrong Ji
Affiliations: Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University, China; Youtu Lab, Tencent; Software School of Tsinghua University
Abstract: In this paper, we are committed to establishing a unified and end-to-end multi-modal network by exploring language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called Language-dependent Convolution (LaConv). Its convolution kernels are dynamically generated based on natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build the first fully language-driven convolution network, termed LaConvNet, which can unify visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on four benchmark datasets for two vision-and-language tasks, i.e., visual question answering (VQA) and referring expression comprehension (REC). The experimental results not only show the performance gains of LaConv compared to existing multi-modal modules, but also demonstrate the merits of LaConvNet as a unified network, including its compactness, high generalization ability, and excellent performance, e.g., +4.7% on RefCOCO+.
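A dynamic convolution whose kernels are generated from a language vector can be implemented with grouped convolution, one depthwise kernel per channel per sample. A minimal sketch of the idea (not the paper's exact design; the depthwise restriction and dimensions are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaConvSketch(nn.Module):
    # Sketch: generate per-sample depthwise conv kernels from a language vector.
    def __init__(self, channels, k=3, lang_dim=512):
        super().__init__()
        self.channels, self.k = channels, k
        self.gen = nn.Linear(lang_dim, channels * k * k)

    def forward(self, x, lang):          # x: (B, C, H, W); lang: (B, lang_dim)
        B, C, H, W = x.shape
        # one k x k kernel per channel per sample, predicted from language
        w = self.gen(lang).view(B * C, 1, self.k, self.k)
        # grouped convolution applies each sample's kernels to its own channels
        x = x.reshape(1, B * C, H, W)
        out = F.conv2d(x, w, padding=self.k // 2, groups=B * C)
        return out.view(B, C, H, W)
```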

【12】 Robust Pedestrian Attribute Recognition Using Group Sparsity for Occlusion Videos
Link: https://arxiv.org/abs/2110.08708

Authors: Geonu Lee, Kimin Yun, Jungchan Cho
Affiliations: College of Information Technology, Gachon University, Seongnam-si, Gyeonggi-do, South Korea; Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute
Note: 35 pages, 9 figures
Abstract: Occlusion processing is a key issue in pedestrian attribute recognition (PAR). Nevertheless, several existing video-based PAR methods have not yet considered occlusion handling in depth. In this paper, we formulate finding non-occluded frames as sparsity-based temporal attention over a crowded video. In this manner, the model is guided not to pay attention to occluded frames. However, temporal sparsity cannot capture correlations between attributes when occlusion occurs. For example, "boots" and "shoe color" cannot be recognized when the feet are invisible. To solve this uncorrelated-attention issue, we also propose a novel group sparsity-based temporal attention module. Group sparsity is applied across the attention weights of correlated attributes, so that attention weights in a group are forced to attend to the same frames. Experimental results show that the proposed method achieved a higher F1-score than state-of-the-art methods on two video-based PAR datasets and five occlusion scenarios.
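The group sparsity idea, forcing the attention weights of correlated attributes to share temporal support, can be written as an L2,1-style penalty over attribute groups. One plausible reading (the exact norm and grouping are assumptions):

```python
import torch

def group_sparsity_loss(attn, groups):
    # attn: (num_attributes, T) temporal attention weights per attribute
    # groups: lists of correlated attribute indices, e.g. [[0, 3], [1, 2, 4]]
    loss = attn.new_zeros(())
    for g in groups:
        # L2 across the attributes in a group, L1 over time: nonzero attention
        # is encouraged on the same (few) frames for the whole group.
        loss = loss + attn[g].pow(2).sum(dim=0).add(1e-8).sqrt().sum()
    return loss
```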

【13】 Mapping illegal waste dumping sites with neural-network classification of satellite imagery
Link: https://arxiv.org/abs/2110.08599

Authors: Maria Roberta Devesa, H. Antonio Vazquez Brust
Affiliations: Dymaxion Labs, Buenos Aires, Argentina; Fundación Bunge y Born
Note: 5 pages, 3 figures, KDD Workshop on Data-driven Humanitarian Mapping, held with the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 14, 2021
Abstract: Public health and habitat quality are crucial goals of urban planning. In recent years, the severe social and environmental impact of illegal waste dumping sites has made them one of the most serious problems faced by cities in the Global South, in a context of scarce information available for decision making. To help identify the location of dumping sites and track their evolution over time, we adopt a data-driven model from the machine learning domain, analyzing satellite images. This allows us to take advantage of the increasing availability of geo-spatial open data, high-resolution satellite imagery, and open-source tools to train machine learning algorithms with a small set of known waste dumping sites in Buenos Aires, and then predict the location of other sites over vast areas at high speed and low cost. This case study presents the results of a collaboration between Dymaxion Labs and Fundación Bunge y Born to harness this technique to create a comprehensive map of potential locations of illegal waste dumping sites in the region.

【14】 Hybrid Multimodal Fusion for Dimensional Emotion Recognition
Link: https://arxiv.org/abs/2110.08495

Authors: Ziyu Ma, Fuyan Ma, Bin Sun, Shutao Li
Affiliations: College of Electrical and Information Engineering, Hunan University; Key Laboratory of Visual Perception and Artificial Intelligence of Hunan Province, Changsha, China
Note: 8 pages, 2 figures, accepted by ACM MM 2021
Abstract: In this paper, we present our solutions for the MuSe-Stress sub-challenge and the MuSe-Physio sub-challenge of the Multimodal Sentiment Challenge (MuSe) 2021. The goal of the MuSe-Stress sub-challenge is to predict the level of emotional arousal and valence in a time-continuous manner from audio-visual recordings, and the goal of the MuSe-Physio sub-challenge is to predict the level of psycho-physiological arousal from a) human annotations fused with b) galvanic skin response (also known as Electrodermal Activity, EDA) signals from stressed people. Both sub-challenges use the Ulm-TSST dataset, a novel subset of the audio-visual-textual Ulm-Trier Social Stress dataset featuring German speakers in a Trier Social Stress Test (TSST)-induced stress situation. For the MuSe-Stress sub-challenge, we highlight three aspects of our solution: 1) audio-visual features and bio-signal features are used for emotional state recognition; 2) Long Short-Term Memory (LSTM) networks with a self-attention mechanism are utilized to capture complex temporal dependencies within the feature sequences; 3) a late fusion strategy is adopted to further boost the model's recognition performance by exploiting complementary information scattered across the multimodal sequences. Our proposed model achieves CCCs of 0.6159 and 0.4609 for valence and arousal, respectively, on the test set, both ranking in the top 3. For the MuSe-Physio sub-challenge, we first extract audio-visual features and bio-signal features from the multiple modalities. Then, an LSTM module with a self-attention mechanism, Gated Convolutional Neural Networks (GCNN), and an LSTM network are utilized to model the complex temporal dependencies in the sequence, and a late fusion strategy is again used. Our proposed method achieves a CCC of 0.5412 on the test set, which also ranks in the top 3.

【15】 Comparing Human and Machine Bias in Face Recognition 标题:人脸识别中人与机器偏差的比较 链接:https://arxiv.org/abs/2110.08396

作者:Samuel Dooley,Ryan Downing,George Wei,Nathan Shankar,Bradon Thymes,Gudrun Thorkelsdottir,Tiye Kurtz-Miott,Rachel Mattson,Olufemi Obiwumi,Valeriia Cherepanova,Micah Goldblum,John P Dickerson,Tom Goldstein 机构:University of Maryland, University of Massachusetts, Amherst, Pomona College, Howard University, University of California, San Diego, University of Georgia, Haverford College 摘要:最近的许多研究发现并讨论了面部分析技术中存在的严重偏差问题,揭示了按感知性别、皮肤类型、照明条件等划分的人群之间的表现差异。这些审计在测量算法偏差方面非常重要且成功,但存在两个主要挑战:(1)审计使用缺乏高质量元数据的面部识别数据集,如LFW和CelebA;(2)未将观察到的算法偏差与其人类替代方案的偏差进行比较。在本文中,我们发布了对LFW和CelebA数据集的改进,这将使未来的研究人员能够获得不受数据集重大缺陷(例如,同一图像同时出现在图库集和测试集中)影响的算法偏差测量值。我们还利用这些新数据设计了一系列具有挑战性的面部辨识和验证问题,并将其交由各种算法和一个大规模、均衡的人类评审者样本作答。我们发现,计算机模型和人类调查参与者在验证任务上的表现都显著更好,在两项任务中对深肤色或女性受试者的准确率通常较低,并且当其人口统计特征与问题中人物匹配时准确率较高。在这两项任务上,计算机模型的准确率均高于调查参与者,并表现出与人类调查参与者程度相似的偏差。 摘要:Much recent research has uncovered and discussed serious concerns of bias in facial analysis technologies, finding performance disparities between groups of people based on perceived gender, skin type, lighting condition, etc. These audits are immensely important and successful at measuring algorithmic bias but have two major challenges: the audits (1) use facial recognition datasets which lack quality metadata, like LFW and CelebA, and (2) do not compare their observed algorithmic bias to the biases of their human alternatives. In this paper, we release improvements to the LFW and CelebA datasets which will enable future researchers to obtain measurements of algorithmic bias that are not tainted by major flaws in the dataset (e.g. identical images appearing in both the gallery and test set). We also use these new data to develop a series of challenging facial identification and verification questions that we administered to various algorithms and a large, balanced sample of human reviewers. We find that both computer models and human survey participants perform significantly better at the verification task, generally obtain lower accuracy rates on dark-skinned or female subjects for both tasks, and obtain higher accuracy rates when their demographics match that of the question. Computer models are observed to achieve a higher level of accuracy than the survey participants on both tasks and exhibit bias to similar degrees as the human survey participants.

【16】 Comparative Analysis of Deep Learning Algorithms for Classification of COVID-19 X-Ray Images 标题:深度学习算法在冠状病毒X射线图像分类中的比较分析 链接:https://arxiv.org/abs/2110.09294

作者:Unsa Maheen,Khawar Iqbal Malik,Gohar Ali 机构:Department of Computer Science, University of Lahore – Pakistan 摘要:冠状病毒于2019年12月首次出现在中国武汉市,并迅速传播到全球。它对全球经济、教育、社会、日常生活和人类的总体健康都产生了非常有害的影响。要在初期遏制疾病的快速扩散,主要困难在于尽快找出新冠阳性患者。由于没有现成的自动化工具,对辅助诊断工具的需求随之上升。先前的研究发现,放射学技术获得的此类图像包含与冠状病毒相关的重要细节。将改进的人工智能(AI)系统与放射影像结合使用,有助于对该病毒进行精确、准确的诊断,也有助于解决偏远村庄专业医生短缺的问题。在我们的研究中,我们分析了使用胸部X射线影像检测新冠肺炎的不同技术,并检验了不同的预训练CNN模型AlexNet、VGG-16、MobileNet-V2、SqueezeNet、ResNet-34、ResNet-50和COVIDX-Net,以对新冠肺炎分类系统进行正确的分析。我们的研究表明,采用ResNet-34技术的预训练CNN模型给出了更高的性能,即98.33%的准确率、96.77%的精确率和98.36%的F1分数,优于其他CNN技术。我们的模型可能有助于研究人员微调CNN模型,以快速筛查新冠患者。 摘要:The Coronavirus was first emerged in December, in the city of China named Wuhan in 2019 and spread quickly all over the world. It has very harmful effects all over the global economy, education, social, daily living and general health of humans. To restrict the quick expansion of the disease initially, main difficulty is to explore the positive corona patients as quickly as possible. As there are no automatic tool kits accessible the requirement for supplementary diagnostic tools has risen up. Previous studies have findings acquired from radiological techniques proposed that this kind of images have important details related to the coronavirus. The usage of modified Artificial Intelligence (AI) system in combination with radio-graphical images can be fruitful for the precise and exact solution of this virus and can also be helpful to conquer the issue of deficiency of professional physicians in distant villages. In our research, we analyze the different techniques for the detection of COVID-19 using X-Ray radiographic images of the chest, we examined the different pre-trained CNN models AlexNet, VGG-16, MobileNet-V2, SqueezeNet, ResNet-34, ResNet-50 and COVIDX-Net to correct analytics for classification system of COVID-19. Our study shows that the pre-trained CNN Model with ResNet-34 technique gives the higher accuracy rate of 98.33%, 96.77% precision, and 98.36 F1-score, which is better than other CNN techniques. Our model may be helpful for the researchers to fine-tune the CNN model for the quick screening of COVID patients.
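摘要比较了多种预训练CNN;下面用torchvision给出微调预训练ResNet-34做胸片分类的极简示意(类别数与数据来源均为假设,仅演示迁移学习的基本写法)。

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 3                          # 假设: COVID-19 / 正常 / 其他肺炎
model = models.resnet34(pretrained=True)            # 加载ImageNet预训练权重
model.fc = nn.Linear(model.fc.in_features, num_classes)  # 替换分类头

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 单步训练示意(真实场景中images/labels应来自胸片DataLoader)
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```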

分割|语义相关(12篇)

【1】 Boosting Image Outpainting with Semantic Layout Prediction 标题:利用语义布局预测促进图像外绘 链接:https://arxiv.org/abs/2110.09267

作者:Ye Ma,Jin Ma,Min Zhou,Quan Chen,Tiezheng Ge,Yuning Jiang,Tong Lin 机构:Alibaba Group, Peking University 摘要:图像外绘的目的是扩展图像的当前边界,并在已知区域的基础上生成新的区域。以前的方法采用生成对抗网络(GANs)来合成真实感图像。然而,由于缺乏明确的语义表示,当外绘区域复杂且对象多样时,会产生模糊和异常的图像像素。在这项工作中,我们将外绘任务分解为两个阶段。首先,我们训练一个GAN在语义分割域而不是图像域扩展区域。其次,训练另一个GAN模型,基于扩展后的语义布局合成真实图像。第一个模型关注低频上下文,如大小、类别和其他语义线索;第二个模型关注高频上下文,如颜色和纹理。通过这种设计,我们的方法可以更容易地处理语义线索,因此在复杂场景中工作得更好。我们在各种数据集上评估我们的框架,并进行定量和定性分析。实验表明,我们的方法能生成合理的扩展语义布局和图像,优于现有的模型。 摘要:The objective of image outpainting is to extend image current border and generate new regions based on known ones. Previous methods adopt generative adversarial networks (GANs) to synthesize realistic images. However, the lack of explicit semantic representation leads to blurry and abnormal image pixels when the outpainting areas are complex and with various objects. In this work, we decompose the outpainting task into two stages. Firstly, we train a GAN to extend regions in semantic segmentation domain instead of image domain. Secondly, another GAN model is trained to synthesize real images based on the extended semantic layouts. The first model focuses on low frequent context such as sizes, classes and other semantic cues while the second model focuses on high frequent context like color and texture. By this design, our approach can handle semantic clues more easily and hence works better in complex scenarios. We evaluate our framework on various datasets and make quantitative and qualitative analysis. Experiments demonstrate that our method generates reasonable extended semantic layouts and images, outperforming state-of-the-art models.
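两阶段外绘在推理时可以理解为两个生成器的级联:先在语义分割域扩展布局,再据此合成图像。下面的极简示意仅说明数据流(两个占位卷积代替真实生成器,类别数与填充尺寸均为假设):

```python
import torch
import torch.nn as nn

# 占位生成器: 真实实现应为论文中的两个GAN生成器
seg_extender = nn.Conv2d(21, 21, 3, padding=1)    # 阶段1: 扩展语义布局(假设21类)
image_synth = nn.Conv2d(21 + 3, 3, 3, padding=1)  # 阶段2: 由布局+已知图像合成RGB

def outpaint(image, seg, pad=32):
    """image: (B, 3, H, W); seg: (B, 21, H, W) one-hot语义图。
    先零填充出待外绘区域, 阶段1补全语义布局, 阶段2合成外绘图像。"""
    image_p = nn.functional.pad(image, [pad] * 4)          # 四周留出外绘区域
    seg_p = nn.functional.pad(seg, [pad] * 4)
    layout = seg_extender(seg_p)                            # 低频: 类别/大小等语义
    rgb = image_synth(torch.cat([layout, image_p], dim=1))  # 高频: 颜色/纹理
    return rgb

out = outpaint(torch.randn(1, 3, 128, 128), torch.randn(1, 21, 128, 128))
print(out.shape)  # torch.Size([1, 3, 192, 192])
```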

【2】 A Unified Framework for Generalized Low-Shot Medical Image Segmentation with Scarce Data 标题:一种基于稀缺数据的广义低镜头医学图像分割统一框架 链接:https://arxiv.org/abs/2110.09260

作者:Hengji Cui,Dong Wei,Kai Ma,Shi Gu,Yefeng Zheng 机构:School of Computer Science and Engineering, University of Electronic Science and Technology of China 备注:Published in IEEE TRANSACTIONS ON MEDICAL IMAGING 摘要:利用深度神经网络(DNNs)进行医学图像分割已经取得了显著的进展。然而,DNN通常需要大量数据和注释进行训练,这两者都很难获得且成本高昂。在这项工作中,我们提出了一个基于距离度量学习(DML)的通用低镜头(单镜头和少镜头)医学图像分割统一框架。大多数现有方法只处理注释的缺乏而假设数据丰富;与之不同,我们的框架能够应对数据和注释两者的极端稀缺,这对于罕见疾病来说是理想的。通过DML,该框架为每个类别学习一个多模态混合表示,并基于像素深度嵌入和类别表示之间的余弦距离执行密集预测。多模态表示有效地利用了受试者之间的相似性和类内变化,以克服由于数据极其有限而导致的过度拟合。此外,我们为多模态混合分布提出了自适应混合系数,以自适应地强调更适合当前输入的模式。这些表示隐式地嵌入为fc层的权重,使得余弦距离可以通过前向传播高效计算。在我们对脑部MRI和腹部CT数据集的实验中,所提出的框架在低镜头分割方面取得了优于标准DNN方法(3D U-Net)和经典配准方法(ANTs)的性能,例如,使用单个训练样本实现脑组织/腹部多器官分割的平均Dice系数为81%/69%,而U-Net和ANTs分别为52%/31%和72%/35%。 摘要:Medical image segmentation has achieved remarkable advancements using deep neural networks (DNNs). However, DNNs often need big amounts of data and annotations for training, both of which can be difficult and costly to obtain. In this work, we propose a unified framework for generalized low-shot (one- and few-shot) medical image segmentation based on distance metric learning (DML). Unlike most existing methods which only deal with the lack of annotations while assuming abundance of data, our framework works with extreme scarcity of both, which is ideal for rare diseases. Via DML, the framework learns a multimodal mixture representation for each category, and performs dense predictions based on cosine distances between the pixels' deep embeddings and the category representations. The multimodal representations effectively utilize the inter-subject similarities and intraclass variations to overcome overfitting due to extremely limited data. In addition, we propose adaptive mixing coefficients for the multimodal mixture distributions to adaptively emphasize the modes better suited to the current input. The representations are implicitly embedded as weights of the fc layer, such that the cosine distances can be computed efficiently via forward propagation. In our experiments on brain MRI and abdominal CT datasets, the proposed framework achieves superior performances for low-shot segmentation towards standard DNN-based (3D U-Net) and classical registration-based (ANTs) methods, e.g., achieving mean Dice coefficients of 81%/69% for brain tissue/abdominal multiorgan segmentation using a single training sample, as compared to 52%/31% and 72%/35% by the U-Net and ANTs, respectively.
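该框架的核心是在"像素深度嵌入与类别表示之间的余弦距离"上做密集预测,且类别表示隐式嵌入为fc层权重。以下是这一思想的极简PyTorch示意(通道数、类别数与温度系数均为假设):

```python
import torch
import torch.nn.functional as F

def cosine_seg_logits(feat, prototypes, tau=0.1):
    """feat: (B, C, H, W) 像素深度嵌入; prototypes: (K, C) 类别表示,
    可视作fc层的权重。返回 (B, K, H, W) 的余弦相似度logits。"""
    feat = F.normalize(feat, dim=1)          # 对通道维做L2归一化
    protos = F.normalize(prototypes, dim=1)
    # 归一化后, fc层(1x1卷积)的前向传播即等价于计算余弦相似度
    logits = torch.einsum('bchw,kc->bkhw', feat, protos)
    return logits / tau                      # 温度缩放(假设值)

feat = torch.randn(2, 64, 32, 32)
protos = torch.randn(5, 64)                  # 5个类别的表示
pred = cosine_seg_logits(feat, protos).argmax(dim=1)   # (2, 32, 32) 密集预测
```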

【3】 Color Image Segmentation Using Multi-Objective Swarm Optimizer and Multi-level Histogram Thresholding 标题:基于多目标群优化和多级直方图阈值的彩色图像分割 链接:https://arxiv.org/abs/2110.09217

作者:Mohammadreza Naderi Boldaji,Samaneh Hosseini Semnani 机构:First affiliation, Address, City and Postcode, Country, Second affiliation, Address, City and Postcode, Country 备注:11 pages, 6 figures 摘要:群体智能优化器和计算机处理能力的快速发展为设计更精确、稳定和全面的彩色图像分割方法提供了机会。本文提出了一种新的无监督图像分割方法,将直方图阈值化方法(Kapur熵和Otsu方法)与不同的多目标群智能算法(MOPSO、MOGWO、MSSA和MOALO)相结合,对彩色图像的三维直方图进行阈值化。更准确地说,该方法首先组合传统阈值算法的目标函数来设计综合目标函数,然后在优化所设计目标函数的过程中使用多目标优化器来寻找最佳阈值。此外,我们的方法使用三维空间中的向量目标函数,可以用同一组阈值同时处理图像全部颜色通道的分割。为了优化这个向量目标函数,我们使用了可以同时优化多个目标函数的多目标群优化器。因此,我们的方法考虑通道之间的依赖关系,以找到同时满足各颜色通道目标函数(我们称之为向量目标函数)的阈值。用同一组阈值分割全部颜色通道还有一个好处:我们提出的方法比其他阈值分割算法需要更少的阈值来分割图像,因此只需更少的内存空间来保存阈值。当我们想要把大量图像分割成大量区域时,这会有很大帮助。主观和客观结果表明,该方法优于对彩色图像各通道直方图分别进行阈值化的传统阈值方法。 摘要:Rapid developments in swarm intelligence optimizers and computer processing abilities make opportunities to design more accurate, stable, and comprehensive methods for color image segmentation. This paper presents a new way for unsupervised image segmentation by combining histogram thresholding methods (Kapur's entropy and Otsu's method) and different multi-objective swarm intelligence algorithms (MOPSO, MOGWO, MSSA, and MOALO) to thresholding 3D histogram of a color image. More precisely, this method first combines the objective function of traditional thresholding algorithms to design comprehensive objective functions then uses multi-objective optimizers to find the best thresholds during the optimization of designed objective functions. Also, our method uses a vector objective function in 3D space that could simultaneously handle the segmentation of entire image color channels with the same thresholds. To optimize this vector objective function, we employ multiobjective swarm optimizers that can optimize multiple objective functions at the same time. Therefore, our method considers dependencies between channels to find the thresholds that satisfy objective functions of color channels (which we name as vector objective function) simultaneously. Segmenting entire color channels with the same thresholds also benefits from the fact that our proposed method needs fewer thresholds to segment the image than other thresholding algorithms; thus, it requires less memory space to save thresholds. It helps a lot when we want to segment many images to many regions. The subjective and objective results show the superiority of this method to traditional thresholding methods that separately threshold histograms of a color image.
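下面给出摘要中两个传统阈值目标(Otsu类间方差与Kapur熵)在一维直方图上的NumPy示意;论文将此类目标组合成三维直方图上的向量目标函数并交由多目标群优化器求解,此处仅演示单通道、单阈值的目标计算(直方图为随机生成的假设数据):

```python
import numpy as np

def otsu_variance(hist, t):
    """Otsu目标: 阈值t处的类间方差(越大越好)。hist为灰度直方图。"""
    p = hist / hist.sum()
    w0, w1 = p[:t].sum(), p[t:].sum()
    if w0 == 0 or w1 == 0:
        return 0.0
    mu0 = (np.arange(t) * p[:t]).sum() / w0
    mu1 = (np.arange(t, len(p)) * p[t:]).sum() / w1
    return w0 * w1 * (mu0 - mu1) ** 2

def kapur_entropy(hist, t, eps=1e-12):
    """Kapur目标: 前景与背景的熵之和(越大越好)。"""
    p = hist / hist.sum()
    w0, w1 = p[:t].sum() + eps, p[t:].sum() + eps
    h0 = -np.sum(p[:t] / w0 * np.log(p[:t] / w0 + eps))
    h1 = -np.sum(p[t:] / w1 * np.log(p[t:] / w1 + eps))
    return h0 + h1

hist = np.random.randint(1, 100, size=256).astype(float)   # 假设的直方图
t_best = max(range(1, 256), key=lambda t: otsu_variance(hist, t))
print(t_best, kapur_entropy(hist, t_best))
```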

【4】 FEANet: Feature-Enhanced Attention Network for RGB-Thermal Real-time Semantic Segmentation 标题:FEANet:用于RGB-热(RGB-T)实时语义分割的特征增强注意力网络 链接:https://arxiv.org/abs/2110.08988

作者:Fuqin Deng,Hua Feng,Mingjian Liang,Hongmin Wang,Yong Yang,Yuan Gao,Junfeng Chen,Junjie Hu,Xiyue Guo,Tin Lun Lam 机构:Authors contributed equally 1School of Intelligent Manufacturing, the Wuyi University, 2School of Science and Engineering, the Chinese University of HongKong 备注:7 pages, 5 figures 摘要:用于语义分割的RGB热信息(RGB-T)近年来得到了广泛的研究。然而,现有的大多数RGB-T语义分割通常会牺牲空间分辨率以实现实时推理速度,导致性能低下。为了更好地提取细节空间信息,我们提出了一种用于RGB-T语义分割任务的两阶段特征增强注意网络(FEANet)。具体来说,我们引入了一个特征增强注意模块(FEAM),从通道和空间视图中挖掘和增强多层次特征。得益于所提出的FEAM模块,我们的FEANet可以保留空间信息,并将更多注意力转移到融合RGB-T图像的高分辨率特征上。在城市场景数据集上进行的大量实验表明,我们的FEANet在客观度量和主观视觉比较方面优于其他最先进的(SOTA)RGB-T方法(在全球mAcc中为+2.6%,在全球mIoU中为+0.8%)。对于480 x 640 RGB-T测试图像,我们的FEANet可以在NVIDIA GeForce RTX 2080 Ti卡上以实时速度运行。 摘要:The RGB-Thermal (RGB-T) information for semantic segmentation has been extensively explored in recent years. However, most existing RGB-T semantic segmentation usually compromises spatial resolution to achieve real-time inference speed, which leads to poor performance. To better extract detail spatial information, we propose a two-stage Feature-Enhanced Attention Network (FEANet) for the RGB-T semantic segmentation task. Specifically, we introduce a Feature-Enhanced Attention Module (FEAM) to excavate and enhance multi-level features from both the channel and spatial views. Benefited from the proposed FEAM module, our FEANet can preserve the spatial information and shift more attention to high-resolution features from the fused RGB-T images. Extensive experiments on the urban scene dataset demonstrate that our FEANet outperforms other state-of-the-art (SOTA) RGB-T methods in terms of objective metrics and subjective visual comparison (+2.6% in global mAcc and +0.8% in global mIoU). For the 480 x 640 RGB-T test images, our FEANet can run with a real-time speed on an NVIDIA GeForce RTX 2080 Ti card.
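FEAM从通道和空间两个视角挖掘并增强多层次特征;下面是按这一思路组合通道注意力与空间注意力的极简PyTorch示意(结构细节为假设,并非论文原版模块):

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """先做通道注意力(全局池化+MLP), 再做空间注意力(通道聚合+大核卷积)。"""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca                                 # 通道加权
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        sa = torch.sigmoid(self.spatial(s))        # (B, 1, H, W)
        return x * sa                              # 空间加权

feam = ChannelSpatialAttention(64)
out = feam(torch.randn(1, 64, 120, 160))           # 输出形状与输入相同
```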

【5】 Uncertainty-Aware Semi-Supervised Few Shot Segmentation 标题:不确定性感知的半监督Few-Shot分割 链接:https://arxiv.org/abs/2110.08954

作者:Soopil Kim,Philip Chikontwe,Sang Hyun Park 机构:Department of Robotics Engineering, DGIST 备注:9 pages 摘要:少数镜头分割(FSS)的目的是仅使用几个带注释的支持样本来学习查询图像中目标对象的像素级分类。这是一项具有挑战性的工作,因为它需要对目标对象的外观变化进行建模,并在信息有限的情况下对查询图像和支持图像之间的各种视觉线索进行建模。为了解决这个问题,我们提出了一种半监督FSS策略,该策略利用来自未标记图像的附加原型,并采用不确定性引导的伪标签细化。为了从未标记的图像中获得可靠的原型,我们对神经网络进行元训练,以联合预测分割和估计预测的不确定性。我们使用不确定性估计来排除伪标签构造的高度不确定性预测,以获得基于改进伪标签的额外原型。在推理过程中,使用支持图像和未标记图像的原型(包括查询图像的低级特征)预测查询分割。我们的方法是端到端的,可以很容易地补充现有的方法,而不需要额外的训练来使用未标记的样本。在PASCAL-$5^i$和COCO-$20^i$上进行的大量实验表明,我们的模型可以有效地去除不可靠的预测,以细化伪标签,并显著改善最先进的性能。 摘要:Few shot segmentation (FSS) aims to learn pixel-level classification of a target object in a query image using only a few annotated support samples. This is challenging as it requires modeling appearance variations of target objects and the diverse visual cues between query and support images with limited information. To address this problem, we propose a semi-supervised FSS strategy that leverages additional prototypes from unlabeled images with uncertainty guided pseudo label refinement. To obtain reliable prototypes from unlabeled images, we meta-train a neural network to jointly predict segmentation and estimate the uncertainty of predictions. We employ the uncertainty estimates to exclude predictions with high degrees of uncertainty for pseudo label construction to obtain additional prototypes based on the refined pseudo labels. During inference, query segmentation is predicted using prototypes from both support and unlabeled images including low-level features of the query images. Our approach is end-to-end and can easily supplement existing approaches without the requirement of additional training to employ unlabeled samples. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate that our model can effectively remove unreliable predictions to refine pseudo labels and significantly improve upon state-of-the-art performances.
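该方法的关键一步是排除高不确定性预测来构造伪标签。以下示意以预测熵为不确定性的代理,并按阈值屏蔽不可靠像素(论文中的不确定性由元训练网络联合估计,此处的代理与阈值均为假设):

```python
import torch
import torch.nn.functional as F

def filter_pseudo_labels(logits, max_entropy_ratio=0.5, ignore_index=255):
    """logits: (B, K, H, W) 对未标记图像的预测。
    返回伪标签, 其中高熵(不确定)像素被置为ignore_index以便训练时忽略。"""
    prob = F.softmax(logits, dim=1)
    entropy = -(prob * prob.clamp_min(1e-12).log()).sum(dim=1)   # (B, H, W)
    max_ent = torch.log(torch.tensor(float(logits.shape[1])))    # 均匀分布的熵上界
    pseudo = prob.argmax(dim=1)
    pseudo[entropy > max_entropy_ratio * max_ent] = ignore_index
    return pseudo

logits = torch.randn(2, 2, 64, 64)           # 前景/背景二类FSS示例
labels = filter_pseudo_labels(logits)
print((labels == 255).float().mean())        # 被排除像素的比例
```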

【6】 Temporally stable video segmentation without video annotations 标题:无视频注释的时间稳定视频分割 链接:https://arxiv.org/abs/2110.08893

作者:Aharon Azulay,Tavi Halperin,Orestis Vantzos,Nadav Bornstein,Ofir Bibi 机构:. Lightricks, . The Hebrew University 摘要:时间上一致的密集视频注释很少,而且很难收集。相比之下,图像分割数据集(和预训练模型)普遍存在,并且对于任何新任务都更容易标注。在本文中,我们介绍了一种利用基于光流的一致性度量,以无监督方式将静止图像分割模型适配到视频的方法。为了确保推断出的分割视频在实践中更稳定,我们通过用户研究验证了该一致性度量与人类判断有良好的相关性。我们以该一致性度量作为损失训练了一个新的多输入多输出解码器,并结合细化现有图像分割数据集的技术和时间加权引导滤波器;我们观察到生成的分割视频的稳定性得到改善,而准确度损失最小。 摘要:Temporally consistent dense video annotations are scarce and hard to collect. In contrast, image segmentation datasets (and pre-trained models) are ubiquitous, and easier to label for any novel task. In this paper, we introduce a method to adapt still image segmentation models to video in an unsupervised manner, by using an optical flow-based consistency measure. To ensure that the inferred segmented videos appear more stable in practice, we verify that the consistency measure is well correlated with human judgement via a user study. Training a new multi-input multi-output decoder using this measure as a loss, together with a technique for refining current image segmentation datasets and a temporal weighted-guided filter, we observe stability improvements in the generated segmented videos with minimal loss of accuracy.
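基于光流的一致性度量通常可以这样计算:用光流把前一帧的分割掩码扭曲到当前帧,再比较两者。下面的PyTorch示意演示该计算(光流此处随机生成,实际应来自光流估计器;grid_sample完成反向扭曲):

```python
import torch
import torch.nn.functional as F

def warp_with_flow(mask, flow):
    """mask: (B, 1, H, W); flow: (B, 2, H, W), 单位为像素。"""
    b, _, h, w = mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).float()           # (H, W, 2), (x, y)顺序
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)    # 加上光流位移
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1          # 归一化到[-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(mask, grid, align_corners=True)

mask_prev = (torch.rand(1, 1, 64, 64) > 0.5).float()
flow = torch.randn(1, 2, 64, 64)                            # 假设的光流
mask_curr = (torch.rand(1, 1, 64, 64) > 0.5).float()
consistency = F.l1_loss(warp_with_flow(mask_prev, flow), mask_curr)
# consistency越小说明分割在时间上越一致, 可作为训练损失或评估度量
```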

【7】 Contrastive Learning of Visual-Semantic Embeddings 标题:视觉语义嵌入的对比学习 链接:https://arxiv.org/abs/2110.08872

作者:Anurag Jain,Yashaswi Verma 机构: Verma is with the Department of Computer Science and Engi-neering, Indian Institute of Technology 摘要:对比学习是一种强大的技术,可以学习语义独特且几何不变的表示。虽然大多数早期的方法已经证明了其在单模态学习任务(如图像分类)上的有效性,但最近有一些尝试将此想法扩展到多模态数据。在本文中,我们提出了两个基于归一化交叉熵的损失函数,通过批量对比训练来完成学习联合视觉语义嵌入的任务。在一个批次中,对于来自一个模态的给定锚点,我们仅从另一个模态选取其负样本,并基于所有负样本造成的预期违反来定义第一个对比损失。接下来,我们更新此损失,定义仅基于最难负样本造成的违反的第二个对比损失。我们在MS-COCO和Flickr30K数据集上的跨模态图像到文本和文本到图像检索任务中,将我们的结果与现有视觉语义嵌入方法进行比较:我们在MS-COCO数据集上的表现优于最新技术,并在Flickr30K数据集上取得了可比的结果。 摘要:Contrastive learning is a powerful technique to learn representations that are semantically distinctive and geometrically invariant. While most of the earlier approaches have demonstrated its effectiveness on single-modality learning tasks such as image classification, recently there have been a few attempts towards extending this idea to multi-modal data. In this paper, we propose two loss functions based on normalized cross-entropy to perform the task of learning joint visual-semantic embedding using batch contrastive training. In a batch, for a given anchor point from one modality, we consider its negatives only from another modality, and define our first contrastive loss based on expected violations incurred by all the negatives. Next, we update this loss and define the second contrastive loss based on the violation incurred only by the hardest negative. We compare our results with existing visual-semantic embedding methods on cross-modal image-to-text and text-to-image retrieval tasks using the MS-COCO and Flickr30K datasets, where we outperform the state-of-the-art on the MS-COCO dataset and achieve comparable results on the Flickr30K dataset.
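以下示意给出摘要中两类损失的一种可能写法:批次内每个锚点只取另一模态的负样本,第一种损失对全部负样本计算归一化交叉熵,第二种只让最难负样本与正样本竞争(温度与margin均为假设值,并非论文的精确公式):

```python
import torch
import torch.nn.functional as F

def cross_modal_losses(img, txt, tau=0.07):
    """img, txt: (N, D) 成对的图像/文本嵌入。返回(全负样本损失, 最难负样本损失)。"""
    img, txt = F.normalize(img, dim=1), F.normalize(txt, dim=1)
    sim = img @ txt.t() / tau                 # (N, N), 对角线为正样本对
    labels = torch.arange(img.size(0))
    # 损失一: 交叉熵, 分母包含另一模态的全部负样本(双向对称)
    loss_all = 0.5 * (F.cross_entropy(sim, labels) +
                      F.cross_entropy(sim.t(), labels))
    # 损失二: 仅保留最难负样本与正样本竞争
    mask = torch.eye(img.size(0), dtype=torch.bool)
    hardest = sim.masked_fill(mask, float('-inf')).amax(dim=1)
    pos = sim.diagonal()
    loss_hard = F.relu(hardest - pos + 1.0).mean()   # margin=1.0为假设值
    return loss_all, loss_hard

l_all, l_hard = cross_modal_losses(torch.randn(8, 256), torch.randn(8, 256))
```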

【8】 Inconsistency-aware Uncertainty Estimation for Semi-supervised Medical Image Segmentation 标题:基于不一致性感知的半监督医学图像分割不确定性估计 链接:https://arxiv.org/abs/2110.08762

作者:Yinghuan Shi,Jian Zhang,Tong Ling,Jiwen Lu,Yefeng Zheng,Qian Yu,Lei Qi,Yang Gao 机构: Tsinghua University 备注:Accepted by IEEE Transactions on Medical Imaging (TMI) 摘要:在半监督医学图像分割中,大多数以前的工作都基于一个共同的假设,即高熵意味着更高的不确定性。在本文中,我们研究了一种新的不确定性估计方法。我们观察到,当在一定范围内分配不同的误分类代价时,如果某个像素的分割结果变得不一致,则该像素的分割表现出相对的不确定性。因此,我们提出了一种基于这种不确定性估计和分离式自训练策略的新型半监督分割模型,即保守-激进网络(简称CoraNet)。具体而言,我们的CoraNet模型由三个主要部分组成:保守-激进模块(CRM)、确定区域分割网络(C-SN)和不确定区域分割网络(UC-SN),它们可以以端到端的方式交替训练。我们在公开的基准数据集上对各种分割任务广泛评估了我们的方法,包括CT胰腺、MR心内膜以及ACDC数据集上的MR多结构分割。与目前的技术水平相比,我们的CoraNet表现出了更优的性能。此外,我们还分析了它与传统的半监督医学图像分割不确定性估计方法的联系和区别。 摘要:In semi-supervised medical image segmentation, most previous works draw on the common assumption that higher entropy means higher uncertainty. In this paper, we investigate a novel method of estimating uncertainty. We observe that, when assigned different misclassification costs in a certain degree, if the segmentation result of a pixel becomes inconsistent, this pixel shows a relative uncertainty in its segmentation. Therefore, we present a new semi-supervised segmentation model, namely, conservative-radical network (CoraNet in short) based on our uncertainty estimation and separate self-training strategy. In particular, our CoraNet model consists of three major components: a conservative-radical module (CRM), a certain region segmentation network (C-SN), and an uncertain region segmentation network (UC-SN) that could be alternatively trained in an end-to-end manner. We have extensively evaluated our method on various segmentation tasks with publicly available benchmark datasets, including CT pancreas, MR endocardium, and MR multi-structures segmentation on the ACDC dataset. Compared with the current state of the art, our CoraNet has demonstrated superior performance. In addition, we have also analyzed its connection with and difference from conventional methods of uncertainty estimation in semi-supervised medical image segmentation.
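CoraNet的不确定性思想可以概括为:给前景分配不同的误分类代价,得到"保守"与"激进"两个预测,二者不一致的像素即视为不确定。下面是这一思想的极简示意(代价偏置的具体形式为假设):

```python
import torch

def inconsistency_uncertainty(logits, cost=2.0):
    """logits: (B, 2, H, W) 二类分割输出。
    对前景logit加/减代价偏置模拟不同的误分类代价:
    保守与激进预测不一致的像素即为不确定区域。"""
    bias = torch.log(torch.tensor(cost))
    conservative = logits.clone(); conservative[:, 1] -= bias   # 惩罚前景→少报
    radical = logits.clone();      radical[:, 1] += bias        # 鼓励前景→多报
    pred_c = conservative.argmax(dim=1)
    pred_r = radical.argmax(dim=1)
    uncertain = pred_c != pred_r                  # (B, H, W) 不确定掩码
    certain_fg = (pred_c == 1) & (pred_r == 1)    # 两种代价下都判为前景
    return uncertain, certain_fg

logits = torch.randn(1, 2, 64, 64)
unc, fg = inconsistency_uncertainty(logits)
print(unc.float().mean())   # 不确定像素的比例
```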

【9】 LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation 标题:LoveDA:一种支持领域自适应语义分割的遥感土地覆盖数据集 链接:https://arxiv.org/abs/2110.08733

作者:Junjue Wang,Zhuo Zheng,Ailong Ma,Xiaoyan Lu,Yanfei Zhong 机构:State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan , China 备注:Accepted by NeurIPS 2021 Datasets and Benchmarks Track 摘要:深度学习方法在遥感高空间分辨率(HSR)土地覆盖制图方面已显示出良好的效果。然而,城市和农村场景可能呈现完全不同的地理景观,这些算法泛化能力不足,阻碍了城市级或国家级制图。现有的HSR土地覆盖数据集大多侧重于学习语义表示的研究,忽略了模型的可迁移性。在本文中,我们引入了土地覆盖领域自适应语义分割(LoveDA)数据集来推进语义学习和可迁移学习。LoveDA数据集包含5927张HSR图像和来自三个不同城市的166768个注释对象。与现有数据集相比,LoveDA数据集涵盖两个领域(城市和农村),这带来了相当大的挑战,原因在于:1)多尺度对象;2)复杂背景样本;3)不一致的类别分布。LoveDA数据集同时适用于土地覆盖语义分割和无监督领域自适应(UDA)任务。据此,我们在LoveDA数据集上对11种语义分割方法和8种UDA方法进行了基准测试。为了应对上述挑战,我们还开展了一些探索性研究,包括多尺度体系结构和策略、额外的背景监督和伪标签分析。代码和数据可访问https://github.com/Junjue-Wang/LoveDA. 摘要:Deep learning approaches have shown promising results in remote sensing high spatial resolution (HSR) land-cover mapping. However, urban and rural scenes can show completely different geographical landscapes, and the inadequate generalizability of these algorithms hinders city-level or national-level mapping. Most of the existing HSR land-cover datasets mainly promote the research of learning semantic representation, thereby ignoring the model transferability. In this paper, we introduce the Land-cOVEr Domain Adaptive semantic segmentation (LoveDA) dataset to advance semantic and transferable learning. The LoveDA dataset contains 5927 HSR images with 166768 annotated objects from three different cities. Compared to the existing datasets, the LoveDA dataset encompasses two domains (urban and rural), which brings considerable challenges due to the: 1) multi-scale objects; 2) complex background samples; and 3) inconsistent class distributions. The LoveDA dataset is suitable for both land-cover semantic segmentation and unsupervised domain adaptation (UDA) tasks. Accordingly, we benchmarked the LoveDA dataset on eleven semantic segmentation methods and eight UDA methods. Some exploratory studies including multi-scale architectures and strategies, additional background supervision, and pseudo-label analysis were also carried out to address these challenges. The code and data are available at https://github.com/Junjue-Wang/LoveDA.

【10】 Pseudo-label refinement using superpixels for semi-supervised brain tumour segmentation 标题:基于超像素的伪标记法在半监督脑肿瘤分割中的应用 链接:https://arxiv.org/abs/2110.08589

作者:Bethany H. Thompson,Gaetano Di Caterina,Jeremy P. Voisey 机构:† Canon Medical Research Europe Ltd., AI Research, Edinburgh, UK, ⋆ University of Strathclyde, Electronic & Electrical Engineering, Glasgow, UK 备注:This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 摘要:使用有限注释训练神经网络是医学领域的一个重要问题。深度神经网络(DNN)通常需要大型带注释的数据集来实现可接受的性能,而在医学领域,由于需要专家放射科医生花费大量时间,因此很难获得可接受的性能。半监督学习的目的是通过在利用大量未标记数据的同时,使用很少的带注释数据来学习分段来克服这个问题。然而,最著名的技术是利用推断的伪标签,它容易受到不准确的伪标签的影响,从而降低性能。我们提出了一个基于超像素(相邻像素的有意义簇)的框架,以提高伪标签的准确性并解决这个问题。我们的框架将超像素与半监督学习相结合,利用超像素映射的特征和边缘在训练过程中细化伪标签。该方法在用于脑肿瘤区域分割的多模式磁共振成像(MRI)数据集上进行评估。我们的方法表明,当注释者负担减少且只有5名注释患者可用时,与标准的半监督伪标记基线相比,性能有所提高。对于整个肿瘤和肿瘤核心区域,我们分别报告了DSC=0.824和DSC=0.707。 摘要:Training neural networks using limited annotations is an important problem in the medical domain. Deep Neural Networks (DNNs) typically require large, annotated datasets to achieve acceptable performance which, in the medical domain, are especially difficult to obtain as they require significant time from expert radiologists. Semi-supervised learning aims to overcome this problem by learning segmentations with very little annotated data, whilst exploiting large amounts of unlabelled data. However, the best-known technique, which utilises inferred pseudo-labels, is vulnerable to inaccurate pseudo-labels degrading the performance. We propose a framework based on superpixels - meaningful clusters of adjacent pixels - to improve the accuracy of the pseudo labels and address this issue. Our framework combines superpixels with semi-supervised learning, refining the pseudo-labels during training using the features and edges of the superpixel maps. This method is evaluated on a multimodal magnetic resonance imaging (MRI) dataset for the task of brain tumour region segmentation. Our method demonstrates improved performance over the standard semi-supervised pseudo-labelling baseline when there is a reduced annotator burden and only 5 annotated patients are available. We report DSC=0.824 and DSC=0.707 for the test set whole tumour and tumour core regions respectively.
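在每个超像素内做多数投票,是"利用超像素细化伪标签"思想的一种直观近似;下面用scikit-image的SLIC给出示意(论文还利用了超像素的特征与边缘信息并在训练中细化,此处从略):

```python
import numpy as np
from skimage.segmentation import slic

def refine_with_superpixels(image, pseudo_label, n_segments=200):
    """image: (H, W, 3) float图像; pseudo_label: (H, W) 整数伪标签。
    在每个超像素内以多数类覆盖伪标签, 使其与图像结构对齐。"""
    segments = slic(image, n_segments=n_segments, start_label=0)
    refined = pseudo_label.copy()
    for s in np.unique(segments):
        m = segments == s
        vals, counts = np.unique(pseudo_label[m], return_counts=True)
        refined[m] = vals[np.argmax(counts)]     # 超像素内多数投票
    return refined

img = np.random.rand(128, 128, 3)
pl = (np.random.rand(128, 128) > 0.5).astype(int)
refined = refine_with_superpixels(img, pl)
```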

【11】 DBSegment: Fast and robust segmentation of deep brain structures -- Evaluation of transportability across acquisition domains 标题:DBSegment:大脑深层结构的快速而稳健的分割--跨采集域的可移植性评估 链接:https://arxiv.org/abs/2110.09473

作者:Mehri Baniasadi,Mikkel V. Petersen,Jorge Goncalves,Andreas Horn,Vanja Vlasov,Frank Hertel,Andreas Husch 机构:Luxembourg Center for Systems Biomedicine, University of Luxembourg, National Department of Neurosurgery, Centre Hospitalier de Luxembourg, Department of Clinical Medicine, Center of Functionally Integrative Neuroscience, University of Aarhus, Jorge Gonçalves 摘要:从磁共振图像中分割脑深部结构对于患者诊断、手术规划和研究非常重要。目前大多数最先进的解决方案都采用基于配准的分割(segmentation-by-registration)方法,将受试者MRI映射到具有明确分割的模板。然而,基于配准的管道非常耗时,因此限制了其临床应用。本文利用深度学习提供了一种鲁棒高效的脑深部分割解决方案。该方法包括一个使所有MRI图像统一到相同方向的预处理步骤,随后是使用nnU-Net框架的卷积神经网络。我们总共使用了来自研究和临床收集的14个数据集,其中7个用于训练和验证,7个保留用于独立测试。我们以基于配准的方法生成的标签训练网络,分割30个脑深部结构以及一个脑掩模。我们通过留一数据集交叉验证和对外部数据集的广泛测试来评估网络的泛化能力。此外,我们通过在不同域上分别评估结果来考察跨域可移植性。与基于配准的金标准相比,独立测试数据集的平均DSC为0.89$\pm$0.04。在我们的测试系统上,计算时间从参考的基于配准管道的42分钟减少到1分钟。我们提出的方法快速、鲁棒,泛化的可靠性高,并可以扩展到其他大脑结构的分割。该方法在GitHub上公开可用,并提供了一个pip包以方便使用。 摘要:Segmenting deep brain structures from magnetic resonance images is important for patient diagnosis, surgical planning, and research. Most current state-of-the-art solutions follow a segmentation-by-registration approach, where subject MRIs are mapped to a template with well-defined segmentations. However, registration-based pipelines are time-consuming, thus, limiting their clinical use. This paper uses deep learning to provide a robust and efficient deep brain segmentation solution. The method consists of a pre-processing step to conform all MRI images to the same orientation, followed by a convolutional neural network using the nnU-Net framework. We use a total of 14 datasets from both research and clinical collections. Of these, seven were used for training and validation and seven were retained for independent testing. We trained the network to segment 30 deep brain structures, as well as a brain mask, using labels generated from a registration-based approach. We evaluated the generalizability of the network by performing a leave-one-dataset-out cross-validation, and extensive testing on external datasets. Furthermore, we assessed cross-domain transportability by evaluating the results separately on different domains. We achieved an average DSC of 0.89 $\pm$ 0.04 on the independent testing datasets when compared to the registration-based gold standard. On our test system, the computation time decreased from 42 minutes for a reference registration-based pipeline to 1 minute. Our proposed method is fast, robust, and generalizes with high reliability. It can be extended to the segmentation of other brain structures. The method is publicly available on GitHub, as well as a pip package for convenient usage.

【12】 Self-Supervised U-Net for Segmenting Flat and Sessile Polyps 标题:用于分割扁平与无蒂息肉的自监督U-Net 链接:https://arxiv.org/abs/2110.08776

作者:Debayan Bhattacharya,Christian Betz,Dennis Eggert,Alexander Schlaefer 机构:Hamburg University of Technology, Hamburg, Germany, Universitätsklinikum Hamburg-Eppendorf, Hamburg, Germany 摘要:结直肠癌(CRC)对公众健康构成巨大威胁,是美国第三大常见癌症病因。结直肠息肉的出现是癌症的早期征兆之一,早期发现并切除息肉可将生存率大幅提高至90%。由于息肉的颜色、形状、大小和外观各不相同,人工检查可能造成漏检。为此,人们提出了计算机辅助诊断系统(CADx),通过处理结肠镜视频来检测息肉。该系统起到二次检查的作用,帮助临床医生减少漏检,以便在息肉癌变之前将其切除。息肉的颜色、形状、大小、质地和外观各不相同,因此尽管CADx解决方案已相当普及,息肉的漏检率仍在6%至27%之间。此外,直径小于10毫米的无蒂和扁平息肉更容易被漏检。卷积神经网络(CNN)在息肉分割中显示出良好的效果,然而这些工作都采用有监督方法,并受到数据集规模的限制。据观察,较小的数据集会降低ResUNet++的分割精度。我们训练一个U-Net来修复图像中随机丢弃的像素,以此作为代理任务。我们用于预训练的数据集是Kvasir-SEG数据集。随后,在有限的Kvasir-Sessile数据集上进行监督训练。实验结果表明,在标注数据集有限而未标注数据集较大的情况下,自监督方法是比全监督方法更好的选择。具体而言,我们的自监督U-Net优于五种在Kvasir-Sessile数据集上以监督方式训练的分割模型。 摘要:Colorectal Cancer(CRC) poses a great risk to public health. It is the third most common cause of cancer in the US. Development of colorectal polyps is one of the earliest signs of cancer. Early detection and resection of polyps can greatly increase survival rate to 90%. Manual inspection can cause misdetections because polyps vary in color, shape, size and appearance. To this end, Computer-Aided Diagnosis systems(CADx) has been proposed that detect polyps by processing the colonoscopic videos. The system acts a secondary check to help clinicians reduce misdetections so that polyps may be resected before they transform to cancer. Polyps vary in color, shape, size, texture and appearance. As a result, the miss rate of polyps is between 6% and 27% despite the prominence of CADx solutions. Furthermore, sessile and flat polyps which have diameter less than 10 mm are more likely to be undetected. Convolutional Neural Networks(CNN) have shown promising results in polyp segmentation. However, all of these works have a supervised approach and are limited by the size of the dataset. It was observed that smaller datasets reduce the segmentation accuracy of ResUNet++. We train a U-Net to inpaint randomly dropped out pixels in the image as a proxy task. The dataset we use for pre-training is Kvasir-SEG dataset. This is followed by a supervised training on the limited Kvasir-Sessile dataset. Our experimental results demonstrate that with limited annotated dataset and a larger unlabeled dataset, self-supervised approach is a better alternative than fully supervised approach. Specifically, our self-supervised U-Net performs better than five segmentation models which were trained in supervised manner on the Kvasir-Sessile dataset.
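摘要中的自监督预训练是"随机丢弃像素并让网络修复"的代理任务;下面给出该预训练损失的极简示意(丢弃比例为假设,占位小网络仅用于演示,实际应为U-Net):

```python
import torch
import torch.nn as nn

def inpainting_pretext_step(model, images, drop_ratio=0.25):
    """随机置零一部分像素, 训练model(如U-Net)重建原图。"""
    mask = (torch.rand_like(images[:, :1]) > drop_ratio).float()   # 1=保留
    corrupted = images * mask
    recon = model(corrupted)
    # 只在被丢弃的位置计算重建误差, 迫使网络利用上下文补全
    loss = ((recon - images) ** 2 * (1 - mask)).mean()
    return loss

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))   # 占位小网络
loss = inpainting_pretext_step(model, torch.randn(4, 3, 64, 64))
loss.backward()
```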

Zero/Few Shot|迁移|域适配|自适应(8篇)

【1】 MEMO: Test Time Robustness via Adaptation and Augmentation 标题:MEMO:通过自适应和增强实现测试时鲁棒性 链接:https://arxiv.org/abs/2110.09506

作者:Marvin Zhang,Sergey Levine,Chelsea Finn 机构: UC Berkeley, Stanford University 摘要:虽然深度神经网络可以在分布内测试点上获得良好的精度,但许多应用要求模型即使面对输入中的意外扰动、域的变化或其他分布偏移来源时也保持鲁棒。我们研究了测试时鲁棒化问题,即利用测试输入来提高模型的鲁棒性。近期的工作已经提出了测试时自适应方法,但它们各自引入了额外的假设(例如需要访问多个测试点),从而阻碍了广泛采用。在这项工作中,我们的目标是研究并设计对模型训练过程不作任何假设、在测试时广泛适用的方法。我们提出了一种简单的方法,可用于模型具有概率输出且可自适应的任何测试场景:给定一个测试样本,对该数据点执行多种数据增强,然后通过最小化模型在这些增强上的平均(即边际)输出分布的熵来调整(全部)模型参数。直观地说,这一目标鼓励模型在不同的增强下做出相同的预测,从而强化这些增强中编码的不变性,同时保持对其预测的信心。在我们的实验中,我们证明了这种方法持续改进了鲁棒的ResNet和vision transformer模型,与标准模型评估相比实现了1-8%的精度增益,并且通常优于先前的增强和自适应策略。对于由图像损坏(ImageNet-C)、常见物体的不同表现形式(ImageNet-R)以及(在ResNet-50模型中)对抗选择的自然样本(ImageNet-A)引起的测试偏移,我们取得了最先进的结果。 摘要:While deep neural networks can attain good accuracy on in-distribution test points, many applications require robustness even in the face of unexpected perturbations in the input, changes in the domain, or other sources of distribution shift. We study the problem of test time robustification, i.e., using the test input to improve model robustness. Recent prior works have proposed methods for test time adaptation, however, they each introduce additional assumptions, such as access to multiple test points, that prevent widespread adoption. In this work, we aim to study and devise methods that make no assumptions about the model training process and are broadly applicable at test time. We propose a simple approach that can be used in any test setting where the model is probabilistic and adaptable: when presented with a test example, perform different data augmentations on the data point, and then adapt (all of) the model parameters by minimizing the entropy of the model's average, or marginal, output distribution across the augmentations. Intuitively, this objective encourages the model to make the same prediction across different augmentations, thus enforcing the invariances encoded in these augmentations, while also maintaining confidence in its predictions. In our experiments, we demonstrate that this approach consistently improves robust ResNet and vision transformer models, achieving accuracy gains of 1-8% over standard model evaluation and also generally outperforming prior augmentation and adaptation strategies. We achieve state-of-the-art results for test shifts caused by image corruptions (ImageNet-C), renditions of common objects (ImageNet-R), and, among ResNet-50 models, adversarially chosen natural examples (ImageNet-A).
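MEMO的核心步骤可以直接按摘要写出:对单个测试样本做多次增强,最小化各增强上平均(边际)输出分布的熵,然后再做预测。下面是忠实于这一描述的PyTorch示意(增强方式、步数与学习率均为假设):

```python
import torch
import torch.nn.functional as F

def memo_adapt(model, x, augment, n_aug=8, lr=1e-3, steps=1):
    """x: (1, C, H, W) 单个测试样本; augment: 随机增强函数。
    最小化各增强下平均输出分布的熵, 再做最终预测。"""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        views = torch.cat([augment(x) for _ in range(n_aug)], dim=0)
        probs = F.softmax(model(views), dim=1)
        marginal = probs.mean(dim=0)                     # 边际输出分布
        loss = -(marginal * marginal.clamp_min(1e-12).log()).sum()  # 熵
        opt.zero_grad(); loss.backward(); opt.step()
    return model(x).argmax(dim=1)

# 用法示意: 实际可用随机裁剪/翻转等增强, 此处以加噪声代替
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3*32*32, 10))
pred = memo_adapt(model, torch.randn(1, 3, 32, 32),
                  augment=lambda t: t + 0.1 * torch.randn_like(t))
```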

【2】 Ortho-Shot: Low Displacement Rank Regularization with Data Augmentation for Few-Shot Learning 标题:Ortho-Shot:基于数据增强的低位移秩正则化少样本学习方法 链接:https://arxiv.org/abs/2110.09374

作者:Uche Osahor,Nasser M. Nasrabadi 机构:West Virginia University 摘要:在少数镜头分类中,主要目标是从少量样本中学习能很好地泛化到新类的表示。在本文中,我们提出了一种称为Ortho-Shot的有效低位移秩(LDR)正则化策略:一种基于双块Toeplitz(DBT)矩阵结构、对少数镜头分类器的卷积层施加正交正则化的技术。经过正则化的卷积层增强了模型泛化能力和类内特征嵌入,这对少数镜头学习至关重要。过拟合是少数镜头模型的典型问题:数据多样性的缺乏妨碍了正常的模型推断,从而削弱了少数镜头学习器对新类的分类精度。对此,我们分解了少数镜头分类器的流程,并确认对支持、查询和任务数据的增强能共同缓解网络中的过拟合。令人信服的结果表明,将基于DBT的低秩正交正则化器与数据增强策略相结合,可以显著提升少数镜头分类器的性能。我们在miniImagenet、CIFAR-FS和斯坦福数据集上进行了实验,与最先进技术相比取得了约5%的性能提升 摘要:In few-shot classification, the primary goal is to learn representations from a few samples that generalize well for novel classes. In this paper, we propose an efficient low displacement rank (LDR) regularization strategy termed Ortho-Shot; a technique that imposes orthogonal regularization on the convolutional layers of a few-shot classifier, which is based on the doubly-block toeplitz (DBT) matrix structure. The regularized convolutional layers of the few-shot classifier enhances model generalization and intra-class feature embeddings that are crucial for few-shot learning. Overfitting is a typical issue for few-shot models, the lack of data diversity inhibits proper model inference which weakens the classification accuracy of few-shot learners to novel classes. In this regard, we broke down the pipeline of the few-shot classifier and established that the support, query and task data augmentation collectively alleviates overfitting in networks. With compelling results, we demonstrated that combining a DBT-based low-rank orthogonal regularizer with data augmentation strategies, significantly boosts the performance of a few-shot classifier. We perform our experiments on the miniImagenet, CIFAR-FS and Stanford datasets with performance values of about 5\% when compared to state-of-the-art
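论文的正则化器基于双块Toeplitz(DBT)结构;作为参照,下面给出对卷积层施加"软正交"惩罚的通用极简示意。这只是同类正交正则化的一个常见写法(惩罚W Wᵀ与单位阵的偏差),并非论文的DBT构造,正则权重为假设值:

```python
import torch
import torch.nn as nn

def soft_orthogonality(module):
    """对模块内所有卷积核W(展平为二维)累加 ||W Wᵀ - I||_F² 惩罚项。"""
    penalty = 0.0
    for m in module.modules():
        if isinstance(m, nn.Conv2d):
            w = m.weight.reshape(m.weight.size(0), -1)   # (out, in*k*k)
            gram = w @ w.t()
            eye = torch.eye(gram.size(0), device=w.device)
            penalty = penalty + ((gram - eye) ** 2).sum()
    return penalty

net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
task_loss = torch.tensor(0.0)                         # 占位的任务损失
total = task_loss + 1e-4 * soft_orthogonality(net)    # 1e-4为假设的正则权重
total.backward()
```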

【3】 InfAnFace: Bridging the infant-adult domain gap in facial landmark estimation in the wild 标题:InfAnFace:弥合真实场景面部标志估计中婴儿与成人的领域差距 链接:https://arxiv.org/abs/2110.08935

作者:M. Wan,S. Zhu,P. Gulati,L. Luan,X. Huang,R. Schwartz-Mette,M. Hayes,E. Zimmerman,S. Ostadabbas 摘要:将算法化的面部标志估计应用于婴幼儿发育障碍及其他疾病的早期预测具有很大潜力。然而,这些深度学习算法的性能受到婴儿数据稀缺的严重制约。为了推动婴儿面部标志系统的发展,我们引入了InfAnFace:一个多样化、注释丰富的婴儿面部数据集。我们使用InfAnFace对现有的、在成人面部上训练的面部标志估计算法进行基准测试,并证明这些算法应用于婴儿面部与成人面部时学到的表示之间存在显著的领域差距。最后,我们提出了弥合这一差距的下一步可能方向。 摘要:There is promising potential in the application of algorithmic facial landmark estimation to the early prediction, in infants, of pediatric developmental disorders and other conditions. However, the performance of these deep learning algorithms is severely hampered by the scarcity of infant data. To spur the development of facial landmarking systems for infants, we introduce InfAnFace, a diverse, richly-annotated dataset of infant faces. We use InfAnFace to benchmark the performance of existing facial landmark estimation algorithms that are trained on adult faces and demonstrate there is a significant domain gap between the representations learned by these algorithms when applied on infant vs. adult faces. Finally, we put forward the next potential steps to bridge that gap.

【4】 Dataset Knowledge Transfer for Class-Incremental Learning without Memory 标题:用于无记忆类增量学习的数据集知识迁移 链接:https://arxiv.org/abs/2110.08421

作者:Habib Slim,Eden Belouadah,Adrian Popescu,Darian Onchis 机构: Université Paris-Saclay, CEA, List, F-, Palaiseau, France, IMT Atlantique, Lab-STICC, team RAMBO, UMR CNRS , F-, Brest, France, West University of Timisoara, Timisoara, Romania 备注:Accepted to WACV 2022 摘要:增量学习使人工智能体能够从连续数据中学习。尽管利用深度神经网络已取得重要进展,增量学习仍然非常具有挑战性,尤其是在不允许保留过去数据的记忆、灾难性遗忘产生强烈负面影响的情况下。我们通过改造预测偏差校正来解决无记忆的类增量学习问题;这种方法使过去类和新类的预测更具可比性。该技术最初是在允许使用记忆的设定下提出的,由于需要过去类的样本,无记忆时无法直接使用。我们引入了一个两步学习过程,允许在参考数据集和目标数据集之间迁移偏差校正参数:偏差校正首先在具有关联验证记忆的参考数据集上离线优化,然后将获得的校正参数迁移到没有可用记忆的目标数据集。第二个贡献是引入更精细的偏差校正建模,即按每个增量状态学习其参数,而不是通常的"过去类对新类"建模。所提出的数据集知识迁移适用于任何无记忆的增量方法。我们通过将其应用于四种现有方法来检验其有效性。使用四个目标数据集和不同配置的评估显示出一致的改进,且几乎没有计算和内存开销。 摘要:Incremental learning enables artificial agents to learn from sequential data. While important progress was made by exploiting deep neural networks, incremental learning remains very challenging. This is particularly the case when no memory of past data is allowed and catastrophic forgetting has a strong negative effect. We tackle class-incremental learning without memory by adapting prediction bias correction, a method which makes predictions of past and new classes more comparable. It was proposed when a memory is allowed and cannot be directly used without memory, since samples of past classes are required. We introduce a two-step learning process which allows the transfer of bias correction parameters between reference and target datasets. Bias correction is first optimized offline on reference datasets which have an associated validation memory. The obtained correction parameters are then transferred to target datasets, for which no memory is available. The second contribution is to introduce a finer modeling of bias correction by learning its parameters per incremental state instead of the usual past vs. new class modeling. The proposed dataset knowledge transfer is applicable to any incremental method which works without memory. We test its effectiveness by applying it to four existing methods. Evaluation with four target datasets and different configurations shows consistent improvement, with practically no computational and memory overhead.
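预测偏差校正的一种常见形式,是对logits施加可学习的仿射变换使新旧类可比;下面按本文"每个增量状态学习一组参数"的思路给出示意(参数化与类到状态的映射均为假设):

```python
import torch
import torch.nn as nn

class BiasCorrection(nn.Module):
    """对每个增量状态的logits施加可学习的仿射校正 alpha*z + beta。"""
    def __init__(self, n_states):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(n_states))
        self.beta = nn.Parameter(torch.zeros(n_states))

    def forward(self, logits, state_of_class):
        # logits: (B, K); state_of_class: (K,) 每个类别所属的增量状态编号
        a = self.alpha[state_of_class]           # (K,)
        b = self.beta[state_of_class]
        return logits * a + b

K, n_states = 10, 3
bc = BiasCorrection(n_states)
state_of_class = torch.tensor([0]*4 + [1]*3 + [2]*3)   # 假设的类→状态映射
corrected = bc(torch.randn(2, K), state_of_class)       # 校正后的可比logits
```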

【5】 Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks 标题:注意差距:生成对抗网络单样本域自适应的域间隙控制 链接:https://arxiv.org/abs/2110.08398

作者:Peihao Zhu,Rameen Abdal,John Femiani,Peter Wonka 机构:KAUST, Miami University 备注:Video: this https URL 摘要:我们提出了一种新的单样本域自适应方法。我们方法的输入是一个可以生成域A图像的已训练GAN,以及一张来自域B的单个参考图像I_B。所提出的算法可以将该GAN的任何输出从域A转换到域B。与当前技术相比,我们的方法有两个主要优点:第一,我们的解决方案实现了更高的视觉质量,例如显著减少了过拟合;第二,我们的解决方案在控制域间隙方面允许更多自由度,即可以选择用图像I_B的哪些方面来定义域B。从技术上讲,我们以预训练的StyleGAN生成器作为GAN、以预训练的CLIP模型表示域间隙,在此基础上实现新方法。我们提出了几个新的正则化器来控制域间隙,用以优化预训练StyleGAN生成器的权重,使其输出域B而非域A中的图像。这些正则化器防止优化过程吸收单个参考图像的过多属性。我们的结果显示,与最先进的技术相比视觉效果有显著改善,并通过多个应用展示了改进的可控性。 摘要:We present a new method for one shot domain adaptation. The input to our method is trained GAN that can produce images in domain A and a single reference image I_B from domain B. The proposed algorithm can translate any output of the trained GAN from domain A to domain B. There are two main advantages of our method compared to the current state of the art: First, our solution achieves higher visual quality, e.g. by noticeably reducing overfitting. Second, our solution allows for more degrees of freedom to control the domain gap, i.e. what aspects of image I_B are used to define the domain B. Technically, we realize the new method by building on a pre-trained StyleGAN generator as GAN and a pre-trained CLIP model for representing the domain gap. We propose several new regularizers for controlling the domain gap to optimize the weights of the pre-trained StyleGAN generator to output images in domain B instead of domain A. The regularizers prevent the optimization from taking on too many attributes of the single reference image. Our results show significant visual improvements over the state of the art as well as multiple applications that highlight improved control.

【6】 Incremental Cross-Domain Adaptation for Robust Retinopathy Screening via Bayesian Deep Learning 标题:基于贝叶斯深度学习的增量式跨域适应鲁棒视网膜病变筛查 链接:https://arxiv.org/abs/2110.09319

作者:Taimur Hassan,Bilal Hassan,Muhammad Usman Akram,Shahrukh Hashmi,Abdel Hakim Taguri,Naoufel Werghi 机构: Department of Electrical Engineering and Computer Sciences, Khalifa University, Akram are with the Department of Computer and Software Engineering, National University of Sciences and Technology 备注:Accepted in IEEE Transactions on Instrumentation and Measurement. Source code is available at this https URL 摘要:视网膜病变是一组视网膜疾病,如果不及时治疗,可能导致严重的视力损害甚至失明。许多研究人员已经开发出通过眼底和光学相干断层扫描(OCT)图像识别视网膜病变的自主系统。然而,这些框架大多采用传统的迁移学习和微调方法,需要大量注释良好的训练数据才能产生准确的诊断性能。本文提出了一种新的增量式跨域自适应工具,允许任何深度分类模型通过少样本训练逐步学习OCT和眼底图像中的异常视网膜病变。此外,与竞争方法不同,所提出的工具由贝叶斯多目标函数驱动:该函数不仅强制候选分类网络在增量训练期间保留其先前学习的知识,还确保网络理解先前学习的病变与新添加的疾病类别之间的结构和语义关系,从而在推理阶段有效识别它们。所提出的框架在使用三种不同扫描仪获取的六个公共数据集上进行评估,用于筛查13种视网膜病变,其总体准确率和F1分数分别达到0.9826和0.9846,优于最先进的竞争方法。 摘要:Retinopathy represents a group of retinal diseases that, if not treated timely, can cause severe visual impairments or even blindness. Many researchers have developed autonomous systems to recognize retinopathy via fundus and optical coherence tomography (OCT) imagery. However, most of these frameworks employ conventional transfer learning and fine-tuning approaches, requiring a decent amount of well-annotated training data to produce accurate diagnostic performance. This paper presents a novel incremental cross-domain adaptation instrument that allows any deep classification model to progressively learn abnormal retinal pathologies in OCT and fundus imagery via few-shot training. Furthermore, unlike its competitors, the proposed instrument is driven via a Bayesian multi-objective function that not only enforces the candidate classification network to retain its prior learned knowledge during incremental training but also ensures that the network understands the structural and semantic relationships between previously learned pathologies and newly added disease categories to effectively recognize them at the inference stage. The proposed framework, evaluated on six public datasets acquired with three different scanners to screen thirteen retinal pathologies, outperforms the state-of-the-art competitors by achieving an overall accuracy and F1 score of 0.9826 and 0.9846, respectively.

【7】 A MIMO Radar-based Few-Shot Learning Approach for Human-ID 标题:一种基于MIMO雷达的身份识别Few-Shot学习方法 链接:https://arxiv.org/abs/2110.08595

作者:Pascal Weller,Fady Aziz,Sherif Abdulatif,Urs Schneider,Marco F. Huber 机构:Department of Bio-mechatronic Systems, Fraunhofer IPA, Stuttgart, Germany, Institute of Signal Processing and System Theory, University of Stuttgart, Germany, Institute for Industrial Manufacturing and Management, University of Stuttgart, Germany 备注:5 pages, 6 figures, 2 tables 摘要:基于雷达与深度学习的人体身份识别已成为日益受关注的研究领域。研究表明,微多普勒(\(\upmu\)-D)可以通过捕捉周期性肢体微动来反映行走行为。其中一个主要问题是在满足实时性和训练数据集规模限制的同时,最大化可容纳的类别数。本文采用多输入多输出(MIMO)雷达来构造仰角角速度(\(\upmu\)-\(\omega\))的微动谱图,并研究了将这种新构造的谱图与常用的\(\upmu\)-D谱图进行拼接的有效性。为了适应非约束的真实步行运动,采用了自适应周期分割框架,并在半步态周期(约0.5 s)上训练度量学习网络。研究了不同类别数(5-20)、不同数据集大小和不同观察时间窗口(1-2 s)的影响。收集了22名受试者的非约束步行数据集,涵盖相对于雷达的不同视角。所提出的少样本学习(FSL)方法在每名受试者仅2分钟训练数据的情况下实现了11.3%的分类误差。 摘要:Radar for deep learning-based human identification has become a research area of increasing interest. It has been shown that micro-Doppler (\(\upmu\)-D) can reflect the walking behavior through capturing the periodic limbs' micro-motions. One of the main aspects is maximizing the number of included classes while considering the real-time and training dataset size constraints. In this paper, a multiple-input-multiple-output (MIMO) radar is used to formulate micro-motion spectrograms of the elevation angular velocity (\(\upmu\)-\(\omega\)). The effectiveness of concatenating this newly-formulated spectrogram with the commonly used \(\upmu\)-D is investigated. To accommodate for non-constrained real walking motion, an adaptive cycle segmentation framework is utilized and a metric learning network is trained on half gait cycles (\(\approx\) 0.5 s). Studies on the effects of various numbers of classes (5--20), different dataset sizes, and varying observation time windows 1--2 s are conducted. A non-constrained walking dataset of 22 subjects is collected with different aspect angles with respect to the radar. The proposed few-shot learning (FSL) approach achieves a classification error of 11.3 % with only 2 min of training data per subject.

【8】 Locally Adaptive Structure and Texture Similarity for Image Quality Assessment 标题:基于局部自适应结构和纹理相似性的图像质量评价 链接:https://arxiv.org/abs/2110.08521

作者:Keyan Ding,Yi Liu,Xueyi Zou,Shiqi Wang,Kede Ma 机构:City University of Hong Kong, Hong Kong, Noah’s Ark Lab, Huawei Technologies, Shenzhen, China 备注:None 摘要:全参考图像质量评估(IQA)的最新进展涉及基于深度表示来统一结构与纹理相似性。然而,由此产生的深度图像结构和纹理相似性(DISTS)度量进行的是较为全局的质量测量,忽略了自然摄影图像的结构和纹理在空间和尺度上都是局部化的这一事实。在本文中,我们描述了一种用于全参考IQA的局部自适应结构和纹理相似性指数,称为A-DISTS。具体地说,我们依靠单一统计特征,即离散指数(dispersion index),在不同尺度上定位纹理区域。估计出的概率(即某图像块为纹理的概率)进而用于自适应地汇聚局部结构和纹理测量。由此产生的A-DISTS自适应于局部图像内容,且无需昂贵的人类感知分数进行监督训练。我们在十个IQA数据库上展示了A-DISTS与人类数据的相关性优势,以及其在优化单图像超分辨率方法方面的优势。 摘要:The latest advances in full-reference image quality assessment (IQA) involve unifying structure and texture similarity based on deep representations. The resulting Deep Image Structure and Texture Similarity (DISTS) metric, however, makes rather global quality measurements, ignoring the fact that natural photographic images are locally structured and textured across space and scale. In this paper, we describe a locally adaptive structure and texture similarity index for full-reference IQA, which we term A-DISTS. Specifically, we rely on a single statistical feature, namely the dispersion index, to localize texture regions at different scales. The estimated probability (of one patch being texture) is in turn used to adaptively pool local structure and texture measurements. The resulting A-DISTS is adapted to local image content, and is free of expensive human perceptual scores for supervised training. We demonstrate the advantages of A-DISTS in terms of correlation with human data on ten IQA databases and optimization of single image super-resolution methods.
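离散指数即方差均值比,可在局部窗口上区分纹理(高指数)与结构(低指数)区域;下面是按摘要思路的NumPy示意(窗口大小与归一化方式为假设):

```python
import numpy as np

def dispersion_index_map(gray, win=8):
    """gray: (H, W) 灰度图。对每个不重叠窗口计算方差/均值(离散指数)。
    指数越大, 该窗口越可能是纹理区域。"""
    h, w = gray.shape
    di = np.zeros((h // win, w // win))
    for i in range(h // win):
        for j in range(w // win):
            patch = gray[i*win:(i+1)*win, j*win:(j+1)*win]
            di[i, j] = patch.var() / (patch.mean() + 1e-8)
    return di

gray = np.random.rand(64, 64)
di = dispersion_index_map(gray)
texture_prob = di / (di.max() + 1e-8)   # 粗略归一化为"该块是纹理"的概率
```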

半弱无监督|主动学习|不确定性(13篇)

【1】 Unsupervised Finetuning 标题:无监督微调 链接:https://arxiv.org/abs/2110.09510

作者:Suichan Li,Dongdong Chen,Yinpeng Chen,Lu Yuan,Lei Zhang,Qi Chu,Bin Liu,Nenghai Yu 机构:University of Science and Technology of China, Microsoft Cloud & AI 摘要:本文研究“无监督微调”,即著名的“有监督微调”的对称问题。在给定预训练模型和小规模未标记目标数据的情况下,无监督微调是将预训练得到的表示从源域调整到目标域,从而获得更好的迁移性能。这个问题比有监督的对应问题更具挑战性,因为小规模目标数据中的低数据密度不利于无监督学习,会导致预训练表示受到破坏并在目标域中得到较差的表示。在本文中,我们发现在将微调范式从有监督转向无监督时,源数据至关重要,并提出了两种将源数据和目标数据结合进无监督微调的简单有效策略:“稀疏源数据重放”和“数据混合”。前一种策略的动机是加回一小部分源数据以占据其预训练的表示空间,并帮助将目标数据推入更小的紧凑空间;后一种策略的动机是增加数据密度,帮助学习更紧凑的表示。为了证明我们提出的“无监督微调”策略的有效性,我们在多个不同的目标数据集上进行了大量实验,结果表明该策略比朴素策略具有更好的迁移性能。 摘要:This paper studies "unsupervised finetuning", the symmetrical problem of the well-known "supervised finetuning". Given a pretrained model and small-scale unlabeled target data, unsupervised finetuning is to adapt the representation pretrained from the source domain to the target domain so that better transfer performance can be obtained. This problem is more challenging than the supervised counterpart, as the low data density in the small-scale target data is not friendly for unsupervised learning, leading to the damage of the pretrained representation and poor representation in the target domain. In this paper, we find the source data is crucial when shifting the finetuning paradigm from supervise to unsupervise, and propose two simple and effective strategies to combine source and target data into unsupervised finetuning: "sparse source data replaying", and "data mixing". The motivation of the former strategy is to add a small portion of source data back to occupy their pretrained representation space and help push the target data to reside in a smaller compact space; and the motivation of the latter strategy is to increase the data density and help learn more compact representation. To demonstrate the effectiveness of our proposed ``unsupervised finetuning'' strategy, we conduct extensive experiments on multiple different target datasets, which show better transfer performance than the naive strategy.
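"稀疏源数据重放"与"数据混合"都可以在构造训练批次时实现;下面给出一个极简示意。其中的混合方式借用了常见的mixup形式,重放比例与Beta系数均为假设,并非论文的确切做法:

```python
import torch

def build_finetune_batch(target_x, source_pool, replay_ratio=0.1, alpha=0.5):
    """target_x: (B, ...) 未标记的目标域批次; source_pool: (M, ...) 源域样本池。
    策略一: 重放少量源样本占据其表示空间; 策略二: 对目标样本做凸组合以提高数据密度。"""
    b = target_x.size(0)
    n_src = max(1, int(b * replay_ratio))
    idx = torch.randint(0, source_pool.size(0), (n_src,))
    replay = source_pool[idx]                               # 稀疏源数据重放
    perm = torch.randperm(b)
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * target_x + (1 - lam) * target_x[perm]     # 目标域内数据混合
    return torch.cat([mixed, replay], dim=0)

batch = build_finetune_batch(torch.randn(32, 3, 32, 32),
                             torch.randn(1000, 3, 32, 32))
print(batch.shape)   # torch.Size([35, 3, 32, 32])
```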

【2】 Unsupervised Image Fusion Using Deep Image Priors 标题:基于深度图像先验的无监督图像融合 链接:https://arxiv.org/abs/2110.09490

作者:Xudong Ma,Alin Achim,Paul Hill 机构:Department of Electrical and Electronic Engineering, University of Bristol, Bristol BS,UB, UK 摘要:最近,大量研究人员将深度学习方法应用于图像融合。然而,这些工作大多要么需要大量训练数据,要么依赖预训练的模型或框架,从而不可避免地遇到训练数据不足或框架与实际问题不匹配的情况。最近发表的深度图像先验(DIP)方法使完全不用训练数据进行图像恢复成为可能。然而,DIP的原始设计很难推广到多图像处理问题。本文在DIP框架下引入了一种新的损失计算结构,将图像融合表述为一个反问题,这使得DIP能够扩展到一般的多传感器/多焦点图像融合问题。其次,我们提出了一种多通道方法来改善DIP的效果。最后,使用几种常用的图像融合评估指标进行评估,并将结果与最新的传统和深度学习图像融合方法进行比较。我们的方法在一系列指标上优于以往技术;特别是在应用于医学图像时,它在大多数指标上给出了最佳的客观结果。 摘要:A significant number of researchers have recently applied deep learning methods to image fusion. However, most of these works either require a large amount of training data or depend on pre-trained models or frameworks. This inevitably encounters a shortage of training data or a mismatch between the framework and the actual problem. Recently, the publication of Deep Image Prior (DIP) method made it possible to do image restoration totally training-data-free. However, the original design of DIP is hard to be generalized to multi-image processing problems. This paper introduces a novel loss calculation structure, in the framework of DIP, while formulating image fusion as an inverse problem. This enables the extension of DIP to general multisensor/multifocus image fusion problems. Secondly, we propose a multi-channel approach to improve the effect of DIP. Finally, an evaluation is conducted using several commonly used image fusion assessment metrics. The results are compared with state-of-the-art traditional and deep learning image fusion methods. Our method outperforms previous techniques for a range of metrics. In particular, it is shown to provide the best objective results for most metrics when applied to medical images.

【3】 Self-Supervised Monocular Depth Estimation with Internal Feature Fusion 标题:基于内部特征融合的自监督单目深度估计 链接:https://arxiv.org/abs/2110.09482

作者:Hang Zhou,David Greenwood,Sarah Taylor 机构:School of Computing Sciences, University of East Anglia, Norwich, UK 摘要:用于深度估计的自监督学习使用图像序列中的几何信息进行监督,并显示出有希望的结果。与许多计算机视觉任务一样,深度网络的性能取决于从图像中学习准确的空间和语义表示的能力。因此,利用语义分割网络进行深度估计是很自然的。在这项工作中,基于发展成熟的语义分割网络HRNet,我们提出了一种新的深度估计网络DIFFNet,它可以在下采样和上采样过程中利用语义信息。通过应用特征融合和注意力机制,我们提出的方法在KITTI基准上优于最新的单目深度估计方法。我们的方法在更高分辨率的训练数据上也显示出更大的潜力。我们还提出了一种扩展的评估策略:从标准基准中依经验选取具有挑战性的案例,构建一个新的测试集。 摘要:Self-supervised learning for depth estimation uses geometry in image sequences for supervision and shows promising results. Like many computer vision tasks, depth network performance is determined by the capability to learn accurate spatial and semantic representations from images. Therefore, it is natural to exploit semantic segmentation networks for depth estimation. In this work, based on a well-developed semantic segmentation network HRNet, we propose a novel depth estimation network DIFFNet, which can make use of semantic information in down and upsampling procedures. By applying feature fusion and an attention mechanism, our proposed method outperforms the state-of-the-art monocular depth estimation methods on the KITTI benchmark. Our method also demonstrates greater potential on higher resolution training data. We propose an additional extended evaluation strategy by establishing a test set of challenging cases, empirically derived from the standard benchmark.

【4】 Streaming Machine Learning and Online Active Learning for Automated Visual Inspection 标题:自动视觉检测中的流式机器学习和在线主动学习 链接:https://arxiv.org/abs/2110.09396

作者:Jože M. Rožanec,Elena Trajkova,Paulien Dam,Blaž Fortuna,Dunja Mladenić 机构:Jožef Stefan International, Postgraduate School, Ljubljana, Slovenia, University of Ljubljana, Electrical Engineering, Philips Consumer Lifestyle BV, Drachten, The Netherlands, Qlector d.o.o., Jožef Stefan Institute 摘要:质量控制是制造公司验证产品是否符合要求和规范的关键活动。标准化的质量控制确保所有产品在相同的标准下进行评估。传感器和连接成本的降低使得制造业的数字化程度不断提高,并提供了更高的数据可用性。这样的数据可用性刺激了人工智能模型的发展,使得在检查产品时能够实现更高程度的自动化并减少偏差。此外,检查速度的提高降低了缺陷检查所需的总体成本和时间。在本研究中,我们将五种用于视觉缺陷检测的流式机器学习算法与飞利浦消费者生活方式有限公司提供的真实数据进行了比较。此外,我们在流式主动学习环境中对它们进行了比较,这减少了真实环境中的数据标记工作。我们的结果表明,在最坏的情况下,主动学习平均减少了15%的数据标记工作,同时保持了可接受的分类性能。使用机器学习模型进行自动目视检查,预计将使质量检查速度提高40%。 摘要:Quality control is a key activity performed by manufacturing companies to verify product conformance to the requirements and specifications. Standardized quality control ensures that all the products are evaluated under the same criteria. The decreased cost of sensors and connectivity enabled an increasing digitalization of manufacturing and provided greater data availability. Such data availability has spurred the development of artificial intelligence models, which allow higher degrees of automation and reduced bias when inspecting the products. Furthermore, the increased speed of inspection reduces overall costs and time required for defect inspection. In this research, we compare five streaming machine learning algorithms applied to visual defect inspection with real-world data provided by Philips Consumer Lifestyle BV. Furthermore, we compare them in a streaming active learning context, which reduces the data labeling effort in a real-world context. Our results show that active learning reduces the data labeling effort by almost 15% on average for the worst case, while keeping an acceptable classification performance. The use of machine learning models for automated visual inspection are expected to speed up the quality inspection up to 40%.
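流式主动学习的核心循环是:模型逐样本预测,仅当预测不确定时才请求标注并在线更新,从而节省标注开销。下面给出一个与具体库无关的极简示意(在线逻辑回归仅为占位模型,不确定性阈值为假设值,并非论文比较的五种算法之一):

```python
import numpy as np

class OnlineLogReg:
    """极简在线逻辑回归, 用SGD逐样本更新(仅作占位)。"""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim); self.lr = lr
    def predict_proba(self, x):
        return 1.0 / (1.0 + np.exp(-self.w @ x))
    def update(self, x, y):
        self.w += self.lr * (y - self.predict_proba(x)) * x

def run_stream(model, stream, threshold=0.6):
    """只有当预测不确定(概率接近0.5)时才请求标注并在线更新。"""
    labeled = 0
    for x, y in stream:
        p = model.predict_proba(x)
        if 1.0 - abs(p - 0.5) * 2 > threshold:   # 不确定性 > 阈值(假设值)
            model.update(x, y); labeled += 1     # 确定的样本不消耗标注预算
    return labeled

rng = np.random.default_rng(0)
data = [(rng.normal(size=5), rng.integers(0, 2)) for _ in range(1000)]
print(run_stream(OnlineLogReg(5), data), "个样本被请求标注")
```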

【5】 Learning multiplane images from single views with self-supervision 标题:基于自监督的单视图多平面图像学习 链接:https://arxiv.org/abs/2110.09380

作者:Gustavo Sutter P. Carvalho,Diogo C. Luvizon,Antonio Joia,Andre G. C. Pacheco,Otavio A. B. Penatti 机构:André G. C. Pacheco, Otávio A. B. Penatti, AI R&D Lab, Samsung R&D Institute Brazil, Campinas, SP,-, Brazil 备注:To appear on BMVC 2021 摘要:从已经捕获的图像生成静态新视图是计算机视觉和图形学中的一项艰巨任务,特别是当单个输入图像具有动态部分(如人或运动对象)时。在本文中,我们提出了一个新的框架CycleMPI来解决这个问题,该框架能够通过自我监督的循环训练策略从单个图像学习多平面图像表示。我们的框架不需要立体数据进行训练,因此它可以使用来自互联网的大量视觉数据进行训练,即使在非常具有挑战性的情况下也能产生更好的泛化能力。虽然我们的方法不需要立体数据进行监督,但它在立体数据集上的结果与Zero-Shot场景中的最新水平相当。我们在RealEstate10K和人体模型挑战数据集上评估了我们的方法,用于视图合成,并在Places II数据集上给出了定性结果。 摘要:Generating static novel views from an already captured image is a hard task in computer vision and graphics, in particular when the single input image has dynamic parts such as persons or moving objects. In this paper, we tackle this problem by proposing a new framework, called CycleMPI, that is capable of learning a multiplane image representation from single images through a cyclic training strategy for self-supervision. Our framework does not require stereo data for training, therefore it can be trained with massive visual data from the Internet, resulting in a better generalization capability even for very challenging cases. Although our method does not require stereo data for supervision, it reaches results on stereo datasets comparable to the state of the art in a zero-shot scenario. We evaluated our method on RealEstate10K and Mannequin Challenge datasets for view synthesis and presented qualitative results on Places II dataset.

【6】 Understanding Dimensional Collapse in Contrastive Self-supervised Learning 标题:理解对比性自我监督学习中的维度塌陷 链接:https://arxiv.org/abs/2110.09348

作者:Li Jing,Pascal Vincent,Yann LeCun,Yuandong Tian 机构:Facebook AI Research 备注:15 pages, 10 figures 摘要:自监督视觉表示学习旨在不依赖人类注释而学习有用的表示。联合嵌入方法基于最大化同一图像不同视图的嵌入向量之间的一致性。人们已经提出了各种方法来解决所有嵌入向量塌陷为平凡常数解的塌陷问题。在这些方法中,对比学习通过负样本对防止塌陷。已有研究表明,非对比方法会遭受另一种性质的、程度较轻的塌陷问题:维度塌陷,即嵌入向量最终只张成整个可用嵌入空间中的一个低维子空间。在这里,我们表明维度塌陷同样发生在对比学习中。在这篇论文中,我们揭示了对比学习中导致维度塌陷的动态机制。受我们理论的启发,我们提出了一种新的对比学习方法,称为DirectCLR,它直接优化表示空间,而不依赖可训练的投影头。实验表明,在ImageNet上,DirectCLR的性能优于带可训练线性投影头的SimCLR。 摘要:Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. Joint embedding approach bases on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.
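DirectCLR的关键改动是去掉可训练投影头,直接在骨干表示的一个子向量上施加对比损失;下面是该思想的极简示意(子向量维度d0与温度为假设值):

```python
import torch
import torch.nn.functional as F

def directclr_loss(z1, z2, d0=64, tau=0.1):
    """z1, z2: (N, D) 同一批图像两个视图的骨干网络表示(无投影头)。
    只取前d0维子向量计算对称的InfoNCE损失。"""
    a = F.normalize(z1[:, :d0], dim=1)
    b = F.normalize(z2[:, :d0], dim=1)
    logits = a @ b.t() / tau                 # (N, N), 对角线为正样本对
    labels = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

z1, z2 = torch.randn(16, 512), torch.randn(16, 512)
loss = directclr_loss(z1, z2)   # 其余维度不受损失直接约束
```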

【7】 Self-Supervised Representation Learning: Introduction, Advances and Challenges 标题:自我监督表征学习:介绍、进展与挑战 链接:https://arxiv.org/abs/2110.09327

作者:Linus Ericsson,Henry Gouk,Chen Change Loy,Timothy M. Hospedales 摘要:自监督表示学习方法旨在提供强大的深度特征学习,而无需大型标注数据集,从而缓解标注瓶颈,这是当前深度学习实际应用的主要障碍之一。近年来,这些方法发展迅速,其效果已接近、有时甚至超过图像、视频、声音、文本和图等多种数据模态下的全监督预训练方案。本文介绍了这一充满活力的领域,包括关键概念、四大方法家族和相关的最新技术,以及如何将自监督方法应用于各种数据模态。我们进一步讨论实际考虑因素,包括工作流、表示可迁移性和计算成本。最后,我们调查了该领域中为未来工作提供肥沃土壤的主要公开挑战。 摘要:Self-supervised representation learning methods aim to provide powerful deep feature learning without the requirement of large annotated datasets, thus alleviating the annotation bottleneck that is one of the main barriers to practical deployment of deep learning today. These methods have advanced rapidly in recent years, with their efficacy approaching and sometimes surpassing fully supervised pre-training alternatives across a variety of data modalities including image, video, sound, text and graphs. This article introduces this vibrant area including key concepts, the four main families of approach and associated state of the art, and how self-supervised methods are applied to diverse modalities of data. We further discuss practical considerations including workflows, representation transferability, and compute cost. Finally, we survey the major open challenges in the field that provide fertile ground for future work.

【8】 Graph Convolution Neural Network For Weakly Supervised Abnormality Localization In Long Capsule Endoscopy Videos 标题:基于图卷积神经网络的长胶囊内窥镜视频弱监督异常定位 链接:https://arxiv.org/abs/2110.09110

作者:Sodiq Adewole,Philip Fernandes,James Jablonski,Andrew Copland,Michael Porter,Sana Syed,Donald Brown 机构:∗ Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA, USA, ∗ Department of Pediatrics, School of Medicine, University of Virginia, Charlottesville, VA, USA 摘要:长视频中的时间活动定位是一个重要的问题。获取长无线胶囊内窥镜(WCE)视频帧级标签的成本过高。在本文中,我们提出了一种仅使用弱视频级标签的长WCE视频端到端时间异常定位方法。内科医生使用胶囊内镜(CE)作为非手术和非侵入性方法检查整个消化道,以诊断疾病或异常。虽然CE彻底改变了传统的内窥镜检查程序,但一次CE检查可以持续8小时,产生多达100000帧图像。医生必须逐帧检查整个视频,以识别捕获相关异常的帧,而这有时可能只有一帧。考虑到这种高度冗余,分析长CE视频可能非常繁琐、耗时,而且容易出错。本文提出了一种新的多步定位方法,仅使用弱视频标签,对长视频中捕获感兴趣异常的目标帧进行端到端定位。首先,我们开发了一种使用变化点检测技术的自动时间分割方法,将视频在时间上分割成均匀、同质且可识别的片段。然后,我们使用图卷积神经网络(GCNN)来学习每个视频片段的表示。利用弱视频片段标签,我们训练GCNN模型,使其在视频片段至少包含一个异常帧时将该片段识别为异常。最后,利用训练好的GCNN模型的参数,我们将网络的最后一层替换为时间池化层,以定位每个异常视频片段中的相关异常帧。我们的方法在图分类任务上的准确率为89.9%,在异常帧定位任务上的特异性为97.5%。 摘要:Temporal activity localization in long videos is an important problem. The cost of obtaining frame level label for long Wireless Capsule Endoscopy (WCE) videos is prohibitive. In this paper, we propose an end-to-end temporal abnormality localization for long WCE videos using only weak video level labels. Physicians use Capsule Endoscopy (CE) as a non-surgical and non-invasive method to examine the entire digestive tract in order to diagnose diseases or abnormalities. While CE has revolutionized traditional endoscopy procedures, a single CE examination could last up to 8 hours generating as much as 100,000 frames. Physicians must review the entire video, frame-by-frame, in order to identify the frames capturing relevant abnormality. This, sometimes could be as few as just a single frame. Given this very high level of redundancy, analyzing long CE videos can be very tedious, time consuming and also error prone. This paper presents a novel multi-step method for an end-to-end localization of target frames capturing abnormalities of interest in the long video using only weak video labels. First we developed an automatic temporal segmentation using change point detection technique to temporally segment the video into uniform, homogeneous and identifiable segments. Then we employed Graph Convolutional Neural Network (GCNN) to learn a representation of each video segment. Using weak video segment labels, we trained our GCNN model to recognize each video segment as abnormal if it contains at least a single abnormal frame. Finally, leveraging the parameters of the trained GCNN model, we replaced the final layer of the network with a temporal pool layer to localize the relevant abnormal frames within each abnormal video segment. Our method achieved an accuracy of 89.9\% on the graph classification task and a specificity of 97.5\% on the abnormal frames localization task.
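摘要中"片段含至少一个异常帧即判为异常"的训练方式,本质上是多示例学习(MIL)的经典套路。下面用 PyTorch 给出一个极简示意:帧级打分经时间维 max 池化得到片段级预测,用片段级弱标签训练,推理时以帧级分数定位异常帧。特征维度与打分网络均为占位假设,并非论文的 GCNN 实现。

```python
# 示意:弱标签 + 时间 max 池化的 MIL 定位套路,非论文原实现。
import torch
import torch.nn as nn

class SegmentMIL(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.frame_scorer = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, frames):                 # frames: (T, feat_dim)
        frame_logits = self.frame_scorer(frames).squeeze(-1)   # 逐帧打分 (T,)
        segment_logit, _ = frame_logits.max(dim=0)             # 时间维 max 池化
        return segment_logit, frame_logits

model = SegmentMIL()
frames = torch.randn(120, 256)                 # 一个片段的 120 帧特征(占位)
segment_logit, frame_logits = model(frames)
loss = nn.functional.binary_cross_entropy_with_logits(
    segment_logit, torch.tensor(1.0))          # 片段级弱标签:该片段含异常
loss.backward()
abnormal_frame = frame_logits.argmax().item()  # 推理:定位得分最高的帧
```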

【9】 Utilizing Active Machine Learning for Quality Assurance: A Case Study of Virtual Car Renderings in the Automotive Industry 标题:利用主动机器学习进行质量保证--以汽车行业虚拟汽车渲染为例 链接:https://arxiv.org/abs/2110.09023

作者:Patrick Hemmer,Niklas Kühl,Jakob Schöffer 机构:Karlsruhe Institute of Technology, Niklas Kühl, Jakob Schöffer 备注:Hawaii International Conference on System Sciences 2022 (HICSS-55) 摘要:计算机生成的汽车模型图像已成为汽车制造商广告理念中不可或缺的一部分。例如,它们被用于汽车配置器中,为客户提供根据个人喜好在线配置汽车的可能性。然而,由于车型日益复杂,以人为主导的质量保证难以跟上大批量目视检查的需求。尽管机器学习在许多视觉检查任务中的应用已经取得了巨大的成功,但它对大型标注数据集的需求仍然是在实践中使用此类系统的主要障碍。在本文中,我们提出了一种基于主动机器学习的质量保证系统,该系统在不影响性能的情况下,只需显著更少的标注实例即可识别有缺陷的虚拟汽车渲染图。通过在一家德国汽车制造商部署我们的系统,可以克服起步阶段的困难,提高检查流程的效率,从而实现经济效益。 摘要:Computer-generated imagery of car models has become an indispensable part of car manufacturers' advertisement concepts. They are for instance used in car configurators to offer customers the possibility to configure their car online according to their personal preferences. However, human-led quality assurance faces the challenge to keep up with high-volume visual inspections due to the car models' increasing complexity. Even though the application of machine learning to many visual inspection tasks has demonstrated great success, its need for large labeled data sets remains a central barrier to using such systems in practice. In this paper, we propose an active machine learning-based quality assurance system that requires significantly fewer labeled instances to identify defective virtual car renderings without compromising performance. By employing our system at a German automotive manufacturer, start-up difficulties can be overcome, the inspection process efficiency can be increased, and thus economic advantages can be realized.
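下面给出一个池式主动学习(最小置信度采样)的最小示意,对应摘要中"显著更少的标注实例"的思路:每轮只把模型最不确定的样本交给标注者。数据、模型与每轮预算均为占位假设,并非该质量保证系统的实现。

```python
# 示意:池式主动学习 + 最小置信度采样,数据与模型均为占位。
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(2000, 32))
y_pool = (X_pool[:, 0] + 0.3 * rng.normal(size=2000) > 0).astype(int)  # 假想标签源

labeled = list(rng.choice(len(X_pool), size=20, replace=False))  # 初始标注集
for round_ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    proba = clf.predict_proba(X_pool)[:, 1]
    confidence = np.abs(proba - 0.5)             # 离 0.5 越近越不确定
    candidates = np.argsort(confidence)          # 最不确定者优先
    labeled_set = set(labeled)
    new = [i for i in candidates if i not in labeled_set][:20]
    labeled += new                               # 每轮只向标注者请求 20 个样本
    print(f"round {round_}: labeled={len(labeled)}, "
          f"acc={clf.score(X_pool, y_pool):.3f}")
```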

【10】 Demystifying How Self-Supervised Features Improve Training from Noisy Labels 标题:揭秘自监督特征如何改进带噪声标签的训练 链接:https://arxiv.org/abs/2110.09022

作者:Hao Cheng,Zhaowei Zhu,Xing Sun,Yang Liu 机构:† Computer Science and Engineering, University of California, Santa Cruz, ‡ Tencent YouTu Lab 摘要:自监督学习(SSL)的发展促使研究人员将SSL应用于其他任务,如使用噪声标签进行学习。最近的文献表明,基于SSL特征的方法可以极大地提高带噪标签的学习性能。尽管如此,人们对SSL特征为什么(以及如何)使带噪标签的训练受益的深层原因还不太了解。在本文中,我们通过理论分析和数值实验研究了自监督特征为何以及如何帮助网络抵抗标签噪声。我们的结果表明,给定一个经SSL预训练的高质量编码器,由交叉熵损失训练的简单线性层理论上对对称标签噪声具有鲁棒性。此外,我们还深入分析了从SSL特征中蒸馏的知识如何缓解过拟合问题。我们希望我们的工作能够从自监督学习的角度帮助更好地理解带噪声标签的学习,并为进一步的研究提供指导。代码可从github.com/UCSC-REAL/SelfSup_NoisyLabel获得。 摘要:The advancement of self-supervised learning (SSL) motivates researchers to apply SSL on other tasks such as learning with noisy labels. Recent literature indicates that methods built on SSL features can substantially improve the performance of learning with noisy labels. Nonetheless, the deeper reasons why (and how) SSL features benefit the training from noisy labels are less understood. In this paper, we study why and how self-supervised features help networks resist label noise using both theoretical analyses and numerical experiments. Our result shows that, given a quality encoder pre-trained from SSL, a simple linear layer trained by the cross-entropy loss is theoretically robust to symmetric label noise. Further, we provide insights for how knowledge distilled from SSL features can alleviate the over-fitting problem. We hope our work provides a better understanding for learning with noisy labels from the perspective of self-supervised learning and can potentially serve as a guideline for further research. Code is available at github.com/UCSC-REAL/SelfSup_NoisyLabel.
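论文的核心设定可以用几行代码复现其"骨架":在冻结特征上注入对称标签噪声,再用交叉熵训练线性探针。下面的示意中特征用随机高斯占位(真实实验应替换为 SSL 编码器输出),仅用于说明对称噪声的构造方式,并非论文实验设置。

```python
# 示意:对称标签噪声注入 + 冻结特征上的交叉熵线性探针。
import numpy as np
from sklearn.linear_model import LogisticRegression

def symmetric_label_noise(y, num_classes, eps, rng):
    """以概率 eps 将标签均匀翻转到其余类别(对称噪声)。"""
    y = y.copy()
    flip = rng.random(len(y)) < eps
    y[flip] = (y[flip] + rng.integers(1, num_classes, size=flip.sum())) % num_classes
    return y

rng = np.random.default_rng(0)
num_classes = 5
features = rng.normal(size=(3000, 64))            # 冻结的 SSL 特征(此处为占位)
clean = rng.integers(0, num_classes, size=3000)
noisy = symmetric_label_noise(clean, num_classes, eps=0.4, rng=rng)

probe = LogisticRegression(max_iter=2000).fit(features, noisy)  # 交叉熵线性探针
print("与干净标签的一致率:", probe.score(features, clean))
```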

【11】 DPC: Unsupervised Deep Point Correspondence via Cross and Self Construction 标题:DPC:基于交叉与自构造的无监督深度点对应 链接:https://arxiv.org/abs/2110.08636

作者:Itai Lang,Dvir Ginzburg,Shai Avidan,Dan Raviv 机构:Tel Aviv University 备注:3DV 2021 摘要:我们提出了一种基于结构化形状构造的点云实时非刚性稠密对应方法。我们的方法称为深度点对应(DPC),与以往技术相比只需一小部分训练数据,并具有更好的泛化能力。到目前为止,针对稠密对应问题主要有两类解决方法。第一类是基于谱的方法,它在合成数据集上取得了很好的结果,但需要形状的网格连接信息和较长的推理处理时间,且在真实场景中不稳定。第二类是空间方法,它使用编码器-解码器框架从不规则输入回归有序点云以进行匹配对齐。不幸的是,解码器带来了相当大的缺点,因为它需要大量的训练数据,并且难以在跨数据集评估中很好地泛化。DPC的新颖之处在于它没有解码器组件。相反,我们使用潜在相似性和输入坐标本身来构造点云并确定对应关系,取代解码器所做的坐标回归。大量实验表明,与最新的对应方法相比,我们的构造方案带来了性能提升。我们的代码公开于 https://github.com/dvirginz/DPC。 摘要:We present a new method for real-time non-rigid dense correspondence between point clouds based on structured shape construction. Our method, termed Deep Point Correspondence (DPC), requires a fraction of the training data compared to previous techniques and presents better generalization capabilities. Until now, two main approaches have been suggested for the dense correspondence problem. The first is a spectral-based approach that obtains great results on synthetic datasets but requires mesh connectivity of the shapes and long inference processing time while being unstable in real-world scenarios. The second is a spatial approach that uses an encoder-decoder framework to regress an ordered point cloud for the matching alignment from an irregular input. Unfortunately, the decoder brings considerable disadvantages, as it requires a large amount of training data and struggles to generalize well in cross-dataset evaluations. DPC's novelty lies in its lack of a decoder component. Instead, we use latent similarity and the input coordinates themselves to construct the point cloud and determine correspondence, replacing the coordinate regression done by the decoder. Extensive experiments show that our construction scheme leads to a performance boost in comparison to recent state-of-the-art correspondence methods. Our code is publicly available at https://github.com/dvirginz/DPC.
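"用潜在相似性和输入坐标本身构造点云"这一步可以写得很短:对 X 中每个点,用其与 Y 逐点特征的 softmax 相似度加权 Y 的坐标,得到交叉重建与软对应。下面是一个 numpy 概念示意,温度等超参数为假设,并非论文官方实现。

```python
# 示意:交叉构造(cross construction)的核心一步,非论文官方代码。
import numpy as np

def cross_construct(feat_x, feat_y, pts_y, temperature=0.05):
    """feat_x: (N, d), feat_y: (M, d) 为两形状的逐点潜在特征;
    pts_y: (M, 3) 为 Y 的坐标。返回 X 的交叉重建坐标与软对应矩阵。"""
    fx = feat_x / np.linalg.norm(feat_x, axis=1, keepdims=True)
    fy = feat_y / np.linalg.norm(feat_y, axis=1, keepdims=True)
    sim = fx @ fy.T / temperature                 # (N, M) 余弦相似度
    sim -= sim.max(axis=1, keepdims=True)         # 数值稳定的 softmax
    w = np.exp(sim); w /= w.sum(axis=1, keepdims=True)
    x_cross = w @ pts_y                           # 用 Y 的点重建 X,无需解码器
    return x_cross, w

corr = cross_construct(np.random.randn(100, 32),
                       np.random.randn(120, 32),
                       np.random.randn(120, 3))[1]
print(corr.shape, corr.sum(axis=1)[:3])           # 每行和为 1 的软对应
```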

【12】 FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling 标题:FlexMatch:用课程伪标签促进半监督学习 链接:https://arxiv.org/abs/2110.08263

作者:Bowen Zhang,Yidong Wang,Wenxin Hou,Hao Wu,Jindong Wang,Manabu Okumura,Takahiro Shinozaki 机构:Tokyo Institute of Technology, Microsoft Research Asia 备注:Accepted by NeurIPS 2021; 16 pages with appendix; code: this https URL 摘要:最近提出的FixMatch在大多数半监督学习(SSL)基准上取得了最先进的结果。然而,与其他现代SSL算法一样,FixMatch对所有类别使用预定义的常数阈值来选择参与训练的未标记数据,因而无法考虑不同类别的学习状态和学习难度差异。为了解决这个问题,我们提出了课程伪标签(CPL),一种根据模型的学习状态利用未标记数据的课程学习方法。CPL的核心是在每个时间步灵活地调整不同类别的阈值,以便让信息量高的未标记数据及其伪标签通过筛选。CPL不引入额外的参数或计算(前向或反向传播)。我们将CPL应用于FixMatch,并将改进后的算法称为FlexMatch。FlexMatch在各种SSL基准上实现了最先进的性能,当标记数据极其有限或任务具有挑战性时,性能尤其突出。例如,当每个类只有4个标签时,FlexMatch在CIFAR-100和STL-10数据集上的性能分别比FixMatch高14.32%和24.55%。CPL还显著提高了收敛速度,例如,FlexMatch只需FixMatch五分之一的训练时间即可达到更好的性能。此外,我们还证明了CPL可以很容易地适配到其他SSL算法,并显著提高它们的性能。我们的代码开源于 https://github.com/TorchSSL/TorchSSL。 摘要:The recently proposed FixMatch achieved state-of-the-art results on most semi-supervised learning (SSL) benchmarks. However, like other modern SSL algorithms, FixMatch uses a pre-defined constant threshold for all classes to select unlabeled data that contribute to the training, thus failing to consider different learning status and learning difficulties of different classes. To address this issue, we propose Curriculum Pseudo Labeling (CPL), a curriculum learning approach to leverage unlabeled data according to the model's learning status. The core of CPL is to flexibly adjust thresholds for different classes at each time step to let pass informative unlabeled data and their pseudo labels. CPL does not introduce additional parameters or computations (forward or backward propagation). We apply CPL to FixMatch and call our improved algorithm FlexMatch. FlexMatch achieves state-of-the-art performance on a variety of SSL benchmarks, with especially strong performances when the labeled data are extremely limited or when the task is challenging. For example, FlexMatch outperforms FixMatch by 14.32% and 24.55% on CIFAR-100 and STL-10 datasets respectively, when there are only 4 labels per class. CPL also significantly boosts the convergence speed, e.g., FlexMatch can use only 1/5 training time of FixMatch to achieve even better performance. Furthermore, we show that CPL can be easily adapted to other SSL algorithms and remarkably improve their performances. We open source our code at https://github.com/TorchSSL/TorchSSL.
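CPL 按类别柔性调整阈值的想法可以用如下简化代码说明:以"通过固定阈值 tau 的样本数"估计各类学习状态,归一化后缩放 tau 得到逐类阈值。这只是示意性写法,归一化与预热等细节请以 FlexMatch 原文为准。

```python
# 示意:课程伪标签(CPL)逐类柔性阈值的一种简化写法,细节以原文为准。
import numpy as np

def flexible_thresholds(max_probs, pseudo_labels, num_classes, tau=0.95):
    """max_probs: 每个未标记样本的最大预测概率;pseudo_labels: 对应 argmax 类别。
    返回逐类阈值 T_c 以及样本是否入选训练的掩码。"""
    sigma = np.zeros(num_classes)
    for c in range(num_classes):
        sigma[c] = np.sum((pseudo_labels == c) & (max_probs > tau))  # 学习状态估计
    beta = sigma / max(sigma.max(), 1.0)          # 归一化到 [0, 1]
    thresholds = beta * tau                       # 学得越好的类别,阈值越高
    mask = max_probs > thresholds[pseudo_labels]
    return thresholds, mask

rng = np.random.default_rng(0)
probs = rng.uniform(0.5, 1.0, size=1000)
labels = rng.integers(0, 10, size=1000)
T, mask = flexible_thresholds(probs, labels, num_classes=10)
print(T.round(2), mask.mean())
```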

【13】 A deep learning pipeline for localization, differentiation, and uncertainty estimation of liver lesions using multi-phasic and multi-sequence MRI 标题:利用多时相多序列MRI对肝脏病变进行定位、鉴别和不确定性评估的深度学习管道 链接:https://arxiv.org/abs/2110.08817

作者:Peng Wang,Yuhsuan Wu,Bolin Lai,Xiao-Yun Zhou,Le Lu,Wendi Liu,Huabang Zhou,Lingyun Huang,Jing Xiao,Adam P. Harrison,Ningyang Jia,Heping Hu 机构:Eastern Hepatobiliary Surgery Hospital, Shanghai, China, Ping An Technology, Shanghai, China, Georgia Institute of Technology, Atlanta, USA, PAII Inc., Bethesda, Maryland, USA 备注:18 pages, 6 figures 摘要:目的:提出一种具有不确定性估计的全自动肝脏病变定性计算机辅助诊断(CAD)解决方案。方法:从2006年到2019年,我们招募了400名接受过肝切除或活检并被诊断为肝细胞癌(HCC)、肝内胆管癌或继发转移的患者。对每位患者进行T1WI、T2WI、T1WI静脉期(T2WI-V)、T1WI动脉期(T1WI-A)和DWI MRI序列扫描。我们提出了一个全自动的深度CAD流程,该流程使用关键切片解析对3D MRI检查中的病变进行定位,并为其诊断提供置信度度量。我们使用五折交叉验证进行评估,并与三名放射科医生(包括一名高级肝脏放射科医生、一名初级肝脏放射科医生和一名腹部放射科医生)进行比较。结果:所提出的CAD方案的平均F1得分为0.62,优于腹部放射科医生(0.47),与初级肝脏放射科医生持平(0.61),低于高级肝脏放射科医生(0.68)。CAD系统能够对自身诊断置信度给出有参考价值的评估:仅对置信度最高的70%病例进行评估时,HCC对比其他病变的平均F1得分从0.62提高到0.71,80%特异性下的敏感性从0.84提高到0.92。结论:所提出的全自动CAD方案能够提供良好的诊断性能,并在从MRI检查中发现和鉴别肝脏病变时给出有参考价值的置信度评估。 摘要:Objectives: to propose a fully-automatic computer-aided diagnosis (CAD) solution for liver lesion characterization, with uncertainty estimation. Methods: we enrolled 400 patients who had either liver resection or a biopsy and was diagnosed with either hepatocellular carcinoma (HCC), intrahepatic cholangiocarcinoma, or secondary metastasis, from 2006 to 2019. Each patient was scanned with T1WI, T2WI, T1WI venous phase (T2WI-V), T1WI arterial phase (T1WI-A), and DWI MRI sequences. We propose a fully-automatic deep CAD pipeline that localizes lesions from 3D MRI studies using key-slice parsing and provides a confidence measure for its diagnoses. We evaluate using five-fold cross validation and compare performance against three radiologists, including a senior hepatology radiologist, a junior hepatology radiologist and an abdominal radiologist. Results: the proposed CAD solution achieves a mean F1 score of 0.62, outperforming the abdominal radiologist (0.47), matching the junior hepatology radiologist (0.61), and underperforming the senior hepatology radiologist (0.68). The CAD system can informatively assess its diagnostic confidence, i.e., when only evaluating on the 70% most confident cases the mean f1 score and sensitivity at 80% specificity for HCC vs. others are boosted from 0.62 to 0.71 and 0.84 to 0.92, respectively. Conclusion: the proposed fully-automatic CAD solution can provide good diagnostic performance with informative confidence assessments in finding and discriminating liver lesions from MRI studies.

时序|行为识别|姿态|视频|运动估计(9篇)

【1】 Don't Judge Me by My Face : An Indirect Adversarial Approach to Remove Sensitive Information From Multimodal Neural Representation in Asynchronous Job Video Interviews 标题:不要以貌取人:异步工作视频面试中从多模态神经表征中去除敏感信息的间接对抗性方法 链接:https://arxiv.org/abs/2110.09424

作者:Léo Hemamou,Arthur Guillon,Jean-Claude Martin,Chloé Clavel 机构:∗EASYRECRUE, Paris, France, †LIMSI-LISN, CNRS, Paris-Sud University, Paris-Saclay University F-, Orsay, France, ‡Télécom-Paris, IP-Paris, F-, Paris, France 备注:published in ACII 2021 摘要:将机器学习用于求职面试视频的自动分析最近引起了越来越多的兴趣。尽管这些方法声称其输出对候选人的性别、种族等敏感信息是公平的,但目前的方法很少能提供决策无偏的证据,或证明敏感信息没有被使用。近年来,对抗性方法已被证明能有效地从神经网络的潜在表征中去除敏感信息。然而,这些方法依赖于使用明确标记的受保护变量(如性别),而在某些国家(如法国)的招聘中无法收集这些变量。在这篇文章中,我们提出了一种新的对抗性方法,无需收集任何敏感变量即可去除神经网络潜在表示中的敏感信息。仅使用面试视频中的少量帧,我们训练我们的模型,使其无法在模型的内层找到与面试相关的候选人的面孔,这反过来又允许我们从这些层中删除相关的隐私信息。将我们的方法与带有性别和种族注释的公共数据集上的标准基线进行比较,我们发现它可以有效地从主网络中删除敏感信息。此外,据我们所知,这是首次在视频求职面试场景中应用对抗性技术来获得多模态公平表示。总之,我们的贡献旨在提高即将推出的求职面试视频自动处理系统的公平性,以实现求职机会平等。 摘要:Use of machine learning for automatic analysis of job interview videos has recently seen increased interest. Despite claims of fair output regarding sensitive information such as gender or ethnicity of the candidates, the current approaches rarely provide proof of unbiased decision-making, or that sensitive information is not used. Recently, adversarial methods have been proved to effectively remove sensitive information from the latent representation of neural networks. However, these methods rely on the use of explicitly labeled protected variables (e.g. gender), which cannot be collected in the context of recruiting in some countries (e.g. France). In this article, we propose a new adversarial approach to remove sensitive information from the latent representation of neural networks without the need to collect any sensitive variable. Using only a few frames of the interview, we train our model to not be able to find the face of the candidate related to the job interview in the inner layers of the model. This, in turn, allows us to remove relevant private information from these layers. Comparing our approach to a standard baseline on a public dataset with gender and ethnicity annotations, we show that it effectively removes sensitive information from the main network. Moreover, to the best of our knowledge, this is the first application of adversarial techniques for obtaining a multimodal fair representation in the context of video job interviews. In summary, our contributions aim at improving fairness of the upcoming automatic systems processing videos of job interviews for equality in job selection.
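对抗式去除敏感信息的常见实现是梯度反转层(GRL):对抗分支(此处可对应论文中"在内层特征上找人脸"的分支)正常前向,反向传播时梯度取负,迫使主干特征骗过该分支。下面是 GRL 的通用 PyTorch 写法,对抗分支的结构为占位假设,并非论文官方代码。

```python
# 示意:梯度反转层(GRL)的通用实现,对抗分支为占位。
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)                      # 前向恒等

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None     # 反向取负并缩放

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

feat = torch.randn(8, 128, requires_grad=True)   # 主干输出的表示
adv_head = torch.nn.Linear(128, 1)               # 对抗分支(如:能否找到人脸)
adv_logit = adv_head(grad_reverse(feat))
loss = adv_logit.mean()
loss.backward()                                  # feat.grad 的符号已被反转
print(feat.grad.shape)
```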

【2】 A DCT-based Tensor Completion Approach for Recovering Color Images and Videos from Highly Undersampled Data 标题:一种基于DCT的高欠采样彩色图像和视频张量补全方法 链接:https://arxiv.org/abs/2110.09298

作者:Chenjian Pan,Chen Ling,Hongjin He,Liqun Qi,Yanwei Xu 机构:Department of Mathematics, School of Science, Hangzhou Dianzi University, Hangzhou, China., School of Mathematics and Statistics, Ningbo University, Ningbo , China. 备注:13 pages, 2 tables and 8 figures 摘要:在人脸识别和计算机视觉中,从高度欠采样的数据中恢复彩色图像和视频是一项基本且具有挑战性的任务。基于彩色图像和视频的多维特性,本文提出了一种新的张量补全方法,能够有效地挖掘张量数据在离散余弦变换(DCT)下的稀疏性。具体来说,我们介绍了两种基于DCT的张量补全模型以及两种可实现的求解算法。第一种是基于DCT的加权核范数最小化模型。第二种称为基于DCT的$p$-收缩张量补全模型,它是一种非凸模型,利用$p$-收缩映射来促进数据的低秩性。此外,我们相应地提出了两种可实现的基于增广拉格朗日的算法来求解底层优化模型。包括彩色和MRI图像修复以及视频数据恢复在内的一系列数值实验表明,我们提出的方法比许多现有的最先进的张量补全方法性能更好,特别是在缺失数据比率较高的情况下。 摘要:Recovering color images and videos from highly undersampled data is a fundamental and challenging task in face recognition and computer vision. By the multi-dimensional nature of color images and videos, in this paper, we propose a novel tensor completion approach, which is able to efficiently explore the sparsity of tensor data under the discrete cosine transform (DCT). Specifically, we introduce two DCT-based tensor completion models as well as two implementable algorithms for their solutions. The first one is a DCT-based weighted nuclear norm minimization model. The second one is called DCT-based $p$-shrinking tensor completion model, which is a nonconvex model utilizing $p$-shrinkage mapping for promoting the low-rankness of data. Moreover, we accordingly propose two implementable augmented Lagrangian-based algorithms for solving the underlying optimization models. A series of numerical experiments including color and MRI image inpainting and video data recovery demonstrate that our proposed approach performs better than many existing state-of-the-art tensor completion methods, especially for the case when the ratio of missing data is high.
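这类 DCT 域张量补全的核心迭代步骤,通常是沿第三模做 DCT,再对每个正面切片的奇异值做收缩以促进低秩性。下面用一个简化的 p-收缩算子示意该步骤;收缩算子的具体形式与论文的精确定义可能不同,仅供理解整体框架。

```python
# 示意:DCT 域逐切片奇异值 p-收缩(简化形式,非论文精确算法)。
import numpy as np
from scipy.fft import dct, idct

def p_shrink(s, tau, p=0.5):
    """对奇异值向量 s 的简化 p-收缩:s <- max(s - tau * s**(p-1), 0)。"""
    return np.maximum(s - tau * np.power(np.maximum(s, 1e-8), p - 1.0), 0.0)

def dct_lowrank_step(X, tau=0.1, p=0.5):
    """X: (n1, n2, n3) 张量。沿第三模 DCT,逐切片 SVD 收缩,再逆变换。"""
    Xd = dct(X, axis=2, norm="ortho")
    for k in range(X.shape[2]):
        U, s, Vt = np.linalg.svd(Xd[:, :, k], full_matrices=False)
        Xd[:, :, k] = (U * p_shrink(s, tau, p)) @ Vt
    return idct(Xd, axis=2, norm="ortho")

X = np.random.rand(32, 32, 3)
print(np.abs(dct_lowrank_step(X, tau=0.0) - X).max())  # tau=0 时应近似恒等
```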

【3】 Video Coding for Machine: Compact Visual Representation Compression for Intelligent Collaborative Analytics 标题:机器视频编码:面向智能协作分析的紧凑视觉表示压缩 链接:https://arxiv.org/abs/2110.09241

作者:Wenhan Yang,Haofeng Huang,Yueyu Hu,Ling-Yu Duan,Jiaying Liu 机构:Peking University 备注:The first three authors had equal contribution. arXiv admin note: text overlap with arXiv:2106.08512 摘要:机器视频编码(VCM)致力于在一定程度上衔接视频/图像压缩和特征压缩这两条独立的研究轨道,并试图从高精度机器视觉和全保真度人类视觉的统一视角,联合优化紧凑性和效率。在本文中,我们基于现有学术界和工业界的努力,总结了VCM的方法论和理念。VCM的发展遵循一般的率失真优化,并建立了关键模块或技术的分类体系。从以往工作可以看出,尽管现有工作试图在处理机器和人类视觉任务时揭示以比特表示的可伸缩表征的本质,但对低比特率表示的通用性以及如何支持多种视觉分析任务的研究仍然很少。因此,我们针对分析分类问题研究了一种新的视觉信息压缩方法,以增强从多个任务中提取的紧凑视觉表示用于视觉分析的能力,并重新审视了任务关系与压缩之间的新视角。通过牢记不同机器视觉任务之间的可迁移性(例如高级语义任务与中级几何相关任务),我们旨在以低比特率联合支持多个任务。特别是,为了缩小从像素中提取的神经网络生成特征与各种机器视觉特征/标签(例如场景类别、分割标签)之间的维数差距,设计了一个码本超先验(codebook hyperprior)来压缩神经网络生成的特征。实验表明,这种新的超先验模型通过更准确地估计信号熵来提高特征压缩效率,从而支持进一步研究不同任务之间提取紧凑特征的粒度。 摘要:Video Coding for Machines (VCM) is committed to bridging to an extent separate research tracks of video/image compression and feature compression, and attempts to optimize compactness and efficiency jointly from a unified perspective of high accuracy machine vision and full fidelity human vision. In this paper, we summarize VCM methodology and philosophy based on existing academia and industrial efforts. The development of VCM follows a general rate-distortion optimization, and the categorization of key modules or techniques is established. From previous works, it is demonstrated that, although existing works attempt to reveal the nature of scalable representation in bits when dealing with machine and human vision tasks, there remains a rare study in the generality of low bit rate representation, and accordingly how to support a variety of visual analytic tasks. Therefore, we investigate a novel visual information compression for the analytics taxonomy problem to strengthen the capability of compact visual representations extracted from multiple tasks for visual analytics. A new perspective of task relationships versus compression is revisited. By keeping in mind the transferability among different machine vision tasks (e.g. high-level semantic and mid-level geometry-related), we aim to support multiple tasks jointly at low bit rates. In particular, to narrow the dimensionality gap between neural network generated features extracted from pixels and a variety of machine vision features/labels (e.g. scene class, segmentation labels), a codebook hyperprior is designed to compress the neural network-generated features. As demonstrated in our experiments, this new hyperprior model is expected to improve feature compression efficiency by estimating the signal entropy more accurately, which enables further investigation of the granularity of abstracting compact features among different tasks.

【4】 Boosting the Transferability of Video Adversarial Examples via Temporal Translation 标题:通过时间平移提升视频对抗样本的可迁移性 链接:https://arxiv.org/abs/2110.09075

作者:Zhipeng Wei,Jingjing Chen,Zuxuan Wu,Yu-Gang Jiang 机构:Fudan University 摘要:尽管基于深度学习的视频识别模型取得了显著的成功,但它们很容易受到对抗样本的影响,这些对抗样本是通过在干净的视频样本上添加人眼难以察觉的扰动而生成的。正如最近的研究所表明的,对抗样本是可迁移的,这使得在实际应用中进行黑盒攻击成为可能。然而,大多数现有的对抗攻击方法在攻击其他视频模型时可迁移性较差,而针对视频模型的基于迁移的攻击仍未得到研究。为此,我们提出提升视频对抗样本的可迁移性,用于针对视频识别模型的黑盒攻击。通过广泛的分析,我们发现不同的视频识别模型依赖于不同的判别性时间模式,导致视频对抗样本的可迁移性较差。这促使我们引入一种时间平移攻击方法,在一组经过时间平移的视频片段上优化对抗扰动。通过在平移后的视频上生成对抗样本,所得对抗样本对被攻击白盒模型中存在的时间模式不再敏感,因而可以更好地迁移。在Kinetics-400数据集和UCF-101数据集上的大量实验表明,我们的方法可以显著提升视频对抗样本的可迁移性。对于针对视频识别模型的基于迁移的攻击,其在Kinetics-400和UCF-101上的平均攻击成功率分别达到61.56%和48.60%。 摘要:Although deep-learning based video recognition models have achieved remarkable success, they are vulnerable to adversarial examples that are generated by adding human-imperceptible perturbations on clean video samples. As indicated in recent studies, adversarial examples are transferable, which makes it feasible for black-box attacks in real-world applications. Nevertheless, most existing adversarial attack methods have poor transferability when attacking other video models and transfer-based attacks on video models are still unexplored. To this end, we propose to boost the transferability of video adversarial examples for black-box attacks on video recognition models. Through extensive analysis, we discover that different video recognition models rely on different discriminative temporal patterns, leading to the poor transferability of video adversarial examples. This motivates us to introduce a temporal translation attack method, which optimizes the adversarial perturbations over a set of temporal translated video clips. By generating adversarial examples over translated videos, the resulting adversarial examples are less sensitive to temporal patterns existed in the white-box model being attacked and thus can be better transferred. Extensive experiments on the Kinetics-400 dataset and the UCF-101 dataset demonstrate that our method can significantly boost the transferability of video adversarial examples. For transfer-based attack against video recognition models, it achieves a 61.56% average attack success rate on the Kinetics-400 and 48.60% on the UCF-101.
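时间平移攻击的要点是在若干时间平移后的副本上聚合梯度,使扰动不过拟合白盒模型特有的时间模式。下面给出一个单步 FGSM 式的 PyTorch 示意,其中 DummyVideoNet 为占位识别模型,平移集合与步长均为假设,并非论文官方实现。

```python
# 示意:在时间平移副本上聚合梯度的单步对抗更新,模型为占位。
import torch

def temporal_translation_grad(model, video, label, shifts=(-2, -1, 0, 1, 2)):
    """video: (T, C, H, W)。对每个时间平移副本累加关于扰动的梯度并取平均。"""
    delta = torch.zeros_like(video, requires_grad=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    for s in shifts:
        shifted = torch.roll(video + delta, shifts=s, dims=0)  # 沿时间维平移
        logits = model(shifted.unsqueeze(0))                   # (1, num_classes)
        loss_fn(logits, label.unsqueeze(0)).backward()
    return delta.grad / len(shifts)

class DummyVideoNet(torch.nn.Module):                          # 占位识别模型
    def forward(self, x):                                      # x: (B, T, C, H, W)
        return x.mean(dim=(1, 3, 4)) @ torch.randn(3, 10)

video = torch.rand(16, 3, 32, 32)
grad = temporal_translation_grad(DummyVideoNet(), video, torch.tensor(5))
adv_video = (video + 8 / 255 * grad.sign()).clamp(0, 1)        # 一步 FGSM 式更新
print(adv_video.shape)
```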

【5】 Fast tree skeleton extraction using voxel thinning based on tree point cloud 标题:基于树状点云的体素细化快速提取树骨架 链接:https://arxiv.org/abs/2110.09028

作者:Jingqian Sun,Pei Wang,Ronghao Li,Mei Zhou 机构:School of Science, Beijing Forestry University, Qinghua East Road, Haidian District, Beijing, China, Key Laboratory of Quantitative Remote Sensing Information Technology, Aerospace 摘要:树木骨架在树木结构分析、森林调查和生态系统监测中起着重要作用。然而,从具有复杂分支的树木点云中提取骨架是一项挑战。本文提出了一种基于体素细化的自动快速树骨架提取方法(FTSEM)。该方法引入木材-叶片分类算法来过滤叶片点,以减少叶片对树骨架生成的干扰;采用树体素细化来快速提取初始树骨架;并采用断点连接算法来提高骨架的连通性和完整性。实验在北京海淀公园进行,对24棵树木进行扫描和处理,获得树木骨架,同时使用图搜索算法(GSA)在相同数据集上提取树骨架作为对比。与GSA方法相比,FTSEM方法得到了更完整的树骨架。FTSEM方法的时间开销使用运行时间和每百万点耗时(TPMP)进行评估:FTSEM的运行时间从1.0秒到13.0秒,GSA的运行时间从6.4秒到309.3秒;FTSEM和GSA的TPMP平均值分别为1.8秒和22.3秒。实验结果表明,该方法可行、鲁棒、快速,在树骨架提取方面具有良好的潜力。 摘要:Tree skeleton plays an important role in tree structure analysis, forest inventory and ecosystem monitoring. However, it is a challenge to extract a skeleton from a tree point cloud with complex branches. In this paper, an automatic and fast tree skeleton extraction method (FTSEM) based on voxel thinning is proposed. In this method, a wood-leaf classification algorithm was introduced to filter leaf points for the reduction of the leaf interference on tree skeleton generation, tree voxel thinning was adopted to extract raw tree skeleton quickly, and a breakpoint connection algorithm was used to improve the skeleton connectivity and completeness. Experiments were carried out in Haidian Park, Beijing, in which 24 trees were scanned and processed to obtain tree skeletons. The graph search algorithm (GSA) is used to extract tree skeletons based on the same datasets. Compared with GSA method, the FTSEM method obtained more complete tree skeletons. And the time cost of the FTSEM method is evaluated using the runtime and time per million points (TPMP). The runtime of FTSEM is from 1.0 s to 13.0 s, and the runtime of GSA is from 6.4 s to 309.3 s. The average value of TPMP is 1.8 s for FTSEM, and 22.3 s for GSA respectively. The experimental results demonstrate that the proposed method is feasible, robust, and fast with a good potential on tree skeleton extraction.

【6】 Backpropagation with Biologically Plausible Spatio-Temporal Adjustment For Training Deep Spiking Neural Networks 标题:基于生物合理时空调整的反向传播训练深度尖峰神经网络 链接:https://arxiv.org/abs/2110.08858

作者:Guobin Shen,Dongcheng Zhao,Yi Zeng 机构:Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, School of Future Technology, University of Chinese Academy of Sciences, Beijing, China, Research Center for Brain-Inspired Intelligence, IACAS, Beijing, China 摘要:尖峰神经网络(SNN)模拟人脑中的信息处理操作,以包含丰富时空信息的尖峰序列表示和传输信息,在许多认知任务中表现出优异的性能。此外,事件驱动的信息处理使得在神经形态芯片上实现节能成为可能。深度学习的成功离不开反向传播。由于信息传输是离散的,直接将反向传播应用于SNN的训练,与传统深度神经网络相比仍存在性能差距;同时,为了获得更好的性能,需要大量的模拟时间,这会导致较高的延迟。为了解决这些问题,我们提出了一种生物学上合理的空间调整方法,它重新考虑了膜电位和尖峰之间的关系,实现了对不同时间步梯度的合理调整,从而精确地控制误差沿空间维度的反向传播。其次,我们提出了一种生物学上合理的时间调整方法,使误差在时间维度上跨尖峰传播,从而克服了传统尖峰神经元在单个尖峰周期内的时间依赖性问题。我们已经在多个数据集上验证了我们的算法,实验结果表明,我们的算法在大大降低网络延迟和能耗的同时,还提高了网络性能。我们在神经形态数据集N-MNIST、DVS-Gesture和DVS-CIFAR10上取得了最先进的性能。对于静态数据集MNIST和CIFAR10,我们超过了大多数传统的SNN反向传播训练算法,并取得了相对优越的性能。 摘要:The spiking neural network (SNN) mimics the information processing operation in the human brain, represents and transmits information in spike trains containing wealthy spatial and temporal information, and shows superior performance on many cognitive tasks. In addition, the event-driven information processing enables the energy-efficient implementation on neuromorphic chips. The success of deep learning is inseparable from backpropagation. Due to the discrete information transmission, directly applying the backpropagation to the training of the SNN still has a performance gap compared with the traditional deep neural networks. Also, a large simulation time is required to achieve better performance, which results in high latency. To address the problems, we propose a biological plausible spatial adjustment, which rethinks the relationship between membrane potential and spikes and realizes a reasonable adjustment of gradients to different time steps. And it precisely controls the backpropagation of the error along the spatial dimension. Secondly, we propose a biologically plausible temporal adjustment making the error propagate across the spikes in the temporal dimension, which overcomes the problem of the temporal dependency within a single spike period of the traditional spiking neurons. We have verified our algorithm on several datasets, and the experimental results have shown that our algorithm greatly reduces the network latency and energy consumption while also improving network performance. We have achieved state-of-the-art performance on the neuromorphic datasets N-MNIST, DVS-Gesture, and DVS-CIFAR10. For the static datasets MNIST and CIFAR10, we have surpassed most of the traditional SNN backpropagation training algorithm and achieved relatively superior performance.

【7】 Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor 标题:智能视频编辑:在视频编辑器中整合现代说话人脸生成算法 链接:https://arxiv.org/abs/2110.08580

作者:Anchit Gupta,Faizan Farooq Khan,Rudrabha Mukhopadhyay,Vinay P. Namboodiri,C. V. Jawahar 机构:IIIT, Hyderabad, Telangana, India, University of Bath, England 备注:9 pages, 7 figures, accepted in ICVGIP 2021 摘要:本文提出了一种基于OpenShot的视频编辑器,其中添加了几种最先进的人脸视频编辑算法作为附加功能。我们的编辑器提供了一个易于使用的界面,可以交互式地应用现代唇形同步算法。除了唇形同步,该编辑器还利用音频和面部再现来生成富有表情的说话人脸。手动控制改善了视频编辑的整体体验,同时不会错失现代合成视频生成算法的好处。这种控制使我们能够对复杂的配音电影场景、采访、电视节目和其他视觉内容进行唇形同步。此外,我们的编辑器还提供自动翻译讲座的功能,涵盖语音内容的翻译、讲者的唇形同步,以及幻灯片等背景内容的处理;在此过程中,我们还解决了将背景内容与翻译后语音同步这一关键问题。我们通过进行人工评估,定性地评估所提出编辑器的有用性。我们的评估表明,人工剪辑效率明显提高,视频生成质量也得到改善。我们在补充材料中附上演示视频,清楚地解释了该工具,并展示了多种结果。 摘要:This paper proposes a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities. Our editor provides an easy-to-use interface to apply modern lip-syncing algorithms interactively. Apart from lip-syncing, the editor also uses audio and facial re-enactment to generate expressive talking faces. The manual control improves the overall experience of video editing without missing out on the benefits of modern synthetic video generation algorithms. This control enables us to lip-sync complex dubbed movie scenes, interviews, television shows, and other visual content. Furthermore, our editor provides features that automatically translate lectures from spoken content, lip-sync of the professor, and background content like slides. While doing so, we also tackle the critical aspect of synchronizing background content with the translated speech. We qualitatively evaluate the usefulness of the proposed editor by conducting human evaluations. Our evaluations show a clear improvement in the efficiency of using human editors and an improved video generation quality. We attach demo videos with the supplementary material clearly explaining the tool and also showcasing multiple results.

【8】 Visual-aware Attention Dual-stream Decoder for Video Captioning 标题:视频字幕的视觉感知注意双流解码器 链接:https://arxiv.org/abs/2110.08578

作者:Zhixin Sun,Xian Zhong,Shuqin Chen,Lin Li,Luo Zhong 机构: School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, China, Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology, Wuhan, China 摘要:视频字幕是一项具有挑战性的任务,需要捕捉不同的视觉部分并用句子描述,因为它要求视觉和语言的连贯性。当前视频字幕方法中的注意机制学习为每一帧分配权重,从而动态地辅助解码器,但这可能无法显式建模从序列帧中提取的视觉特征之间的相关性和时间一致性。为了生成语义连贯的句子,我们提出了一种新的视觉感知注意(VA)模型,该模型将时序帧的动态变化与前一时刻的单词连接起来,作为注意机制的输入来提取序列特征。此外,流行的方法在训练中广泛使用教师强制(TF)学习,即下一个标记以先前的真值标记为条件生成,先前已生成标记中的语义信息因此丢失。为此,我们设计了一个自强制(SF)流,将前一个标记概率分布中的语义信息作为输入,以增强当前标记。双流解码器(DD)架构将TF和SF流统一起来,生成句子以促进两个流的标注字幕。同时,利用双流解码器,缓解了TF学习中由于训练和测试之间的差异而导致的暴露偏差问题。在Microsoft视频描述(MSVD)语料库和MSR视频到文本(MSR-VTT)数据集上的实验研究证明了所提出的视觉感知注意双流解码器(VADD)的有效性。 摘要:Video captioning is a challenging task that captures different visual parts and describes them in sentences, for it requires visual and linguistic coherence. The attention mechanism in the current video captioning method learns to assign weight to each frame, promoting the decoder dynamically. This may not explicitly model the correlation and the temporal coherence of the visual features extracted in the sequence frames.To generate semantically coherent sentences, we propose a new Visual-aware Attention (VA) model, which concatenates dynamic changes of temporal sequence frames with the words at the previous moment, as the input of attention mechanism to extract sequence features.In addition, the prevalent approaches widely use the teacher-forcing (TF) learning during training, where the next token is generated conditioned on the previous ground-truth tokens. The semantic information in the previously generated tokens is lost. Therefore, we design a self-forcing (SF) stream that takes the semantic information in the probability distribution of the previous token as input to enhance the current token.The Dual-stream Decoder (DD) architecture unifies the TF and SF streams, generating sentences to promote the annotated captioning for both streams.Meanwhile, with the Dual-stream Decoder utilized, the exposure bias problem is alleviated, caused by the discrepancy between the training and testing in the TF learning.The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated through the result of experimental studies on Microsoft video description (MSVD) corpus and MSR-Video to text (MSR-VTT) datasets.

【9】 3D Human Pose Estimation for Free-form Activity Using WiFi Signals 标题:基于WiFi信号的自由活动三维人体姿态估计 链接:https://arxiv.org/abs/2110.08314

作者:Yili Ren,Jie Yang 机构:ZI WANG, Florida State University, USA, SHENG TAN, Trinity University, USA, YINGYING CHEN, Rutgers University, USA 摘要:WiFi人体感知在实现新兴人机交互应用方面越来越有吸引力。相应的技术已逐渐从多种活动类型的分类发展到更细粒度的3D人体姿态跟踪。然而,现有的基于WiFi的3D人体姿态跟踪仅限于一组预定义的活动。在这项工作中,我们介绍了Winect,一个使用商用WiFi设备的自由形式活动3D人体姿态跟踪系统。我们的系统通过估计由一组人体关节组成的3D骨架姿态来跟踪自由形式的活动。特别地,我们将信号分离和关节运动建模相结合,实现了自由形式的活动跟踪。我们的系统首先利用人体反射信号的二维到达角来识别运动肢体,并分离每个肢体的纠缠信号;然后,它跟踪每个肢体,并通过建模肢体运动和相应关节之间的固有关系来构建身体的3D骨架。我们的评估结果表明,Winect与环境无关,在包括非视距(NLoS)场景在内的各种挑战性环境下实现了厘米级的自由形式活动跟踪精度。 摘要:WiFi human sensing has become increasingly attractive in enabling emerging human-computer interaction applications. The corresponding technique has gradually evolved from the classification of multiple activity types to more fine-grained tracking of 3D human poses. However, existing WiFi-based 3D human pose tracking is limited to a set of predefined activities. In this work, we present Winect, a 3D human pose tracking system for free-form activity using commodity WiFi devices. Our system tracks free-form activity by estimating a 3D skeleton pose that consists of a set of joints of the human body. In particular, we combine signal separation and joint movement modeling to achieve free-form activity tracking. Our system first identifies the moving limbs by leveraging the two-dimensional angle of arrival of the signals reflected off the human body and separates the entangled signals for each limb. Then, it tracks each limb and constructs a 3D skeleton of the body by modeling the inherent relationship between the movements of the limb and the corresponding joints. Our evaluation results show that Winect is environment-independent and achieves centimeter-level accuracy for free-form activity tracking under various challenging environments including the none-line-of-sight (NLoS) scenarios.

医学相关(6篇)

【1】 A Prior Guided Adversarial Representation Learning and Hypergraph Perceptual Network for Predicting Abnormal Connections of Alzheimer's Disease 标题:先验引导的对抗性表征学习和超图感知网络预测阿尔茨海默病异常联系 链接:https://arxiv.org/abs/2110.09302

作者:Qiankun Zuo,Baiying Lei,Shuqiang Wang,Yong Liu,Bingchuan Wang,Yanyan Shen 机构:Shenzhen University, China; Yong Liu is with the Gaoling School of Artificial Intelligence, Renmin University of China 摘要:阿尔茨海默病的特征是在其进行性退化过程中大脑结构和功能连接的改变。现有的辅助诊断方法已经完成了分类任务,但很少有方法能够准确地评估脑连接的变化特征。在这项工作中,提出了一种先验引导的对抗性表征学习和超图感知网络(PGARL-HPN),利用三模态医学图像预测异常脑连接。具体而言,根据解剖学知识估计先验分布,以指导使用对抗策略的多模态表征学习。此外,还进一步利用成对协作鉴别器结构来缩小表示分布的差异。此外,开发了超图感知网络来有效地融合学习到的表示,同时在多模态图像内部和之间建立高阶关系。实验结果表明,该模型在分析和预测阿尔茨海默病进展方面优于其他相关方法。更重要的是,所识别的异常连接部分符合先前的神经科学发现。该模型可以评估阿尔茨海默病不同阶段异常脑连接的特征,有助于认知疾病的研究和早期治疗。 摘要:Alzheimer's disease is characterized by alterations of the brain's structural and functional connectivity during its progressive degenerative processes. Existing auxiliary diagnostic methods have accomplished the classification task, but few of them can accurately evaluate the changing characteristics of brain connectivity. In this work, a prior guided adversarial representation learning and hypergraph perceptual network (PGARL-HPN) is proposed to predict abnormal brain connections using triple-modality medical images. Concretely, a prior distribution from the anatomical knowledge is estimated to guide multimodal representation learning using an adversarial strategy. Also, the pairwise collaborative discriminator structure is further utilized to narrow the difference of representation distribution. Moreover, the hypergraph perceptual network is developed to effectively fuse the learned representations while establishing high-order relations within and between multimodal images. Experimental results demonstrate that the proposed model outperforms other related methods in analyzing and predicting Alzheimer's disease progression. More importantly, the identified abnormal connections are partly consistent with the previous neuroscience discoveries. The proposed model can evaluate characteristics of abnormal brain connections at different stages of Alzheimer's disease, which is helpful for cognitive disease study and early treatment.

【2】 GAN-based disentanglement learning for chest X-ray rib suppression 标题:基于GAN的解缠学习在胸部X线肋骨抑制中的应用 链接:https://arxiv.org/abs/2110.09134

作者:Luyi Han,Yuanyuan Lyu,Cheng Peng,S. Kevin Zhou 机构:Department of Radiology and Nuclear Medicine, Radboud University Medical Center, Geert Grooteplein , GA, The Netherlands, Department of Radiology, Netherlands Cancer Institute (NKI), Plesmanlaan ,CX, Amsterdam, The Netherlands 摘要:临床证据表明,肋骨抑制胸部X射线(CXRs)可以提高肺部疾病诊断的可靠性。然而,以前产生肋骨抑制的CXR的方法在保留细节和消除肋骨残留方面面临挑战。在此,我们提出了一种基于GAN的解纠缠学习框架,称为肋骨抑制GAN,或RSGAN,通过利用嵌入在未成对CT(CT)图像中的解剖学知识来执行肋骨抑制。在这种方法中,我们使用残差图来描述CXR和相应肋骨抑制结果之间的强度差异。为了预测CXR域中的残差图,我们将图像分解为结构和对比度特征,并从CT计算的数字重建X线照片(DRR)中转移肋骨结构先验信息。此外,我们采用额外的自适应损失来抑制肋骨残留并保留更多细节。我们基于1673个CT体积和四个基准CXR数据集(总计超过120K图像)进行了广泛的实验,以证明(i)与最先进的肋骨抑制方法相比,我们提出的RSGAN实现了更高的图像质量;(ii)将CXR与我们的肋骨抑制结果相结合,可以在肺部疾病分类和结核区域检测方面取得更好的性能。 摘要:Clinical evidence has shown that rib-suppressed chest X-rays (CXRs) can improve the reliability of pulmonary disease diagnosis. However, previous approaches on generating rib-suppressed CXR face challenges in preserving details and eliminating rib residues. We hereby propose a GAN-based disentanglement learning framework called Rib Suppression GAN, or RSGAN, to perform rib suppression by utilizing the anatomical knowledge embedded in unpaired computed tomography (CT) images. In this approach, we employ a residual map to characterize the intensity difference between CXR and the corresponding rib-suppressed result. To predict the residual map in CXR domain, we disentangle the image into structure- and contrast-specific features and transfer the rib structural priors from digitally reconstructed radiographs (DRRs) computed by CT. Furthermore, we employ additional adaptive loss to suppress rib residue and preserve more details. We conduct extensive experiments based on 1,673 CT volumes, and four benchmarking CXR datasets, totaling over 120K images, to demonstrate that (i) our proposed RSGAN achieves superior image quality compared to the state-of-the-art rib suppression methods; (ii) combining CXR with our rib-suppressed result leads to better performance in lung disease classification and tuberculosis area detection.

【3】 Data Shapley Value for Handling Noisy Labels: An application in Screening COVID-19 Pneumonia from Chest CT Scans 标题:处理噪声标签的数据Shapley值:在胸部CT扫描筛查冠状病毒肺炎中的应用 链接:https://arxiv.org/abs/2110.08726

作者:Nastaran Enshaei,Moezedin Javad Rafiee,Arash Mohammadi,Farnoosh Naderkhani 机构:†Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Canada, ‡Department of Medicine and Diagnostic Radiology, McGill University, Montreal, QC, Canada 摘要:深度学习模型的一个长期挑战是如何处理噪声标签,特别是在人命攸关的应用中。采用数据Shapley值(SV)这一合作博弈论方法,是解决标签噪声问题的一种智能估值方案。数据SV可与学习模型和评估指标一起使用,以验证每个训练点对模型性能的贡献。然而,数据点的SV不是唯一的,它取决于学习模型、评估指标以及在训练博弈中协作的其他数据点。尽管如此,采用不同评估指标计算SV对检测噪声标签和衡量数据点重要性的影响尚未得到彻底研究。在此背景下,我们进行了一系列比较分析,以评估在不同评估指标下SV检测噪声输入标签的能力。我们在COVID-19感染CT图像上的实验表明,尽管数据SV可以有效地识别噪声标签,但采用不同的评估指标会显著影响其从不同数据类别识别噪声标签的能力。具体而言,我们证明SV在很大程度上取决于所关联的评估指标。 摘要:A long-standing challenge of deep learning models involves how to handle noisy labels, especially in applications where human lives are at stake. Adoption of the data Shapley Value (SV), a cooperative game theoretical approach, is an intelligent valuation solution to tackle the issue of noisy labels. Data SV can be used together with a learning model and an evaluation metric to validate each training point's contribution to the model's performance. The SV of a data point, however, is not unique and depends on the learning model, the evaluation metric, and other data points collaborating in the training game. However, effects of utilizing different evaluation metrics for computation of the SV, detecting the noisy labels, and measuring the data points' importance has not yet been thoroughly investigated. In this context, we performed a series of comparative analyses to assess SV's capabilities to detect noisy input labels when measured by different evaluation metrics. Our experiments on COVID-19-infected of CT images illustrate that although the data SV can effectively identify noisy labels, adoption of different evaluation metric can significantly influence its ability to identify noisy labels from different data classes. Specifically, we demonstrate that the SV greatly depends on the associated evaluation metric.
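数据 Shapley 值通常用截断蒙特卡洛(TMC)方式近似:对随机排列逐点加入训练集,记录评估指标的边际增量。下面是一个自包含的小示意,用人为注入噪声标签的玩具数据展示"噪声样本倾向获得低 SV";模型、指标与截断阈值均为占位假设,并非论文中 COVID-19 CT 实验的实现。

```python
# 示意:截断蒙特卡洛(TMC)式的数据 Shapley 值估计,数据与模型为占位。
import numpy as np
from sklearn.linear_model import LogisticRegression

def tmc_shapley(X, y, X_val, y_val, n_perm=50, tol=1e-3, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(X)
    shapley = np.zeros(n)
    full_score = LogisticRegression(max_iter=500).fit(X, y).score(X_val, y_val)
    for _ in range(n_perm):
        perm = rng.permutation(n)
        prev = 0.5                                   # 空集基线:二分类随机猜测
        for t, i in enumerate(perm):
            idx = perm[: t + 1]
            if len(np.unique(y[idx])) < 2:           # 类别不全时无法训练
                score = prev
            else:
                score = (LogisticRegression(max_iter=500)
                         .fit(X[idx], y[idx]).score(X_val, y_val))
            shapley[i] += score - prev               # 该点的边际贡献
            prev = score
            if abs(full_score - prev) < tol:         # 截断:后续贡献可忽略
                break
    return shapley / n_perm

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5)); y = (X[:, 0] > 0).astype(int)
y[:5] = 1 - y[:5]                                    # 人为注入 5 个噪声标签
sv = tmc_shapley(X, y, X, y)
print("SV 最低的 5 个样本:", np.argsort(sv)[:5])      # 噪声标签倾向得到低 SV
```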

【4】 BAPGAN: GAN-based Bone Age Progression of Femur and Phalange X-ray Images 标题:BAPGAN:基于GAN的股骨和指骨X射线图像骨龄演变 链接:https://arxiv.org/abs/2110.08509

作者:Shinji Nakazawa,Changhee Han,Joe Hasei,Ryuichi Nakahara,Toshifumi Ozaki 机构: LPIXEL Inc., Tokyo, Japan, Saitama Prefectural University, Saitama, Japan, Okayama City General Medical Center, Okayama City Hospital, Okayama, Japan, Dept. of Orthopaedic Surgery, Grad. School of Medicine, Dentistry and Pharmaceutical Sciences 备注:6 pages, 5 figures, accepted to SPIE Medical Imaging 2022 摘要:卷积神经网络在骨龄评估中起着关键作用,用于在多种模态和身体部位下研究内分泌、遗传和生长障碍。然而,尽管骨龄进展/回归具有潜在的应用价值:骨相关疾病诊断、临床知识获取和博物馆教育,但尚无研究人员探索过它。因此,我们提出骨龄进展生成对抗网络(BAPGAN),在保留身份和真实感的同时,对股骨/指骨X射线图像进行骨龄进展/回归。我们通过Fréchet Inception Distance、两位骨科专家的视觉图灵测试和t分布随机邻域嵌入(t-SNE),充分验证了BAPGAN的临床潜力。 摘要:Convolutional Neural Networks play a key role in bone age assessment for investigating endocrinology, genetic, and growth disorders under various modalities and body regions. However, no researcher has tackled bone age progression/regression despite its valuable potential applications: bone-related disease diagnosis, clinical knowledge acquisition, and museum education. Therefore, we propose Bone Age Progression Generative Adversarial Network (BAPGAN) to progress/regress both femur/phalange X-ray images while preserving identity and realism. We exhaustively confirm the BAPGAN's clinical potential via Frechet Inception Distance, Visual Turing Test by two expert orthopedists, and t-Distributed Stochastic Neighbor Embedding.

【5】 Bridging the gap between paired and unpaired medical image translation 标题:弥合配对与非配对医学影像翻译之间的差距 链接:https://arxiv.org/abs/2110.08407

作者:Pauliina Paavilainen,Saad Ullah Akram,Juho Kannala 机构:Aalto University, Finland, MVision AI, Finland 备注:Deep Generative Models for MICCAI (DGM4MICCAI) workshop 2021 摘要:医学图像翻译有可能通过消除捕获某些序列的需要来减少成像工作量,并减少开发机器学习方法的标注负担。GANs已经成功地用于将图像从一个域转换到另一个域,如MR到CT。目前,学习好的翻译模型需要成对的数据(配准的MR和CT图像)或额外的监督(例如分割掩模)。对多种模态进行配准或在每种模态中标注结构是一项繁琐而费力的任务。因此,有必要改进针对非配对数据的翻译方法。这里,我们介绍了用于CT$\rightarrow$MR和MR$\rightarrow$CT任务的改进pix2pix模型,使用非配对的CT和MR数据以及由MR扫描生成的MRCAT对进行训练。所提出的修改利用配对的MR和MRCAT图像来确保输入图像和翻译图像之间的良好对齐,而非配对的CT图像确保MR$\rightarrow$CT模型生成逼真的CT、CT$\rightarrow$MR模型在以真实CT作为输入时也能良好工作。所提出的pix2pix变体在FID和KID方面优于基线pix2pix、pix2pixHD和CycleGAN,并生成更逼真的CT和MR翻译结果。 摘要:Medical image translation has the potential to reduce the imaging workload, by removing the need to capture some sequences, and to reduce the annotation burden for developing machine learning methods. GANs have been used successfully to translate images from one domain to another, such as MR to CT. At present, paired data (registered MR and CT images) or extra supervision (e.g. segmentation masks) is needed to learn good translation models. Registering multiple modalities or annotating structures within each of them is a tedious and laborious task. Thus, there is a need to develop improved translation methods for unpaired data. Here, we introduce modified pix2pix models for tasks CT$\rightarrow$MR and MR$\rightarrow$CT, trained with unpaired CT and MR data, and MRCAT pairs generated from the MR scans. The proposed modifications utilize the paired MR and MRCAT images to ensure good alignment between input and translated images, and unpaired CT images ensure the MR$\rightarrow$CT model produces realistic-looking CT and CT$\rightarrow$MR model works well with real CT as input. The proposed pix2pix variants outperform baseline pix2pix, pix2pixHD and CycleGAN in terms of FID and KID, and generate more realistic looking CT and MR translations.

【6】 MedAug: Contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation 标题:MedAug:利用患者元数据的对比性学习改善了胸部X光解释的表示 链接:https://arxiv.org/abs/2102.10663

作者:Yen Nhi Truong Vu,Richard Wang,Niranjan Balachandar,Can Liu,Andrew Y. Ng,Pranav Rajpurkar 机构:Equal Contribution, Department of Computer Science, Stanford University, School of Medicine, Stanford University 摘要:同一图像的多个视图对之间的自监督对比学习已被证明能够成功地利用未标记数据,为自然图像和医学图像生成有意义的视觉表示。然而,针对医学图像如何选择样本对的研究仍然有限,而患者元数据恰恰可以用来改进表示。在这项工作中,我们开发了一种方法,通过使用患者元数据,从可能不同的图像视图中选择正样本对。我们比较了为胸部X射线解读选择正样本对的策略,包括要求它们来自同一患者、同一影像学检查或同一侧别。我们通过在1%的标记数据上微调线性层以进行胸腔积液分类,来评估下游任务性能。我们表现最佳的正样本对选择策略是使用同一患者同一检查的所有侧别的图像,与ImageNet预训练基线相比,平均AUC提高了14.4%。我们的对照实验表明,提高下游疾病分类性能的关键在于:(1)使用患者元数据,从具有相同基础病理的不同图像中恰当地创建正样本对;(2)最大化查询配对中使用的不同图像的数量。此外,我们还探索了利用患者元数据为对比学习选择困难负样本对,但没有发现比不使用元数据的基线有所改进。我们的方法广泛适用于医学图像解读,并允许在为对比学习选择样本对时灵活地结合医学见解。 摘要:Self-supervised contrastive learning between pairs of multiple views of the same image has been shown to successfully leverage unlabeled data to produce meaningful visual representations for both natural and medical images. However, there has been limited work on determining how to select pairs for medical images, where availability of patient metadata can be leveraged to improve representations. In this work, we develop a method to select positive pairs coming from views of possibly different images through the use of patient metadata. We compare strategies for selecting positive pairs for chest X-ray interpretation including requiring them to be from the same patient, imaging study or laterality. We evaluate downstream task performance by fine-tuning the linear layer on 1% of the labeled dataset for pleural effusion classification. Our best performing positive pair selection strategy, which involves using images from the same patient from the same study across all lateralities, achieves a performance increase of 14.4% in mean AUC from the ImageNet pretrained baseline. Our controlled experiments show that the keys to improving downstream performance on disease classification are (1) using patient metadata to appropriately create positive pairs from different images with the same underlying pathologies, and (2) maximizing the number of different images used in query pairing. In addition, we explore leveraging patient metadata to select hard negative pairs for contrastive learning, but do not find improvement over baselines that do not use metadata. Our method is broadly applicable to medical image interpretation and allows flexibility for incorporating medical insights in choosing pairs for contrastive learning.
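"按患者元数据选取正样本对"可以归结为一个简单的分组配对函数:同一患者同一检查的所有影像两两成对。下面的字段名(patient_id、study_id、path)是假设的元数据格式,仅示意这一选取策略,并非论文官方代码。

```python
# 示意:按患者元数据分组生成对比学习正样本对,字段名为假设格式。
from collections import defaultdict
from itertools import combinations

def positive_pairs_from_metadata(records):
    """records: [{'patient_id':..., 'study_id':..., 'path':...}, ...]
    返回 (path_i, path_j) 正样本对列表。"""
    groups = defaultdict(list)
    for r in records:
        groups[(r["patient_id"], r["study_id"])].append(r["path"])
    pairs = []
    for paths in groups.values():
        pairs.extend(combinations(paths, 2))   # 同患者同检查内两两配对
    return pairs

records = [
    {"patient_id": "p1", "study_id": "s1", "path": "p1_s1_frontal.png"},
    {"patient_id": "p1", "study_id": "s1", "path": "p1_s1_lateral.png"},
    {"patient_id": "p2", "study_id": "s1", "path": "p2_s1_frontal.png"},
]
print(positive_pairs_from_metadata(records))
# [('p1_s1_frontal.png', 'p1_s1_lateral.png')]
```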

GAN|对抗|攻击|生成相关(8篇)

【1】 Continuation of Famous Art with AI: A Conditional Adversarial Network Inpainting Approach 标题:用人工智能续作名画:一种条件对抗网络图像修复方法 链接:https://arxiv.org/abs/2110.09170

作者:Jordan J. Bird 机构:Computational Intelligence and Applications Research Group (CIA), Department of Computer Science, Nottingham Trent University, Nottingham, United Kingdom 摘要:受真实艺术品启发的最新图像合成技术,大多要么完全由滤波后的随机噪声生成,要么基于风格迁移。这项工作探索将图像修复应用于名画续作,并利用条件GAN产生生成式艺术。在训练阶段,图像的边界被裁剪,只留下中心部分;然后,修复GAN的任务是通过最小化对抗损失和绝对差异损失,学习从中心裁剪重建原始图像。一旦网络训练完成,图像将被缩小尺寸而非裁剪,并作为输入提供给生成器。在学习过程之后,生成器从原始作品的边缘向外延续,创作新图像。我们使用4766幅风景画(印象主义和浪漫主义)、1167幅日本江户时代浮世绘作品和4968幅抽象艺术作品的数据集进行了三个实验。结果表明,在扩展真实艺术品时,生成器实现了几何和纹理(包括画布和颜料),以及天空、云、水、陆地(包括丘陵和山脉)、草和花等景物。在浮世绘实验中观察到,由于输入图像中存在未绘制的边框,即使原始图像中没有任何文字,也会生成书写文字等特征。 摘要:Much of the state-of-the-art in image synthesis inspired by real artwork are either entirely generative by filtered random noise or inspired by the transfer of style. This work explores the application of image inpainting to continue famous artworks and produce generative art with a Conditional GAN. During the training stage of the process, the borders of images are cropped, leaving only the centre. An inpainting GAN is then tasked with learning to reconstruct the original image from the centre crop by way of minimising both adversarial and absolute difference losses. Once the network is trained, images are then resized rather than cropped and presented as input to the generator. Following the learning process, the generator then creates new images by continuing from the edges of the original piece. Three experiments are performed with datasets of 4766 landscape paintings (impressionism and romanticism), 1167 Ukiyo-e works from the Japanese Edo period, and 4968 abstract artworks. Results show that geometry and texture (including canvas and paint) as well as scenery such as sky, clouds, water, land (including hills and mountains), grass, and flowers are implemented by the generator when extending real artworks. In the Ukiyo-e experiments, it was observed that features such as written text were generated even in cases where the original image did not have any, due to the presence of an unpainted border within the input image.

【2】 StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis 标题:StyleNeRF:一种基于样式的高分辨率图像合成3D感知生成器 链接:https://arxiv.org/abs/2110.08985

作者:Jiatao Gu,Lingjie Liu,Peng Wang,Christian Theobalt 机构:†Facebook AI, ‡Max Planck Institute for Informatics, ⋄The University of Hong Kong 备注:24 pages, 19 figures. Project page: this http URL 摘要:我们提出了StyleNeRF,一种三维感知生成模型,用于具有高度多视图一致性的照片级真实感高分辨率图像合成,并可在非结构化二维图像上训练。现有的方法要么不能合成具有精细细节的高分辨率图像,要么会产生明显的3D不一致伪影。此外,它们中的许多缺乏对样式属性和显式3D相机姿势的控制。StyleNeRF将神经辐射场(NeRF)集成到基于样式的生成器中,以解决上述挑战,即提高高分辨率图像生成的渲染效率和3D一致性。我们仅执行体渲染来生成一个低分辨率的特征图,并逐步在二维空间应用上采样来解决第一个问题。为了缓解二维上采样造成的不一致性,我们提出了多种设计,包括更好的上采样器和新的正则化损失。通过这些设计,StyleNeRF可以以交互速率合成高分辨率图像,同时保持高质量的3D一致性。StyleNeRF还可以控制相机姿势和不同级别的样式,并能泛化到未见过的视图。它还支持具有挑战性的任务,包括放大和缩小、样式混合、反演和语义编辑。 摘要:We propose StyleNeRF, a 3D-aware generative model for photo-realistic high-resolution image synthesis with high multi-view consistency, which can be trained on unstructured 2D images. Existing approaches either cannot synthesize high-resolution images with fine details or yield noticeable 3D-inconsistent artifacts. In addition, many of them lack control over style attributes and explicit 3D camera poses. StyleNeRF integrates the neural radiance field (NeRF) into a style-based generator to tackle the aforementioned challenges, i.e., improving rendering efficiency and 3D consistency for high-resolution image generation. We perform volume rendering only to produce a low-resolution feature map and progressively apply upsampling in 2D to address the first issue. To mitigate the inconsistencies caused by 2D upsampling, we propose multiple designs, including a better upsampler and a new regularization loss. With these designs, StyleNeRF can synthesize high-resolution images at interactive rates while preserving 3D consistency at high quality. StyleNeRF also enables control of camera poses and different levels of styles, which can generalize to unseen views. It also supports challenging tasks, including zoom-in and-out, style mixing, inversion, and semantic editing.

【3】 MeronymNet: A Hierarchical Approach for Unified and Controllable Multi-Category Object Generation 标题:MeronymNet:一种层次化的统一可控的多类别对象生成方法 链接:https://arxiv.org/abs/2110.08818

作者:Rishabh Baghel,Abhishek Trivedi,Tejas Ravichandran,Ravi Kiran Sarvadevabhatla 机构:CVIT, IIIT Hyderabad, Hyderabad, INDIA 备注:Accepted at ACM Multimedia (ACMMM) 2021 [ORAL] . Website : this https URL arXiv admin note: text overlap with arXiv:2006.00190 摘要:我们介绍了MeronymNet,这是一种新的分层方法,用于使用单个统一模型生成可控的、基于部分的多类别对象。我们采用了一种从粗到精的引导策略,包括语义条件下生成边界框布局、像素级部分布局,最终生成对象描述本身。我们使用图卷积网络、深度递归网络以及定制设计的条件变分自动编码器,以可控方式实现灵活、多样和类别感知的二维对象生成。生成对象的性能分数反映了MeronymNet与多个强基线和消融变体相比的优越性能。我们还展示了MeronymNet在不同结构和语义粒度级别上适用于可控对象生成和交互式对象编辑。 摘要:We introduce MeronymNet, a novel hierarchical approach for controllable, part-based generation of multi-category objects using a single unified model. We adopt a guided coarse-to-fine strategy involving semantically conditioned generation of bounding box layouts, pixel-level part layouts and ultimately, the object depictions themselves. We use Graph Convolutional Networks, Deep Recurrent Networks along with custom-designed Conditional Variational Autoencoders to enable flexible, diverse and category-aware generation of 2-D objects in a controlled manner. The performance scores for generated objects reflect MeronymNet's superior performance compared to multiple strong baselines and ablative variants. We also showcase MeronymNet's suitability for controllable object generation and interactive object editing at various levels of structural and semantic granularity.

【4】 Taming Visually Guided Sound Generation 标题:驯服视觉引导的声音生成 链接:https://arxiv.org/abs/2110.08791

作者:Vladimir Iashin,Esa Rahtu 机构:Computing Sciences, Tampere University, Tampere, Finland 备注:Accepted as an oral presentation for the BMVC 2021. Code: this https URL Project page: this https URL 摘要:视觉诱导音频生成的最新进展仅限于采样短小、低保真度、单一类别的声音;此外,在高端GPU上,从最先进的模型中采样1秒音频需要数分钟。在这项工作中,我们提出了一个单一模型,它能够在比播放该音频更短的时间内(单个GPU上),根据开放域视频的一组帧提示,生成与视觉相关的高保真声音。我们训练一个Transformer,在给定一组视频特征的情况下,从预训练的频谱图码本中采样新的频谱图。该码本由VQGAN的一个变体获得,该变体经过训练以产生紧凑的采样空间,并使用一种新的基于频谱图的感知损失。生成的频谱图使用基于窗口的GAN转换为波形,显著加快了生成速度。考虑到缺乏自动评估生成频谱图的指标,我们还构建了一系列称为FID和MKL的指标。这些指标基于一种称为Melception的新型声音分类器,旨在评估开放域样本的保真度和相关性。我们在小型和大型数据集上进行了定性和定量研究,以评估生成样本的保真度和相关性。我们还将我们的模型与最新技术进行了比较,并观察到在质量、大小和计算时间方面的实质性改进。代码、演示和示例:v-iashin.github.io/SpecVQGAN 摘要:Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN

【5】 Multimodal Dialogue Response Generation 标题:多模态对话响应生成 链接:https://arxiv.org/abs/2110.08515

作者:Qingfeng Sun,Yujing Wang,Can Xu,Kai Zheng,Yaming Yang,Huang Hu,Fei Xu,Jessica Zhang,Xiubo Geng,Daxin Jiang 机构:Microsoft STC Aisa, Microsoft Research Asia 备注:This paper has been submitted before 15th October @ 11:59pm AOE(UTC -12) 摘要:以图像作出响应已被认为是智能会话代理的一项重要能力。然而,现有研究只关注依赖检索式方法的多模态对话模型,而忽略了生成式方法。为了填补这一空白,我们首先提出了一个多模态对话生成模型,该模型将对话历史作为输入,然后生成文本序列或图像作为响应。学习这样的模型通常需要同时包含文本和图像的多模态对话,而这类数据难以获得。受实践中这一挑战的启发,我们在只有有限训练样本可用的自然假设下研究多模态对话生成。在这种低资源环境下,我们设计了一种新的会话代理Divter,以便将依赖于多模态对话的参数从整个生成模型中分离出来。通过这种方法,模型的主要部分可以分别从大量纯文本对话和文本-图像对中学习,然后利用有限的训练样本拟合全部参数。大量实验表明,我们的方法在自动和人工评估中均达到了最先进的效果,并且能够生成信息丰富的文本和高分辨率的图像响应。 摘要:Responsing with image has been recognized as an important capability for an intelligent conversational agent. Yet existing works only focus on exploring the multimodal dialogue models which depend on retrieval-based methods, but neglecting generation methods. To fill in the gaps, we first present a multimodal dialogue generation model, which takes the dialogue history as input, then generates a textual sequence or an image as response. Learning such a model often requires multimodal dialogues containing both texts and images which are difficult to obtain. Motivated by the challenge in practice, we consider multimodal dialogue generation under a natural assumption that only limited training examples are available. In such a low-resource setting, we devise a novel conversational agent, Divter, in order to isolate parameters that depend on multimodal dialogues from the entire generation model. By this means, the major part of the model can be learned from a large number of text-only dialogues and text-image pairs respectively, then the whole parameters can be well fitted using the limited training examples. Extensive experiments demonstrate our method achieves state-of-the-art results in both automatic and human evaluation, and can generate informative text and high-resolution image responses.

【6】 Vit-GAN: Image-to-image Translation with Vision Transformers and Conditional GANs 标题:Vit-GAN:基于视觉Transformer和条件GAN的图像到图像翻译 链接:https://arxiv.org/abs/2110.09305

作者:Yiğit Gündüç 摘要:在本文中,我们开发了一个通用架构Vit-GAN,它能够执行从语义图像分割到单幅图像深度感知的大部分图像到图像翻译任务。本文是一篇后续工作,是对基于生成器的模型[1]的扩展,该模型此前已取得非常有希望的结果,这为借助对抗式体系结构进一步改进提供了可能。我们使用了一种独特的基于视觉Transformer的生成器架构,以及带有马尔可夫鉴别器(PatchGAN)的条件GAN(cGANs)(https://github.com/YigitGunduc/vit-gan)。在目前的工作中,我们使用图像作为条件参数。实验表明,所得结果比常用体系结构更为逼真。 摘要:In this paper, we have developed a general-purpose architecture, Vit-Gan, capable of performing most of the image-to-image translation tasks from semantic image segmentation to single image depth perception. This paper is a follow-up paper, an extension of generator-based model [1] in which the obtained results were very promising. This opened the possibility of further improvements with adversarial architecture. We used a unique vision transformers-based generator architecture and Conditional GANs(cGANs) with a Markovian Discriminator (PatchGAN) (https://github.com/YigitGunduc/vit-gan). In the present work, we use images as conditioning arguments. It is observed that the obtained results are more realistic than the commonly used architectures.

【7】 CT-SGAN: Computed Tomography Synthesis GAN 标题:CT-SGAN:CT合成GAN 链接:https://arxiv.org/abs/2110.09288

作者:Ahmad Pesaranghader,Yiping Wang,Mohammad Havaei 机构: McGill University, Quebec AI Institute (Mila), University of Waterloo, Imagia Inc. 备注:In Proceedings of MICCAI Deep Generative Models workshop, October 2021 摘要:数据的多样性对于成功训练深度学习模型至关重要。利用循环生成对抗网络,我们提出了CT-SGAN模型,该模型在胸部CT扫描的小数据集上训练后,可生成大规模3D合成CT扫描体积($\geq 224\times224\times224$)。CT-SGAN为医学成像中机器学习面临的两大挑战提供了一个有吸引力的解决方案:给定的i.i.d.训练数据数量少,以及患者数据共享方面的限制使人们难以快速获得更大、更多样化的数据集。我们使用各种指标(包括Fréchet Inception距离和Inception分数)定性和定量地评估生成图像的保真度。我们进一步表明,通过在大量合成数据上预训练分类器,CT-SGAN可以显著提高肺结节检测的准确性。 摘要:Diversity in data is critical for the successful training of deep learning models. Leveraged by a recurrent generative adversarial network, we propose the CT-SGAN model that generates large-scale 3D synthetic CT-scan volumes ($\geq 224\times224\times224$) when trained on a small dataset of chest CT-scans. CT-SGAN offers an attractive solution to two major challenges facing machine learning in medical imaging: a small number of given i.i.d. training data, and the restrictions around the sharing of patient data preventing to rapidly obtain larger and more diverse datasets. We evaluate the fidelity of the generated images qualitatively and quantitatively using various metrics including Fr\'echet Inception Distance and Inception Score. We further show that CT-SGAN can significantly improve lung nodule detection accuracy by pre-training a classifier on a vast amount of synthetic data.

【8】 SAGAN: Adversarial Spatial-asymmetric Attention for Noisy Nona-Bayer Reconstruction 标题:SAGAN:含噪Nona-Bayer重建的对抗性空间非对称注意力 链接:https://arxiv.org/abs/2110.08619

作者:S M A Sharif,Rizwan Ali Naqvi,Mithun Biswas 机构: Rigel-IT, Bangladesh, Sejong University, South Korea 摘要:诺纳-拜耳滤色器阵列(CFA)图案被认为是传统拜耳图案最可行的替代品之一。尽管具有实质性的优势,但这种非拜耳CFA模式在从噪声传感器数据重建RGB图像时容易产生视觉伪影。本研究全面解决了从噪声的Nona-Bayer CFA中学习RGB图像重建的挑战。我们提出了一种新的空间非对称注意模块来联合学习双向转换和大核全局注意,以减少视觉伪影。我们将我们提出的模块与对抗性学习相结合,从Nona-Bayer CFA生成合理的图像。验证了该方法的可行性,并与现有的图像重建方法进行了比较。实验表明,该方法可以在不产生任何视觉干扰的情况下,从有噪声的Nona-Bayer CFA重建RGB图像。此外,它在定性和定量比较方面都优于最先进的图像重建方法。可用代码:https://github.com/sharif-apu/SAGAN_BMVC21. 摘要:Nona-Bayer colour filter array (CFA) pattern is considered one of the most viable alternatives to traditional Bayer patterns. Despite the substantial advantages, such non-Bayer CFA patterns are susceptible to produce visual artefacts while reconstructing RGB images from noisy sensor data. This study addresses the challenges of learning RGB image reconstruction from noisy Nona-Bayer CFA comprehensively. We propose a novel spatial-asymmetric attention module to jointly learn bi-direction transformation and large-kernel global attention to reduce the visual artefacts. We combine our proposed module with adversarial learning to produce plausible images from Nona-Bayer CFA. The feasibility of the proposed method has been verified and compared with the state-of-the-art image reconstruction method. The experiments reveal that the proposed method can reconstruct RGB images from noisy Nona-Bayer CFA without producing any visually disturbing artefacts. Also, it can outperform the state-of-the-art image reconstruction method in both qualitative and quantitative comparison. Code available: https://github.com/sharif-apu/SAGAN_BMVC21.

自动驾驶|车辆|车道检测等(1篇)

【1】 MAAD: A Model and Dataset for "Attended Awareness" in Driving 标题:MAAD:驾驶中"参与意识"的模型与数据集 链接:https://arxiv.org/abs/2110.08610

作者:Deepak Gopinath,Guy Rosman,Simon Stent,Katsuya Terahata,Luke Fletcher,Brenna Argall,John Leonard 备注:25 pages, 13 figures, 14 tables, Accepted at EPIC@ICCV 2021 Workshop. Main paper + Supplementary Material 摘要:我们提出了一个计算模型来估计一个人对环境的参与意识。我们将参与意识定义为一个人在最近的历史中参与过的潜在动态场景的那些部分,并且他们很可能仍对这些部分保有实际感知。我们的模型以视频和带噪注视估计的形式作为输入场景信息,并输出视觉显著性、精细注视估计和人的参与意识估计。为了测试我们的模型,我们用一个高精度的凝视跟踪器捕获了一个新的数据集,其中包括23名观看驾驶场景视频的受试者24.5小时的凝视序列。该数据集还包含基于扫描路径观察的受试者参与意识的第三方注释。我们的结果表明,我们的模型能够在受控环境下合理估计参与意识,并且在未来可能扩展到真实的以自我为中心的驾驶数据,以帮助在安全系统中实现更有效的提前警告,从而提高驾驶员的驾驶性能。我们还使用我们的数据集和现有的显著性数据集证明了我们的模型在显著性、凝视校准和去噪任务上的有效性。我们的模型和数据集可在 https://github.com/ToyotaResearchInstitute/att-aware/ 获取。 摘要:We propose a computational model to estimate a person's attended awareness of their environment. We define attended awareness to be those parts of a potentially dynamic scene which a person has attended to in recent history and which they are still likely to be physically aware of. Our model takes as input scene information in the form of a video and noisy gaze estimates, and outputs visual saliency, a refined gaze estimate, and an estimate of the person's attended awareness. In order to test our model, we capture a new dataset with a high-precision gaze tracker including 24.5 hours of gaze sequences from 23 subjects attending to videos of driving scenes. The dataset also contains third-party annotations of the subjects' attended awareness based on observations of their scan path. Our results show that our model is able to reasonably estimate attended awareness in a controlled setting, and in the future could potentially be extended to real egocentric driving data to help enable more effective ahead-of-time warnings in safety systems and thereby augment driver performance. We also demonstrate our model's effectiveness on the tasks of saliency, gaze calibration, and denoising, using both our dataset and an existing saliency dataset. We make our model and dataset available at https://github.com/ToyotaResearchInstitute/att-aware/.

Attention注意力(3篇)

【1】 Finding Strong Gravitational Lenses Through Self-Attention 标题:通过自注意力寻找强引力透镜 链接:https://arxiv.org/abs/2110.09202

作者:Hareesh Thuruthipilly,Adam Zadrozny,Agnieszka Pollo 机构: National Centre for Nuclear Research, Warsaw, Poland, Jagiellonian University, Kraków, Poland 备注:11 Pages, 4 tables and 10 Figures 摘要:即将到来的大规模巡天预计将通过分析比当前多出多个数量级的数据,发现大约$10^5$个强引力透镜系统。在这种情况下,非自动化技术将极具挑战性且耗时。我们提出了一种新的基于自注意力原理的自动化结构来寻找强引力透镜。我们研究了基于自注意力的编码器模型相对于卷积神经网络的优势,并对编码器模型进行了分析以优化性能。我们构建了21个基于自注意力的编码器模型和四个经过训练的卷积神经网络,用于识别博洛尼亚透镜挑战赛中的引力透镜。每个模型分别使用18000幅模拟图像进行训练,使用2000幅图像进行交叉验证,然后应用于包含100000幅图像的测试集。我们使用四种不同的指标进行评估:分类准确度、受试者操作特征曲线下面积(AUROC)、$TPR_0$得分和$TPR_{10}$得分。我们比较了基于自注意力的编码器模型和参与挑战赛的CNN的性能。编码器模型的性能优于CNN,并在$TPR_0$和$TPR_{10}$指标上大幅超过了参加博洛尼亚透镜挑战赛的CNN模型。就AUROC而言,编码器模型仅使用CNN六分之一的参数就取得了与顶级CNN模型相当的得分。与简单的CNN相比,基于自注意力的模型具有明显的优势。较低的计算成本和复杂度使其成为当前所用残差神经网络的强有力竞争架构。此外,引入编码器层还可以通过充当有效的滤波器来缓解CNN中存在的过拟合问题。 摘要:The upcoming large scale surveys are expected to find approximately $10^5$ strong gravitational systems by analyzing data of many orders of magnitude than the current era. In this scenario, non-automated techniques will be highly challenging and time-consuming. We propose a new automated architecture based on the principle of self-attention to find strong gravitational lensing. The advantages of self-attention based encoder models over convolution neural networks are investigated and encoder models are analyzed to optimize performance. We constructed 21 self-attention based encoder models and four convolution neural networks trained to identify gravitational lenses from the Bologna Lens Challenge. Each model is trained separately using 18,000 simulated images, cross-validated using 2 000 images, and then applied to a test set with 100 000 images. We used four different metrics for evaluation: classification accuracy, the area under the receiver operating characteristic curve (AUROC), the $TPR_0$ score and the $TPR_{10}$ score. The performance of the self-attention based encoder models and CNN's participated in the challenge are compared. The encoder models performed better than the CNNs and surpassed the CNN models that participated in the bologna lens challenge by a high margin for the $TPR_0$ and $TPR_{10}$. In terms of the AUROC, the encoder models scored equivalent to the top CNN model by only using one-sixth parameters to that of the CNN. Self-Attention based models have a clear advantage compared to simpler CNNs. A low computational cost and complexity make it a highly competing architecture to currently used residual neural networks. Moreover, introducing the encoder layers can also tackle the over-fitting problem present in the CNN's by acting as effective filters.
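
为直观说明这类"基于自注意力的编码器"如何用于透镜分类,下面给出一个最小示意(Python/PyTorch)。输入尺寸、分块大小、层数等超参数均为示例假设,并非论文的原始配置:

import torch
import torch.nn as nn

class LensEncoder(nn.Module):
    def __init__(self, img_size=101, patch=10, dim=64, depth=4, heads=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2  # 简单起见忽略边缘像素
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # 分块并线性嵌入
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)  # 输出"是强引力透镜"的logit

    def forward(self, x):                               # x: (B, 1, H, W)
        tok = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls.expand(tok.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tok], dim=1) + self.pos)
        return self.head(z[:, 0])                       # 用CLS token做分类

logits = LensEncoder()(torch.randn(8, 1, 101, 101))     # (8, 1)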

【2】 Multi-View Stereo Network with attention thin volume 标题:基于注意力薄体积的多视图立体网络 链接:https://arxiv.org/abs/2110.08556

作者:Zihang Wan 摘要:我们提出了一种有效的多视图立体(MVS)网络,用于从多幅RGB图像推断深度值。最近的研究表明,将真实空间中的几何关系映射到神经网络是MVS问题的一个基本主题。具体来说,这些方法侧重于如何通过构造一个好的成本量(cost volume)来表示不同视图之间的对应关系。在本文中,我们在吸收以往经验的基础上,提出了一种更完整的成本量构建方法。首先,我们引入自注意力机制,从输入图像中充分聚集主导信息,并精确地建模长程依赖关系,从而有选择地聚集参考特征。其次,我们将分组相关引入到特征聚合中,大大减少了内存和计算负担。同时,该方法增强了不同特征通道之间的信息交互。通过这种方法,构建了更轻量级和更高效的成本量。最后,我们遵循从粗到精的策略,借助不确定性估计,逐尺度细化深度采样范围。我们进一步将上述步骤结合起来,得到注意力薄体积(attention thin volume)。定性和定量实验证明了该模型的性能。 摘要:We propose an efficient multi-view stereo (MVS) network for infering depth value from multiple RGB images. Recent studies have shown that mapping the geometric relationship in real space to neural network is an essential topic of the MVS problem. Specifically, these methods focus on how to express the correspondence between different views by constructing a nice cost volume. In this paper, we propose a more complete cost volume construction approach based on absorbing previous experience. First of all, we introduce the self-attention mechanism to fully aggregate the dominant information from input images and accurately model the long-range dependency, so as to selectively aggregate reference features. Secondly, we introduce the group-wise correlation to feature aggregation, which greatly reduces the memory and calculation burden. Meanwhile, this method enhances the information interaction between different feature channels. With this approach, a more lightweight and efficient cost volume is constructed. Finally we follow the coarse to fine strategy and refine the depth sampling range scale by scale with the help of uncertainty estimation. We further combine the previous steps to get the attention thin volume. Quantitative and qualitative experiments are presented to demonstrate the performance of our model.
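
文中提到的"分组相关"是降低成本量通道数的常用手段;下面给出一个最小示意(Python/PyTorch)。其中特征形状、分组数G以及变量命名均为示例假设,并非论文源码:

import torch

def groupwise_correlation(ref, src, G=8):
    """ref: 参考特征; src: 经单应变换对齐后的源特征, 形状均为 (B, C, D, H, W), D为深度假设数。"""
    B, C, D, H, W = ref.shape
    assert C % G == 0
    ref = ref.view(B, G, C // G, D, H, W)
    src = src.view(B, G, C // G, D, H, W)
    # 组内通道做内积并取均值, 输出通道数从C降为G, 从而大幅降低显存与计算量
    return (ref * src).mean(dim=2)             # (B, G, D, H, W)

ref = torch.randn(2, 32, 48, 64, 80)
warped = torch.randn(2, 32, 48, 64, 80)
cost = groupwise_correlation(ref, warped)      # (2, 8, 48, 64, 80)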

【3】 Attention W-Net: Improved Skip Connections for better Representations 标题:注意W-Net:改进了跳过连接以获得更好的表示 链接:https://arxiv.org/abs/2110.08811

作者:Shikhar Mohan,Saumik Bhattacharya,Sayantari Ghosh 机构: Indian Institute of Technology Kharagpur, India-, National Institute of Technology Durgapur, India- 备注:Under review at ICASSP 2022, Singapore 摘要:眼底镜视网膜图像中大血管和微血管结构的分割在多种视网膜和全身疾病的检测中起着至关重要的作用,但这是一个很难解决的问题。此任务的大多数深度学习方法涉及基于自动编码器的体系结构,但它们面临一些问题,例如缺少足够的参数、参数足够时的过度拟合以及内部特征空间之间的不兼容。由于这些问题,这些技术无法从此类任务有限的数据中提取最佳语义信息。我们提出了注意力W-Net,一种新的基于U-Net的视网膜血管分割架构来解决这些问题。在这个以LadderNet为主干的体系结构中,我们有两个主要贡献:注意力块和正则化措施。我们的注意力块使用解码器特征在上采样期间关注跳跃连接传来的编码器特征,从而在编码器和解码器特征相加时实现更高的兼容性。我们的正则化措施包括图像增强和对所用ResNet块的修改,以防止过度拟合。通过这些添加,我们观察到AUC和F1得分分别为0.8407和0.9833,这是对其LadderNet主干的显著改进,在当代最先进的方法中也具有竞争力。 摘要:Segmentation of macro and microvascular structures in fundoscopic retinal images plays a crucial role in detection of multiple retinal and systemic diseases, yet it is a difficult problem to solve. Most deep learning approaches for this task involve an autoencoder based architecture, but they face several issues such as lack of enough parameters, overfitting when there are enough parameters and incompatibility between internal feature-spaces. Due to such issues, these techniques are hence not able to extract the best semantic information from the limited data present for such tasks. We propose Attention W-Net, a new U-Net based architecture for retinal vessel segmentation to address these problems. In this architecture with a LadderNet backbone, we have two main contributions: Attention Block and regularisation measures. Our Attention Block uses decoder features to attend over the encoder features from skip-connections during upsampling, resulting in higher compatibility when the encoder and decoder features are added. Our regularisation measures include image augmentation and modifications to the ResNet Block used, which prevent overfitting. With these additions, we observe an AUC and F1-Score of 0.8407 and 0.9833 - a sizeable improvement over its LadderNet backbone as well as competitive performance among the contemporary state-of-the-art methods.
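
文中的注意力块思想(解码器特征作为门控, 对跳跃连接的编码器特征加权)可以用如下最小示意表达(Python/PyTorch)。具体结构与通道数为假设, 并非论文源码:

import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, ch_enc, ch_dec, ch_mid):
        super().__init__()
        self.wx = nn.Conv2d(ch_enc, ch_mid, 1)
        self.wg = nn.Conv2d(ch_dec, ch_mid, 1)
        self.psi = nn.Sequential(nn.Conv2d(ch_mid, 1, 1), nn.Sigmoid())

    def forward(self, x, g):          # x: 编码器跳跃连接特征, g: 上采样后的解码器特征
        a = self.psi(torch.relu(self.wx(x) + self.wg(g)))  # (B,1,H,W) 注意力图
        return x * a                  # 抑制不相关区域, 提高相加时两路特征的兼容性

x = torch.randn(2, 64, 48, 48)
g = torch.randn(2, 32, 48, 48)
out = AttentionGate(64, 32, 16)(x, g)   # (2, 64, 48, 48)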

人脸|人群计数(6篇)

【1】 A Literature Review of 3D Face Reconstruction From a Single Image 标题:单幅图像三维人脸重建的文献综述 链接:https://arxiv.org/abs/2110.09299

作者:Hanxin Wang 机构:University of Electronic Science and Technology of China 摘要:本文简要综述了从单幅图像重建三维人脸的最新文献。为了呈现单幅图像三维人脸重建的最新进展,所选文章大多发表于2016年至2020年之间。 摘要:This paper is a brief survey of the recent literature on 3D face reconstruction from a single image. Most articles have been choosen among 2016 and 2020, in order to provide the most up-to-date view of the single image 3D face reconstruction.

【2】 Leveraging MoCap Data for Human Mesh Recovery 标题:利用MOCAP数据进行人体网格恢复 链接:https://arxiv.org/abs/2110.09243

作者:Fabien Baradel,Thibault Groueix,Philippe Weinzaepfel,Romain Brégier,Yannis Kalantidis,Grégory Rogez 机构:Romain Br´egier, Gr´egory Rogez, NAVER LABS Europe 备注:3DV 2021 摘要:为了从图像或视频中恢复人体姿势和形状,训练最先进的模型需要具有相应注释的数据集,这些数据集很难获得,而且成本很高。我们在本文中的目标是研究3D运动捕捉(MoCap)数据中的姿势是否可以用于改进基于图像和基于视频的人体网格恢复方法。我们发现,通过提供更广泛的姿势、纹理和背景,使用来自MoCap数据的合成渲染对基于图像的模型进行微调可以提高其性能。事实上,我们表明,只需微调模型的批量规范化层就足以获得较大的收益。我们进一步研究了MoCap数据在视频中的应用,并介绍了PoseBERT,这是一个直接回归姿态参数并通过掩码建模(masked modeling)进行训练的Transformer模块。它简单、通用,可以插在任何最先进的基于图像的模型之上,以便利用时间信息将其转换为基于视频的模型。我们的实验结果表明,所提出的方法在各种数据集(包括3DPW、MPI-INF-3DHP、MuPoTS-3D、MCB和AIST)上都达到了最先进的性能。测试代码和模型将很快提供。 摘要:Training state-of-the-art models for human body pose and shape recovery from images or videos requires datasets with corresponding annotations that are really hard and expensive to obtain. Our goal in this paper is to study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods. We find that fine-tune image-based models with synthetic renderings from MoCap data can increase their performance, by providing them with a wider variety of poses, textures and backgrounds. In fact, we show that simply fine-tuning the batch normalization layers of the model is enough to achieve large gains. We further study the use of MoCap data for video, and introduce PoseBERT, a transformer module that directly regresses the pose parameters and is trained via masked modeling. It is simple, generic and can be plugged on top of any state-of-the-art image-based model in order to transform it in a video-based model leveraging temporal information. Our experimental results show that the proposed approaches reach state-of-the-art performance on various datasets including 3DPW, MPI-INF-3DHP, MuPoTS-3D, MCB and AIST. Test code and models will be available soon.

【3】 Disentangled Representation with Dual-stage Feature Learning for Face Anti-spoofing 标题:基于两阶段特征学习的人脸反欺骗解缠表示 链接:https://arxiv.org/abs/2110.09157

作者:Yu-Chun Wang,Chien-Yi Wang,Shang-Hong Lai 机构:∗National Tsing Hua University, †Microsoft AI R&D Center, Taiwan 备注:WACV 2022 摘要:随着人脸识别在各种安全关键应用中的广泛应用,人脸反欺骗(face anti-spoofing,FAS)的研究越来越受到人们的关注。如果测试数据中的攻击类型与训练数据中的攻击类型相同,那么几种FAS方法已经取得了很好的性能,而对于看不见的攻击类型,性能会显著下降。为了防止过度适应预定义的欺骗攻击类型,必须学习更通用和更具区别性的特征。该文提出了一种新的双阶段非纠缠表示学习方法,能够有效地从无关特征中分离出欺骗相关特征。与以往的FAS解纠缠采用单阶段结构不同,我们发现双阶段训练设计可以提高训练稳定性,并有效地对特征进行编码以检测不可见的攻击类型。我们的实验表明,在多个跨类型FAS基准测试中,所提出的方法比最新的方法具有更高的精度。 摘要:As face recognition is widely used in diverse security-critical applications, the study of face anti-spoofing (FAS) has attracted more and more attention. Several FAS methods have achieved promising performances if the attack types in the testing data are the same as training data, while the performance significantly degrades for unseen attack types. It is essential to learn more generalized and discriminative features to prevent overfitting to pre-defined spoof attack types. This paper proposes a novel dual-stage disentangled representation learning method that can efficiently untangle spoof-related features from irrelevant ones. Unlike previous FAS disentanglement works with one-stage architecture, we found that the dual-stage training design can improve the training stability and effectively encode the features to detect unseen attack types. Our experiments show that the proposed method provides superior accuracy than the state-of-the-art methods on several cross-type FAS benchmarks.

【4】 VoteHMR: Occlusion-Aware Voting Network for Robust 3D Human Mesh Recovery from Partial Point Clouds 标题:VoteHMR:遮挡感知投票网络用于从部分点云中稳健地恢复三维人体网格 链接:https://arxiv.org/abs/2110.08729

作者:Guanze Liu,Yu Rong,Lu Sheng 机构:College of Software, Beihang, Beijing, China, The Chinese University of Hong Kong, Hong Kong SAR, China 备注:Our paper are accepted to MM 2021 as oral 摘要:从点云中恢复三维人体网格对于各种任务都至关重要,包括AR/VR和人类行为理解。该领域以前的工作要么需要高质量的3D人体扫描,要么需要连续的点云,这无法轻松应用于消费者级深度传感器捕获的低质量3D扫描。在本文中,我们首次尝试从单帧局部点云重建可靠的三维人体形状。为此,我们提出了一种端到端可学习的方法,称为VoteHMR。VoteHMR的核心是一种新型的遮挡感知投票网络,该网络能够从输入的部分点云可靠地生成可见的关节级特征,然后通过人体骨骼的运动树完成关节级特征。与以往的整体特征相比,关节级特征不仅能有效地编码人体几何信息,而且对具有自遮挡和缺失区域的噪声输入具有鲁棒性。该方法利用关节级特征和输入点云的全局特征提供的丰富互补线索,为统计三维人体模型(如SMPL)提供可靠且解耦的参数预测。该方法在SURREAL和DFAUST两个大规模数据集上实现了最先进的性能。此外,VoteHMR还展示了在现实世界数据集(如Berkeley MHAD)上的卓越泛化能力。 摘要:3D human mesh recovery from point clouds is essential for various tasks, including AR/VR and human behavior understanding. Previous works in this field either require high-quality 3D human scans or sequential point clouds, which cannot be easily applied to low-quality 3D scans captured by consumer-level depth sensors. In this paper, we make the first attempt to reconstruct reliable 3D human shapes from single-frame partial point clouds.To achieve this, we propose an end-to-end learnable method, named VoteHMR. The core of VoteHMR is a novel occlusion-aware voting network that can first reliably produce visible joint-level features from the input partial point clouds, and then complete the joint-level features through the kinematic tree of the human skeleton. Compared with holistic features used by previous works, the joint-level features can not only effectively encode the human geometry information but also be robust to noisy inputs with self-occlusions and missing areas. By exploiting the rich complementary clues from the joint-level features and global features from the input point clouds, the proposed method encourages reliable and disentangled parameter predictions for statistical 3D human models, such as SMPL. The proposed method achieves state-of-the-art performances on two large-scale datasets, namely SURREAL and DFAUST. Furthermore, VoteHMR also demonstrates superior generalization ability on real-world datasets, such as Berkeley MHAD.

【5】 Face Verification with Challenging Imposters and Diversified Demographics 标题:使用具有挑战性的冒名顶替者和多样化的人口统计进行人脸验证 链接:https://arxiv.org/abs/2110.08667

作者:Adrian Popescu,Liviu-Daniel Ştefan,Jérôme Deshayes-Chossart,Bogdan Ionescu 机构:Universit´e Paris-Saclay, CEA, List, F-, Palaiseau, France, University Politehnica of Bucharest, Romania 摘要:人脸验证的目的是区分真实的和冒名顶替的人脸对,它们分别包含相同或不同的身份。近年来报告的性能给人的印象是,这项任务实际上已经解决了。在这里,我们重新审视这个问题,并认为现有的评估数据集是使用两种过于简单的设计选择构建的。首先,形成冒名顶替者对的通常身份选择不够具有挑战性,因为在实践中,验证需要检测的正是具有挑战性的冒名顶替者。其次,现有数据集的基本人口统计数据往往不足以反映世界各地人们面部特征的广泛多样性。为了减轻这些限制,我们引入了$FaVCI2D$数据集。其中的冒名顶替者配对很有挑战性,因为它们包括从大量人口多样化身份中挑选的视觉相似的面孔。数据集还包括与性别、国家和年龄相关的元数据,以便于对结果进行细粒度分析。$FaVCI2D$由可自由分发的资源生成。使用在现有数据集上取得近100%性能的最先进深度模型进行的实验表明,在$FaVCI2D$上性能显著下降,证实了我们的初始假设。同样重要的是,我们分析了近年来出现的阻碍人脸分析研究发展的法律和道德挑战。我们介绍了一系列的设计选择,以应对这些挑战,并使数据集的构成和使用更加可持续和公平。$FaVCI2D$可从 https://github.com/AIMultimediaLab/FaVCI2D-Face-Verification-with-Challenging-Imposters-and-Diversified-Demographics 获取。 摘要:Face verification aims to distinguish between genuine and imposter pairs of faces, which include the same or different identities, respectively. The performance reported in recent years gives the impression that the task is practically solved. Here, we revisit the problem and argue that existing evaluation datasets were built using two oversimplifying design choices. First, the usual identity selection to form imposter pairs is not challenging enough because, in practice, verification is needed to detect challenging imposters. Second, the underlying demographics of existing datasets are often insufficient to account for the wide diversity of facial characteristics of people from across the world. To mitigate these limitations, we introduce the $FaVCI2D$ dataset. Imposter pairs are challenging because they include visually similar faces selected from a large pool of demographically diversified identities. The dataset also includes metadata related to gender, country and age to facilitate fine-grained analysis of results. $FaVCI2D$ is generated from freely distributable resources. Experiments with state-of-the-art deep models that provide nearly 100\% performance on existing datasets show a significant performance drop for $FaVCI2D$, confirming our starting hypothesis. Equally important, we analyze legal and ethical challenges which appeared in recent years and hindered the development of face analysis research. We introduce a series of design choices which address these challenges and make the dataset constitution and usage more sustainable and fairer. $FaVCI2D$ is available at~\url{https://github.com/AIMultimediaLab/FaVCI2D-Face-Verification-with-Challenging-Imposters-and-Diversified-Demographics}.

【6】 Joint 3D Human Shape Recovery from A Single Image with Bilayer-Graph 标题:基于双层图的单幅图像联合三维人体形状恢复 链接:https://arxiv.org/abs/2110.08472

作者:Xin Yu,Jeroen van Baar,Siheng Chen 机构:School of computing, University of Utah, SLC, Utah, USA, Mitsubishi Electric Research Laboratories, Cambridge, MA, USA 备注:3DV'21 摘要:从图像中估计3D人体形状和姿势的能力在许多情况下都很有用。最近的方法已经探索使用图卷积网络,并取得了有希望的结果。事实上,三维形状是由一个网格表示的,一个无向图,使得图卷积网络自然适合这个问题。然而,图卷积网络的表示能力有限。来自图中节点的信息传递给连接的邻居,信息的传播需要连续的图卷积。为了克服这一限制,我们提出了一种双尺度图方法。我们使用从稠密图中导出的粗糙图来估计人类的三维姿势,使用稠密图来估计三维形状。与密集图相比,粗糙图中的信息可以传播更长的距离。此外,有关姿势的信息可以引导恢复局部形状细节,反之亦然。我们认识到粗糙图和稠密图之间的联系本身就是一个图,并引入图融合块在不同尺度的图之间交换信息。我们对我们的模型进行了端到端的训练,并表明我们可以为多个评估数据集获得最先进的结果。 摘要:The ability to estimate the 3D human shape and pose from images can be useful in many contexts. Recent approaches have explored using graph convolutional networks and achieved promising results. The fact that the 3D shape is represented by a mesh, an undirected graph, makes graph convolutional networks a natural fit for this problem. However, graph convolutional networks have limited representation power. Information from nodes in the graph is passed to connected neighbors, and propagation of information requires successive graph convolutions. To overcome this limitation, we propose a dual-scale graph approach. We use a coarse graph, derived from a dense graph, to estimate the human's 3D pose, and the dense graph to estimate the 3D shape. Information in coarse graphs can be propagated over longer distances compared to dense graphs. In addition, information about pose can guide to recover local shape detail and vice versa. We recognize that the connection between coarse and dense is itself a graph, and introduce graph fusion blocks to exchange information between graphs with different scales. We train our model end-to-end and show that we can achieve state-of-the-art results for several evaluation datasets.

跟踪(1篇)

【1】 MTP: Multi-Hypothesis Tracking and Prediction for Reduced Error Propagation 标题:MTP:减少误差传播的多假设跟踪和预测 链接:https://arxiv.org/abs/2110.09481

作者:Xinshuo Weng,Boris Ivanovic,Marco Pavone 机构: Stanford University 备注:Project page: this https URL 摘要:最近,在开发标准感知规划机器人自主管道的各个模块方面取得了巨大进展,包括检测、跟踪、预测其他代理的轨迹和自我代理的轨迹规划。然而,对这些组件的原则性集成关注较少,特别是在级联错误的表征和缓解方面。本文通过关注跟踪和预测模块之间的耦合来解决级联误差问题。首先,通过使用最先进的跟踪和预测工具,我们对跟踪产生的错误对预测性能的影响程度进行了全面的实验评估。在KITTI和nuScenes数据集上,我们发现以跟踪轨迹作为输入的预测(实践中的典型情况)与以地面真实历史轨迹作为输入的理想环境相比,性能会显著下降(甚至达数量级)。为了解决这个问题,我们提出了一个多假设跟踪和预测框架。我们的框架不依赖于单个跟踪结果集进行预测,而是同时对多个跟踪结果集进行推理,从而增加了将准确的跟踪结果作为预测输入的可能性。我们表明,该框架在nuScenes数据集上将整体预测性能较标准单假设跟踪-预测管道最多提高34.2%;当将评估限制在涉及身份切换和轨迹片段的挑战性场景时,改进更为显著(最高约70%),同时计算开销可以接受。 摘要:Recently, there has been tremendous progress in developing each individual module of the standard perception-planning robot autonomy pipeline, including detection, tracking, prediction of other agents' trajectories, and ego-agent trajectory planning. Nevertheless, there has been less attention given to the principled integration of these components, particularly in terms of the characterization and mitigation of cascading errors. This paper addresses the problem of cascading errors by focusing on the coupling between the tracking and prediction modules. First, by using state-of-the-art tracking and prediction tools, we conduct a comprehensive experimental evaluation of how severely errors stemming from tracking can impact prediction performance. On the KITTI and nuScenes datasets, we find that predictions consuming tracked trajectories as inputs (the typical case in practice) can experience a significant (even order of magnitude) drop in performance in comparison to the idealized setting where ground truth past trajectories are used as inputs. To address this issue, we propose a multi-hypothesis tracking and prediction framework. Rather than relying on a single set of tracking results for prediction, our framework simultaneously reasons about multiple sets of tracking results, thereby increasing the likelihood of including accurate tracking results as inputs to prediction. We show that this framework improves overall prediction performance over the standard single-hypothesis tracking-prediction pipeline by up to 34.2% on the nuScenes dataset, with even more significant improvements (up to ~70%) when restricting the evaluation to challenging scenarios involving identity switches and fragments -- all with an acceptable computation overhead.
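
"同时对多个跟踪结果集进行推理"这一思路可以用下面的最小骨架示意(Python/NumPy)。其中按假设置信度做softmax加权融合只是一种可行的实例化方式, predictor/tracker接口与恒速外推均为演示用假设:

import numpy as np

def multi_hypothesis_predict(track_hypotheses, predictor):
    """track_hypotheses: [(score, past_traj)], past_traj形如(T, 2)。"""
    scores = np.array([s for s, _ in track_hypotheses], dtype=np.float64)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                   # softmax归一化假设权重
    preds = [predictor(traj) for _, traj in track_hypotheses]
    return sum(wi * p for wi, p in zip(w, preds))  # 加权融合各假设的未来轨迹

def const_velocity(past, horizon=10):              # 玩具predictor: 恒速外推
    v = past[-1] - past[-2]
    return past[-1] + np.outer(np.arange(1, horizon + 1), v)

hyps = [(0.9, np.array([[0, 0], [1, 0], [2, 0]], float)),
        (0.4, np.array([[0, 0], [1, 1], [2, 2]], float))]
future = multi_hypothesis_predict(hyps, const_velocity)   # (10, 2)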

裁剪|量化|加速|压缩相关(4篇)

【1】 Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks 标题:亚比特神经网络:学习压缩和加速二进制神经网络 链接:https://arxiv.org/abs/2110.09195

作者:Yikai Wang,Yi Yang,Fuchun Sun,Anbang Yao 机构:Beijing National Research Center for Information Science and Technology (BNRist), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Intel Corporation 备注:ICCV 2021. Code and models: this https URL 摘要:在低位量化领域,训练二进制神经网络(BNN)是简化资源受限设备上深度模型部署的极端解决方案,与32位浮点型相比,具有最低的存储成本和显著更低的位操作成本。在本文中,我们介绍了子比特神经网络(SNN),一种新的二进制量化设计,专门用于压缩和加速BNN。SNN的灵感来自于一项经验观察,表明在BNN模型的卷积层学习到的二进制核可能分布在核子集上。因此,与现有的逐个对权重进行二值化的方法不同,SNN是使用核感知优化框架进行训练的,该框架利用细粒度卷积核空间中的二值量化。具体地说,我们的方法包括一个随机采样步骤,生成特定于层的内核空间子集,以及一个细化步骤,学习通过优化调整这些二进制内核子集。视觉识别基准测试和FPGA硬件部署实验验证了SNN的巨大潜力。例如,在ImageNet上,与传统BNN相比,具有0.56位权重的ResNet-18/ResNet-34的SNN实现了3.13/3.33倍的运行时加速和1.8倍的压缩,识别精度略有下降。当应用SNN对权重和激活进行二值化时,也获得了有希望的结果。我们的代码可在https://github.com/yikaiw/SNN. 摘要:In the low-bit quantization field, training Binary Neural Networks (BNNs) is the extreme solution to ease the deployment of deep models on resource-constrained devices, having the lowest storage cost and significantly cheaper bit-wise operations compared to 32-bit floating-point counterparts. In this paper, we introduce Sub-bit Neural Networks (SNNs), a new type of binary quantization design tailored to compress and accelerate BNNs. SNNs are inspired by an empirical observation, showing that binary kernels learnt at convolutional layers of a BNN model are likely to be distributed over kernel subsets. As a result, unlike existing methods that binarize weights one by one, SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space. Specifically, our method includes a random sampling step generating layer-specific subsets of the kernel space, and a refinement step learning to adjust these subsets of binary kernels via optimization. Experiments on visual recognition benchmarks and the hardware deployment on FPGA validate the great potentials of SNNs. For instance, on ImageNet, SNNs of ResNet-18/ResNet-34 with 0.56-bit weights achieve 3.13/3.33 times runtime speed-up and 1.8 times compression over conventional BNNs with moderate drops in recognition accuracy. Promising results are also obtained when applying SNNs to binarize both weights and activations. Our code is available at https://github.com/yikaiw/SNN.
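
"二值核分布在核子集上"这一核心观察可以用下面的最小示意体会(Python/PyTorch)。注意: 论文通过采样加优化细化来学习层级子集, 此处子集为随机采样、投影用相关度最大化, 均为演示假设:

import torch

def project_to_kernel_subset(weight, subset):
    """weight: (O, I, k, k) 实值核; subset: (S, k*k) 的±1二值核子集。"""
    O, I, k, _ = weight.shape
    w = weight.view(O * I, k * k)
    idx = (w @ subset.t()).argmax(dim=1)   # 与每个候选二值核的相关度, 取最大者
    return subset[idx].view(O, I, k, k)

k = 3
S = 32                                     # 子集大小远小于2**(k*k)=512, 故平均每权重不足1比特
subset = torch.randint(0, 2, (S, k * k)).float() * 2 - 1
w = torch.randn(16, 8, k, k)
w_bin = project_to_kernel_subset(w, subset)   # 层内所有二值核都取自同一个小子集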

【2】 An Acceleration Method Based on Deep Learning and Multilinear Feature Space 标题:一种基于深度学习和多线性特征空间的加速方法 链接:https://arxiv.org/abs/2110.08679

作者:Michel Vinagreiro,Edson Kitani,Armando Lagana,Leopoldo Yoshioka 机构:Laboratory of Integrated Systems, Escola Politecnica da Universidade de Sao Paulo; Department of Automotive Electronics 备注:20 pages, International Journal of Artificial Intelligence and Applications 摘要:计算机视觉在先进辅助系统中起着至关重要的作用。大多数计算机视觉系统基于深度卷积神经网络(deep CNN)结构。然而,运行CNN算法需要大量的计算资源,因此提高计算速度的方法成为一个相关的研究课题。文献中关于体系结构简化的若干工作尚未在嵌入式实时系统应用中取得令人满意的结果。本文提出了一种基于多线性特征空间(MFS)的替代方法,该方法借助于从大型CNN结构的迁移学习。该方法使用CNN生成特征图,但其本身并不是一种复杂度降低方法。在训练过程之后,生成的特征图用于创建向量特征空间。我们使用这个新的向量空间对任何新样本进行投影以对它们进行分类。我们的方法命名为AMFC,使用来自预训练CNN的迁移学习来减少新样本图像的分类时间,且精度损失最小。我们的方法使用VGG-16模型作为CNN的基础架构进行实验;不过,该方法适用于任何类似的CNN模型。使用著名的车辆图像数据库和德国交通标志识别基准,我们比较了原始VGG-16模型和AMFC方法的分类时间,我们的方法平均快17倍。快速的分类时间降低了需要大型CNN体系结构的嵌入式应用程序的计算和内存需求。 摘要:Computer vision plays a crucial role in Advanced Assistance Systems. Most computer vision systems are based on Deep Convolutional Neural Networks (deep CNN) architectures. However, the high computational resource to run a CNN algorithm is demanding. Therefore, the methods to speed up computation have become a relevant research issue. Even though several works on architecture reduction found in the literature have not yet been achieved satisfactory results for embedded real-time system applications. This paper presents an alternative approach based on the Multilinear Feature Space (MFS) method resorting to transfer learning from large CNN architectures. The proposed method uses CNNs to generate feature maps, although it does not work as complexity reduction approach. After the training process, the generated features maps are used to create vector feature space. We use this new vector space to make projections of any new sample to classify them. Our method, named AMFC, uses the transfer learning from pre-trained CNN to reduce the classification time of new sample image, with minimal accuracy loss. Our method uses the VGG-16 model as the base CNN architecture for experiments; however, the method works with any similar CNN model. Using the well-known Vehicle Image Database and the German Traffic Sign Recognition Benchmark, we compared the classification time of the original VGG-16 model with the AMFC method, and our method is, on average, 17 times faster. The fast classification time reduces the computational and memory demands in embedded applications requiring a large CNN architecture.

【3】 Neural Network Pruning Through Constrained Reinforcement Learning 标题:基于约束强化学习的神经网络修剪 链接:https://arxiv.org/abs/2110.08558

作者:Shehryar Malik,Muhammad Umair Haider,Omer Iqbal,Murtaza Taj 机构:LUMS School of Science and Engineering 备注:Submitted to ICASSP 2021 摘要:网络修剪通过删除(修剪)神经元来减小神经网络的大小,从而使性能下降最小。传统的剪枝方法侧重于设计度量来量化神经元的有用性,而这一设计过程通常相当繁琐且次优。而最近的方法则侧重于训练辅助网络,以自动了解每个神经元有多有用。然而,它们通常不考虑计算限制。在这项工作中,我们提出了一种修剪神经网络的通用方法。我们提出的方法可以对神经网络进行修剪,从而在任意、可能不可微的函数上遵守预定义的计算预算。此外,我们只假设能够针对不同的输入评估这些函数,因此不需要事先完全指定它们。我们通过提出一种新的基于约束强化学习算法的剪枝策略来实现这一点。通过与标准图像分类数据集上的最新方法的比较,我们证明了该方法的有效性。具体而言,我们减少了VGG各种变体总参数的83%至92.90%,同时实现了与原始网络相当或更好的性能。在ResNet18上,我们还实现了75.09%的参数缩减,而不会导致任何精度损失。 摘要:Network pruning reduces the size of neural networks by removing (pruning) neurons such that the performance drop is minimal. Traditional pruning approaches focus on designing metrics to quantify the usefulness of a neuron which is often quite tedious and sub-optimal. More recent approaches have instead focused on training auxiliary networks to automatically learn how useful each neuron is however, they often do not take computational limitations into account. In this work, we propose a general methodology for pruning neural networks. Our proposed methodology can prune neural networks to respect pre-defined computational budgets on arbitrary, possibly non-differentiable, functions. Furthermore, we only assume the ability to be able to evaluate these functions for different inputs, and hence they do not need to be fully specified beforehand. We achieve this by proposing a novel pruning strategy via constrained reinforcement learning algorithms. We prove the effectiveness of our approach via comparison with state-of-the-art methods on standard image classification datasets. Specifically, we reduce 83-92.90 of total parameters on various variants of VGG while achieving comparable or better performance than that of original networks. We also achieved 75.09 reduction in parameters on ResNet18 without incurring any loss in accuracy.

【4】 Training Deep Neural Networks with Joint Quantization and Pruning of Weights and Activations 标题:基于权值与激活联合量化和剪枝的深度神经网络训练 链接:https://arxiv.org/abs/2110.08271

作者:Xinyu Zhang,Ian Colbert,Ken Kreutz-Delgado,Srinjoy Das 机构: 1Department of Electrical and Computer Engineering, 2School of Mathematical and Data Sciences, West VirginiaUniversity 摘要:量化和剪枝是用来降低深层神经网络推理代价的核心技术。最先进的量化技术目前应用于权重和激活;然而,剪枝通常只应用于网络的权重。在这项工作中,我们将新的均匀量化和非结构化剪枝方法联合应用于训练期间深度神经网络的权重和激活。使用我们的方法,我们对当前广泛的计算机视觉任务中接受的剪枝-然后量化范式进行了经验评估,并观察到当应用于深度神经网络的权值和激活时的非交换性质。根据这些观察结果,我们阐明了非交换性假设:对于针对特定任务训练的给定深度神经网络,存在一个精确的训练计划,其中可以引入量化和剪枝来优化网络性能。我们发现,这种最优排序不仅存在,而且在区分性任务和生成性任务中也不同。在我们的训练框架中使用最佳训练计划,我们展示了比现有解决方案每内存占用更高的性能。 摘要:Quantization and pruning are core techniques used to reduce the inference costs of deep neural networks. State-of-the-art quantization techniques are currently applied to both the weights and activations; however, pruning is most often applied to only the weights of the network. In this work, we jointly apply novel uniform quantization and unstructured pruning methods to both the weights and activations of deep neural networks during training. Using our methods, we empirically evaluate the currently accepted prune-then-quantize paradigm across a wide range of computer vision tasks and observe a non-commutative nature when applied to both the weights and activations of deep neural networks. Informed by these observations, we articulate the non-commutativity hypothesis: for a given deep neural network being trained for a specific task, there exists an exact training schedule in which quantization and pruning can be introduced to optimize network performance. We identify that this optimal ordering not only exists, but also varies across discriminative and generative tasks. Using the optimal training schedule within our training framework, we demonstrate increased performance per memory footprint over existing solutions.
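
文中的"非交换性"可以用一个很小的数值实验体会: 下面的均匀量化与幅值剪枝均为最简版本, 比特数与稀疏率为示例值, 仅作说明, 并非论文的训练流程:

import torch

def uniform_quantize(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def magnitude_prune(w, sparsity=0.5):
    k = int(w.numel() * sparsity)
    thresh = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > thresh)

w = torch.randn(256, 256)
w_pq = uniform_quantize(magnitude_prune(w))    # 先剪枝再量化(常见范式)
w_qp = magnitude_prune(uniform_quantize(w))    # 先量化再剪枝
# 两种顺序的结果一般不同, 正对应文中"非交换性假设"的直观体现
print(torch.allclose(w_pq, w_qp))              # 通常为False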

视觉解释|视频理解VQA|caption等(1篇)

【1】 Self-Annotated Training for Controllable Image Captioning 标题:用于可控图像字幕的自注解训练 链接:https://arxiv.org/abs/2110.08446

作者:Zhangzi Zhu,Tianlei Wang,Hong Qu 机构:University of Electronic Science and Technology of China 摘要:可控图像字幕(CIC)任务旨在根据指定的控制信号生成字幕。本文从两个方面对CIC进行了改进:1)现有的强化训练方法不适用于结构相关的CIC模型,因为基于准确性的奖励主要关注内容而不是语义结构。由于缺乏强化训练,模型无法生成更准确、更可控的句子。为了解决上述问题,我们提出了一种新的面向结构相关CIC模型的强化训练方法:自注释训练(SAT),其中设计了递归采样机制(RSM),强制输入控制信号与实际输出句子匹配。在MSCOCO上进行的大量实验表明,我们的SAT方法将C-Transformer (XE)的CIDEr-D分数在长度控制任务中从118.6提高到130.1,在时态控制任务中从132.2提高到142.7,同时与控制信号保持99%以上的匹配精度。2)我们引入了一个新的控制信号:句子质量。借助该信号,CIC模型能够根据需要生成不同质量级别的字幕。实验表明,在不借助真实标注字幕额外信息的情况下,由最高句子质量等级控制的模型在准确性上明显优于基线模型。 摘要:The Controllable Image Captioning (CIC) task aims to generate captions conditioned on designated control signals. In this paper, we improve CIC from two aspects: 1) Existing reinforcement training methods are not applicable to structure-related CIC models due to the fact that the accuracy-based reward focuses mainly on contents rather than semantic structures. The lack of reinforcement training prevents the model from generating more accurate and controllable sentences. To solve the problem above, we propose a novel reinforcement training method for structure-related CIC models: Self-Annotated Training (SAT), where a recursive sampling mechanism (RSM) is designed to force the input control signal to match the actual output sentence. Extensive experiments conducted on MSCOCO show that our SAT method improves C-Transformer (XE) on CIDEr-D score from 118.6 to 130.1 in the length-control task and from 132.2 to 142.7 in the tense-control task, while maintaining more than 99$\%$ matching accuracy with the control signal. 2) We introduce a new control signal: sentence quality. Equipped with it, CIC models are able to generate captions of different quality levels as needed. Experiments show that without additional information of ground truth captions, models controlled by the highest level of sentence quality perform much better in accuracy than baseline models.

超分辨率|去噪|去模糊|去雾(2篇)

【1】 Dynamic Slimmable Denoising Network 标题:动态可伸缩去噪网络 链接:https://arxiv.org/abs/2110.08940

作者:Zutao Jiang,Changlin Li,Xiaojun Chang,Jihua Zhu,Yi Yang 机构: Chang is with the School of Computing Technologies, RMIT University 备注:11 pages 摘要:近年来,大量人工设计和自动搜索的神经网络被应用于图像去噪。然而,以前的工作打算在预定义的静态网络结构中处理所有噪声图像,这不可避免地导致高计算复杂度以获得良好的去噪质量。在这里,我们提出了动态可精简去噪网络(DDS-Net),这是一种通用的方法,通过在测试时针对不同的噪声图像动态调整网络的通道配置,以较低的计算复杂度实现良好的去噪质量。我们的DDS-Net通过一个动态门被赋予了动态推理的能力,它可以预测性地调整网络的通道配置,而额外的计算成本可以忽略不计。为了保证每个候选子网的性能和动态门的公平性,我们提出了一种三阶段优化方案。在第一阶段,我们训练了一个权重共享的可精简超级网络。在第二阶段,我们以迭代的方式评估经过训练的可精简超级网络,并在去噪质量下降最小的前提下逐步调整每层的通道数。在不同的通道配置下,通过一次传递,我们可以获得多个性能良好的子网。在最后一个阶段,我们在线识别容易和困难的样本,并训练一个动态门来针对不同的噪声图像预测性地选择相应的子网络。大量实验表明,我们的DDS-Net始终优于最先进的单独训练静态去噪网络。 摘要:Recently, tremendous human-designed and automatically searched neural networks have been applied to image denoising. However, previous works intend to handle all noisy images in a pre-defined static network architecture, which inevitably leads to high computational complexity for good denoising quality. Here, we present dynamic slimmable denoising network (DDS-Net), a general method to achieve good denoising quality with less computational complexity, via dynamically adjusting the channel configurations of networks at test time with respect to different noisy images. Our DDS-Net is empowered with the ability of dynamic inference by a dynamic gate, which can predictively adjust the channel configuration of networks with negligible extra computation cost. To ensure the performance of each candidate sub-network and the fairness of the dynamic gate, we propose a three-stage optimization scheme. In the first stage, we train a weight-shared slimmable super network. In the second stage, we evaluate the trained slimmable super network in an iterative way and progressively tailor the channel numbers of each layer with minimal denoising quality drop. By a single pass, we can obtain several sub-networks with good performance under different channel configurations. In the last stage, we identify easy and hard samples in an online way and train a dynamic gate to predictively select the corresponding sub-network with respect to different noisy images. Extensive experiments demonstrate our DDS-Net consistently outperforms the state-of-the-art individually trained static denoising networks.
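
"动态门根据输入预测通道配置档位"的骨架可示意如下(Python/PyTorch)。门网络结构、档位划分均为假设; 实际训练门还需可微的选择机制(如Gumbel-softmax), 此处未展示:

import torch
import torch.nn as nn

class DynamicGate(nn.Module):
    def __init__(self, n_configs=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, 64), nn.ReLU(),
            nn.Linear(64, n_configs))          # 每个输出对应一种通道配置档位

    def forward(self, x):
        return self.net(x).argmax(dim=1)       # 门本身极小, 额外计算量可忽略

widths = [0.25, 0.5, 0.75, 1.0]                # 可精简超网络的宽度档位(示例值)
x = torch.randn(1, 3, 128, 128)
cfg = DynamicGate()(x).item()
print(f"selected width: {widths[cfg]}")        # 容易样本选小子网, 困难样本选大子网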

【2】 An Analysis and Implementation of the HDR+ Burst Denoising Method 标题:HDR+突发去噪方法的分析与实现 链接:https://arxiv.org/abs/2110.09354

作者:Antoine Monod,Julie Delon,Thomas Veit 机构: Universit´e de Paris 备注:None 摘要:HDR+是谷歌在2016年推出的图像处理管道。其核心是一种去噪算法,该算法使用一组原始图像生成一幅更高质量的图像。由于它是为智能手机摄像头设计的多功能解决方案,因此它不一定以标准去噪指标的最大化为目标,而是为了生成自然、视觉愉悦的图像。在本文中,我们具体讨论和分析了HDR+突发去噪算法的体系结构及其各种参数的影响。在本出版物中,我们提供了该算法的一个开源Python实现,以及一个交互式演示。 摘要:HDR+ is an image processing pipeline presented by Google in 2016. At its core lies a denoising algorithm that uses a burst of raw images to produce a single higher quality image. Since it is designed as a versatile solution for smartphone cameras, it does not necessarily aim for the maximization of standard denoising metrics, but rather for the production of natural, visually pleasing images. In this article, we specifically discuss and analyze the HDR+ burst denoising algorithm architecture and the impact of its various parameters. With this publication, we provide an open source Python implementation of the algorithm, along with an interactive demo.
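
连拍去噪的整体流程可以用下面的最小骨架体会(Python/NumPy)。注意: 真实的HDR+在对齐后于频域做逐块维纳融合, 此处仅演示"以参考帧为基准做稳健加权平均"的思想, 对齐步骤与噪声参数均为演示假设:

import numpy as np

def merge_burst(frames, sigma=0.05):
    """frames: (N, H, W) 已对齐的raw连拍; sigma: 噪声水平估计。"""
    ref = frames[0]
    acc, weight = ref.copy(), np.ones_like(ref)
    for alt in frames[1:]:
        diff2 = (alt - ref) ** 2
        # 与参考帧差异越大(可能错位或存在运动), 该帧该像素的权重越小
        w = np.exp(-diff2 / (2 * sigma ** 2))
        acc += w * alt
        weight += w
    return acc / weight

burst = np.random.rand(8, 64, 64) * 0.1 + 0.5   # 玩具数据: 8帧近似静态场景
clean = merge_burst(burst)                       # (64, 64), 噪声显著降低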

点云|SLAM|雷达|激光|深度RGBD相关(3篇)

【1】 Deep Models with Fusion Strategies for MVP Point Cloud Registration 标题:基于融合策略的MVP点云配准深度模型 链接:https://arxiv.org/abs/2110.09129

作者:Lifa Zhu,Changwei Lin,Dongrui Liu,Xin Li,Francisco Gómez-Fernández 机构:DeepGlint, Beijing, China, Shanghai Jiao Tong University, Shanghai, China, Sichuan University, Chengdu, China, Francisco G´omez-Fern´andez 备注:Point cloud registration competition, ICCV21 workshop. Substantial text overlap with arXiv:2107.02583 摘要:在多视图局部(MVP)挑战赛2021中,点云配准的主要目标是估计刚性变换以对齐点云对。本次比赛中的配对具有低重叠、密度不均匀、旋转不受限制和模糊等特点,这对配准任务提出了巨大挑战。在本报告中,我们介绍了配准任务的解决方案,该方案以定制的集成策略融合了两种深度学习模型:ROPNet和PREDATOR。最后,在Rot_Error、Trans_Error和MSE指标下,我们分别取得2.96546、0.02632和0.07808的成绩,获得配准赛道第二名。 摘要:The main goal of point cloud registration in Multi-View Partial (MVP) Challenge 2021 is to estimate a rigid transformation to align a point cloud pair. The pairs in this competition have the characteristics of low overlap, non-uniform density, unrestricted rotations and ambiguity, which pose a huge challenge to the registration task. In this report, we introduce our solution to the registration task, which fuses two deep learning models: ROPNet and PREDATOR, with customized ensemble strategies. Finally, we achieved the second place in the registration track with 2.96546, 0.02632 and 0.07808 under the the metrics of Rot\_Error, Trans\_Error and MSE, respectively.

【2】 Patch-Based Deep Autoencoder for Point Cloud Geometry Compression 标题:用于点云几何压缩的基于面片的深度自动编码器 链接:https://arxiv.org/abs/2110.09109

作者:Kang You,Pan Gao 机构:Nanjing University of Aeronautics and Astronautics, Nanjing, China 备注:Accepted to ACM Multimedia Asia (MMAsia '21) 摘要:日益增长的3D应用使得点云压缩变得前所未有的重要和必要。在本文中,我们提出了一种基于面片的深度学习压缩过程,重点是有损点云几何压缩。与现有点云压缩网络在整个点云上应用特征提取和重建不同,我们将点云划分为块,并独立压缩每个块。在解码过程中,我们最终将解压缩的补丁组装成一个完整的点云。此外,我们通过一个面片到面片的准则来训练我们的网络,即使用局部重建损失进行优化,以逼近全局重建最优性。我们的方法在率失真性能方面优于最新技术,尤其是在低比特率下。此外,我们提出的压缩过程可以保证生成与输入相同数量的点。该方法的网络模型可以方便地应用于其他点云重建问题,如上采样等。 摘要:The ever-increasing 3D application makes the point cloud compression unprecedentedly important and needed. In this paper, we propose a patch-based compression process using deep learning, focusing on the lossy point cloud geometry compression. Unlike existing point cloud compression networks, which apply feature extraction and reconstruction on the entire point cloud, we divide the point cloud into patches and compress each patch independently. In the decoding process, we finally assemble the decompressed patches into a complete point cloud. In addition, we train our network by a patch-to-patch criterion, i.e., use the local reconstruction loss for optimization, to approximate the global reconstruction optimality. Our method outperforms the state-of-the-art in terms of rate-distortion performance, especially at low bitrates. Moreover, the compression process we proposed can guarantee to generate the same number of points as the input. The network model of this method can be easily applied to other point cloud reconstruction problems, such as upsampling.
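
文中"逐面片"训练准则所用的局部重建损失, 通常以面片级Chamfer距离这类点集距离来实现; 下面给出一个朴素O(N·M)的示意(Python/PyTorch), 具体损失形式为常见假设, 并非论文源码:

import torch

def chamfer_distance(p, q):
    """p: (B, N, 3) 原始面片点; q: (B, M, 3) 解压后的面片点。"""
    d = torch.cdist(p, q)                     # (B, N, M) 两两欧氏距离
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

patch_in = torch.randn(4, 256, 3)
patch_out = torch.randn(4, 256, 3, requires_grad=True)
loss = chamfer_distance(patch_in, patch_out)  # 对每个面片独立优化, 以逼近全局重建最优
loss.backward()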

【3】 Accurate and Robust Object-oriented SLAM with 3D Quadric Landmark Construction in Outdoor Environment 标题:室外环境中基于三维二次曲面地标构建的精确鲁棒面向对象SLAM 链接:https://arxiv.org/abs/2110.08977

作者:Rui Tian,Yunzhou Zhang,Yonghui Feng,Linghao Yang,Zhenzhong Cao,Sonya Coleman,Dermot Kerr 机构: Ulster University 备注:Submitting to RA-L 摘要:面向对象的SLAM是自动驾驶和机器人技术中的一种流行技术。在本文中,我们提出了一种立体视觉SLAM与鲁棒二次地标表示方法。该系统由四个部分组成,包括深度学习检测、面向对象的数据关联、双二次地标初始化和基于对象的姿态优化。基于二次曲面的SLAM算法一直面临着观测相关的问题,并且对观测噪声非常敏感,这限制了其在室外场景中的应用。针对这一问题,提出了一种基于二次参数解耦的二次初始化方法,提高了对观测噪声的鲁棒性。充分的对象数据关联算法和具有多个线索的面向对象优化能够实现对局部观测具有鲁棒性的高度精确的对象姿势估计。实验结果表明,该系统对观测噪声具有较强的鲁棒性,在室外环境下的性能明显优于现有的方法。此外,该系统还具有实时性。 摘要:Object-oriented SLAM is a popular technology in autonomous driving and robotics. In this paper, we propose a stereo visual SLAM with a robust quadric landmark representation method. The system consists of four components, including deep learning detection, object-oriented data association, dual quadric landmark initialization and object-based pose optimization. State-of-the-art quadric-based SLAM algorithms always face observation related problems and are sensitive to observation noise, which limits their application in outdoor scenes. To solve this problem, we propose a quadric initialization method based on the decoupling of the quadric parameters method, which improves the robustness to observation noise. The sufficient object data association algorithm and object-oriented optimization with multiple cues enables a highly accurate object pose estimation that is robust to local observations. Experimental results show that the proposed system is more robust to observation noise and significantly outperforms current state-of-the-art methods in outdoor environments. In addition, the proposed system demonstrates real-time performance.

多模态(1篇)

【1】 Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals 标题:通过对多模态教学手册排序来理解过程性知识 链接:https://arxiv.org/abs/2110.08486

作者:Te-Lin Wu,Alex Spangher,Pegah Alipoormolabashi,Marjorie Freedman,Ralph Weischedel,Nanyun Peng 机构:University of California, Los Angeles,Information Sciences Institute, University of Southern California, Sharif University of Technology 摘要:对无序事件进行排序的能力是理解和推理现实世界任务程序的一项基本技能,这通常需要彻底理解时间常识和多模态信息,因为这些程序通常通过文本和图像的组合进行沟通。这种能力对于顺序任务规划和多源指令摘要等应用至关重要。虽然人类能够对无序的多模态程序指令进行推理和排序,但当前的机器学习模型是否具有这种基本能力仍然是一个悬而未决的问题。在这项工作中,我们通过整理流行在线教学手册中的数据集和收集全面的人类注释,对模型对无序多模态指令进行推理和排序的能力进行基准测试。我们发现模型不仅表现得比人类差得多,而且似乎无法有效地利用多模态信息。为了提高机器在多模式事件排序方面的性能,我们提出了序列感知预训练技术,该技术利用文本和图像的顺序对齐特性,显著提高了5%以上。 摘要:The ability to sequence unordered events is an essential skill to comprehend and reason about real world task procedures, which often requires thorough understanding of temporal common sense and multimodal information, as these procedures are often communicated through a combination of texts and images. Such capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans are capable of reasoning about and sequencing unordered multimodal procedural instructions, whether current machine learning models have such essential capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find models not only perform significantly worse than humans but also seem incapable of efficiently utilizing the multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both texts and images, resulting in > 5% significant improvements.

3D|3D重建等相关(1篇)

【1】 FAST3D: Flow-Aware Self-Training for 3D Object Detectors 标题:FAST3D:三维物体探测器的流感知自训练 链接:https://arxiv.org/abs/2110.09355

作者:Christian Fruhwirth-Reisinger,Michael Opitz,Horst Possegger,Horst Bischof 机构: Christian Doppler Laboratory for, Embedded Machine Learning, Institute of Computer Graphics and, Vision, Graz University of Technology 备注:Accepted to BMVC 2021 摘要:在自主驾驶领域,自训练被广泛应用于缓解基于激光雷达的三维目标探测器的分布偏移。这样,每当环境发生变化时(例如地理位置、传感器设置、天气状况),就不需要昂贵、高质量的标签。然而,最先进的自我训练方法大多忽略了自动驾驶数据的时间特性。为了解决这个问题,我们提出了一种流感知自训练方法,该方法能够在连续激光雷达点云上对三维目标探测器进行无监督的域自适应。为了获得可靠的伪标签,我们利用场景流通过时间传播检测。特别是,我们介绍了一种基于流的多目标跟踪器,该跟踪器利用流的一致性来过滤和细化生成的轨迹。然后,出现的精确伪标签将作为模型重新训练的基础。从预先训练好的KITTI模型开始,我们在具有挑战性的Waymo开放数据集上进行实验,以证明我们方法的有效性。在没有任何目标领域的先验知识的情况下,我们的结果显示了比最新技术的显著改进。 摘要:In the field of autonomous driving, self-training is widely applied to mitigate distribution shifts in LiDAR-based 3D object detectors. This eliminates the need for expensive, high-quality labels whenever the environment changes (e.g., geographic location, sensor setup, weather condition). State-of-the-art self-training approaches, however, mostly ignore the temporal nature of autonomous driving data. To address this issue, we propose a flow-aware self-training method that enables unsupervised domain adaptation for 3D object detectors on continuous LiDAR point clouds. In order to get reliable pseudo-labels, we leverage scene flow to propagate detections through time. In particular, we introduce a flow-based multi-target tracker, that exploits flow consistency to filter and refine resulting tracks. The emerged precise pseudo-labels then serve as a basis for model re-training. Starting with a pre-trained KITTI model, we conduct experiments on the challenging Waymo Open Dataset to demonstrate the effectiveness of our approach. Without any prior target domain knowledge, our results show a significant improvement over the state-of-the-art.

其他神经网络|深度学习|模型|建模(18篇)

【1】 Discovering and Achieving Goals via World Models 标题:通过世界模型发现和实现目标 链接:https://arxiv.org/abs/2110.09514

作者:Russell Mendonca,Oleh Rybkin,Kostas Daniilidis,Danijar Hafner,Deepak Pathak 机构:Carnegie Mellon University, University of Pennsylvania, University of Toronto 备注:NeurIPS 2021. First two authors contributed equally. Website at this https URL 摘要:人工智能体如何在没有任何监督的情况下学会在复杂的视觉环境中解决许多不同的任务?我们将这个问题分解为两个子问题:发现新的目标和学习可靠地实现它们。我们介绍了潜在探索者成就者(LEXA),这是一个统一的解决方案,它从图像输入中学习世界模型,并使用它通过想象的轨迹推演(rollouts)训练探索者和成就者策略。与之前通过到达先前访问过的状态进行探索的方法不同,探索者通过预见有计划地发现未见的令人惊讶的状态,然后将这些状态作为多样化的目标供成就者练习。在无监督阶段之后,LEXA能够以零样本(zero-shot)方式解决以目标图像指定的任务,无需任何额外学习。无论是在已有基准上,还是在横跨四个标准机器人操作与移动领域、共40项测试任务的新挑战性基准上,LEXA都大大优于以前的无监督目标达成方法。LEXA进一步能实现需要按顺序与多个对象交互的目标。最后,为了展示LEXA的可伸缩性和通用性,我们在四个不同的环境中训练了一个通用代理。代码和视频在https://orybkin.github.io/lexa/ 摘要:How can artificial agents learn to solve many diverse tasks in complex visual environments in the absence of any supervision? We decompose this question into two problems: discovering new goals and learning to reliably achieve them. We introduce Latent Explorer Achiever (LEXA), a unified solution to these that learns a world model from image inputs and uses it to train an explorer and an achiever policy from imagined rollouts. Unlike prior methods that explore by reaching previously visited states, the explorer plans to discover unseen surprising states through foresight, which are then used as diverse targets for the achiever to practice. After the unsupervised phase, LEXA solves tasks specified as goal images zero-shot without any additional learning. LEXA substantially outperforms previous approaches to unsupervised goal-reaching, both on prior benchmarks and on a new challenging benchmark with a total of 40 test tasks spanning across four standard robotic manipulation and locomotion domains. LEXA further achieves goals that require interacting with multiple objects in sequence. Finally, to demonstrate the scalability and generality of LEXA, we train a single general agent across four distinct environments. Code and videos at https://orybkin.github.io/lexa/

【2】 Learning in High Dimension Always Amounts to Extrapolation 标题:高维学习总是等同于外推 链接:https://arxiv.org/abs/2110.09485

作者:Randall Balestriero,Jerome Pesenti,Yann LeCun 机构:Facebook AI Research,NYU 摘要:插值和外推的概念在从深度学习到函数逼近的各个领域都是基本的。每当样本$x$落在给定数据集凸包的内部或边界上时,就会对该样本进行插值。当$x$落在凸包之外时,会发生外推。一个基本的误解是,最先进的算法之所以表现出色,是因为它们能够正确地插值训练数据。第二个误解是插值普遍存在于各类任务和数据集中,事实上,许多直觉和理论都依赖于该假设。我们从经验和理论上反驳这两点,并证明在任何高维($>$100)数据集上,插值几乎肯定不会发生。这些结果挑战了以我们当前的插值/外推定义作为泛化性能指标的有效性。 摘要:The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample $x$ whenever this sample falls inside or on the boundary of the given dataset's convex hull. Extrapolation occurs when $x$ falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets, in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional ($>$100) dataset, interpolation almost surely never happens. Those results challenge the validity of our current interpolation/extrapolation definition as an indicator of generalization performances.
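
按照文中的定义, 判断某样本是否处于"插值"状态等价于检验它是否在训练集凸包内, 这可以用线性规划求解: 是否存在凸组合系数w使得X^T·w = x, w >= 0, sum(w) = 1。下面给出一个可直接运行的示意(Python/SciPy), 样本数与维度为示例值:

import numpy as np
from scipy.optimize import linprog

def in_convex_hull(X, x):
    """X: (n, d) 训练样本; x: (d,) 查询点。可行即为插值, 否则为外推。"""
    n = X.shape[0]
    A_eq = np.vstack([X.T, np.ones((1, n))])          # d+1个等式约束
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

rng = np.random.default_rng(0)
for d in [2, 10, 100]:
    X = rng.normal(size=(500, d))
    hits = sum(in_convex_hull(X, rng.normal(size=d)) for _ in range(50))
    print(d, hits)    # 维度升高后命中数迅速趋于0, 呼应"高维学习总在外推"的结论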

【3】 No RL, No Simulation: Learning to Navigate without Navigating 标题:没有RL,没有模拟:学习在没有导航的情况下导航 链接:https://arxiv.org/abs/2110.09470

作者:Meera Hahn,Devendra Chaplot,Shubham Tulsiani,Mustafa Mukadam,James M. Rehg,Abhinav Gupta 机构:Georgia Institute of Technology, Facebook AI Research 摘要:大多数以前学习导航策略的方法都需要访问模拟环境,因为它们需要在线策略交互,并依赖地面真实地图获得奖励。然而,构建模拟器的成本很高(每个场景都需要人工操作),并且由于sim-to-real域之间的差距,将学到的策略迁移到现实世界的机器人平台也面临挑战。在本文中,我们提出了一个简单的问题:为了解决图像目标导航任务,我们真的需要主动交互、地面真实地图甚至强化学习(RL)吗?我们提出了一种仅从被动漫游视频中学习导航的自监督方法。我们的方法"无RL、无模拟器"(NRNS)简单且可扩展,同时非常有效。NRNS的性能显著优于基于RL的方案。我们将NRNS作为未来任何使用RL或模拟的基于图像的导航任务的强大基线。 摘要:Most prior methods for learning navigation policies require access to simulation environments, as they need online policy interaction and rely on ground-truth maps for rewards. However, building simulators is expensive (requires manual effort for each and every scene) and creates challenges in transferring learned policies to robotic platforms in the real-world, due to the sim-to-real domain gap. In this paper, we pose a simple question: Do we really need active interaction, ground-truth maps or even reinforcement-learning (RL) in order to solve the image-goal navigation task? We propose a self-supervised approach to learn to navigate from only passive videos of roaming. Our approach, No RL, No Simulator (NRNS), is simple and scalable, yet highly effective. NRNS outperforms RL-based formulations by a significant margin. We present NRNS as a strong baseline for any future image-based navigation tasks that use RL or Simulation.

【4】 TLDR: Twin Learning for Dimensionality Reduction 标题:TLDR:用于降维的孪生学习 链接:https://arxiv.org/abs/2110.09455

作者:Yannis Kalantidis,Carlos Lassance,Jon Almazan,Diane Larlus 机构:Jon Almazán, NAVER LABS Europe 备注:Code available at: this https URL 摘要:降维方法是一种无监督的方法,用于学习低维空间,在低维空间中保留初始空间的某些属性,通常是"邻域"的概念。它们是可视化、压缩、索引和检索等多种任务的重要组成部分。针对一个完全不同的目标,自监督视觉表示学习已被证明可以通过学习对人为失真(例如一组手工设计的图像变换)保持不变的模型,得到可迁移的表示函数。与通常需要在大型k-NN图上传播或依赖复杂优化求解器的流形学习方法不同,自监督学习方法依赖于更简单、更可伸缩的学习框架。在本文中,我们从流形学习的角度统一了这两类方法,并提出了TLDR,一种面向通用输入空间的降维方法,它将Barlow Twins的简单自监督学习框架移植到难以或不可能手动定义适当失真集的环境中。我们建议使用最近邻从训练集中构建样本对,并借用自监督文献中的冗余消减损失来学习一个编码器,使其在这些样本对之间产生不变的表示。TLDR方法简单,易于实现和训练,适用性广;它包括一个可高度近似的离线最近邻计算步骤,以及一个不需要挖掘负样本进行对比、不需要特征分解或繁琐优化求解器的简单学习过程。通过将PCA替换为TLDR,我们能够在128维下将GeM-AP的性能提升4% mAP,并在维度减少16倍的情况下保持其性能。 摘要:Dimensionality reduction methods are unsupervised approaches which learn low-dimensional spaces where some properties of the initial space, typically the notion of "neighborhood", are preserved. They are a crucial component of diverse tasks like visualization, compression, indexing, and retrieval. Aiming for a totally different goal, self-supervised visual representation learning has been shown to produce transferable representation functions by learning models that encode invariance to artificially created distortions, e.g. a set of hand-crafted image transformations. Unlike manifold learning methods that usually require propagation on large k-NN graphs or complicated optimization solvers, self-supervised learning approaches rely on simpler and more scalable frameworks for learning. In this paper, we unify these two families of approaches from the angle of manifold learning and propose TLDR, a dimensionality reduction method for generic input spaces that is porting the simple self-supervised learning framework of Barlow Twins to a setting where it is hard or impossible to define an appropriate set of distortions by hand. We propose to use nearest neighbors to build pairs from a training set and a redundancy reduction loss borrowed from the self-supervised literature to learn an encoder that produces representations invariant across such pairs. TLDR is a method that is simple, easy to implement and train, and of broad applicability; it consists of an offline nearest neighbor computation step that can be highly approximated, and a straightforward learning process that does not require mining negative samples to contrast, eigendecompositions, or cumbersome optimization solvers. By replacing PCA with TLDR, we are able to increase the performance of GeM-AP by 4% mAP for 128 dimensions, and to retain its performance with 16x fewer dimensions.
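
TLDR所借用的Barlow Twins式冗余消减损失, 作用于(样本, 其最近邻)经同一编码器得到的两组嵌入: 互相关矩阵的对角线拉向1, 非对角线压向0。下面给出一个最小示意(Python/PyTorch), 权重lambd等取值为常见假设:

import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    """z1, z2: (B, D), 分别为样本与其最近邻的嵌入。"""
    B, D = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)   # 按批维标准化
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / B                           # (D, D) 互相关矩阵
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag

z = torch.randn(128, 32, requires_grad=True)
z_nn = z + 0.1 * torch.randn(128, 32)             # 玩具"最近邻"嵌入
barlow_twins_loss(z, z_nn).backward()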

【5】 Natural Image Reconstruction from fMRI using Deep Learning: A Survey 标题:基于深度学习的fMRI自然图像重建研究综述 链接:https://arxiv.org/abs/2110.09006

作者:Zarina Rakhimberdina,Quentin Jodelet,Xin Liu,Tsuyoshi Murata 机构:Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan; Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan 摘要:随着脑成像技术和机器学习工具的出现,人们致力于建立计算模型,以捕获人脑中视觉信息的编码。最具挑战性的大脑解码任务之一,是从功能磁共振成像(fMRI)测量的大脑活动中准确重建所感知的自然图像。在这项工作中,我们综述了用于从fMRI重建自然图像的最新深度学习方法。我们从体系结构设计、基准数据集和评估指标等方面对这些方法进行了考察,并在标准化的评估指标下给出公平的性能比较。最后,我们讨论了现有研究的优势和局限性以及未来可能的发展方向。 摘要:With the advent of brain imaging techniques and machine learning tools, much effort has been devoted to building computational models to capture the encoding of visual information in the human brain. One of the most challenging brain decoding tasks is the accurate reconstruction of the perceived natural images from brain activities measured by functional magnetic resonance imaging (fMRI). In this work, we survey the most recent deep learning methods for natural image reconstruction from fMRI. We examine these methods in terms of architectural design, benchmark datasets, and evaluation metrics and present a fair performance evaluation across standardized evaluation metrics. Finally, we discuss the strengths and limitations of existing studies and present potential future directions.

【6】 Predicting Rebar Endpoints using Sin Exponential Regression Model 标题:基于Sin指数回归模型的钢筋终点预测 链接:https://arxiv.org/abs/2110.08955

作者:Jong-Chan Park,Hye-Youn Lim,Dae-Seong Kang 机构:Department of Electronic Engineering, Dong-A University, Republic of Korea 摘要:目前,为了最大限度地降低钢筋生产的损失率,并缩短钢筋加工厂切割过程中出现缺陷产品时校准所需的时间、提高其准确性,无人自动化研究正在进行中。在本文中,我们提出了一种方法:基于YOLO(You Only Look Once)v3检测并跟踪进入机器视觉摄像机画面的钢筋端点,再对获得的坐标做正弦指数回归,以提前预测钢筋端点。该方法解决了同样对钢筋端点做预先预测的OPPDet(Object Position Prediction Detect)模型在钢筋端点较远的帧上位置预测误差较大的问题;改进后的结果显示,正弦指数回归预测点的误差率降低了0.23%至0.52%。 摘要:Currently, unmanned automation studies are underway to minimize the loss rate of rebar production and the time and accuracy of calibration when producing defective products in the cutting process of processing rebar factories. In this paper, we propose a method to detect and track rebar endpoint images entering the machine vision camera based on YOLO (You Only Look Once)v3, and to predict rebar endpoint in advance with sin exponential regression of acquired coordinates. The proposed method solves the problem of large prediction error rates for frame locations where rebar endpoints are far away in OPPDet (Object Position Prediction Detect) models, which prepredict rebar endpoints with improved results showing 0.23 to 0.52% less error rates at sin exponential regression prediction points.
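
摘要没有给出“正弦指数回归”的具体函数形式;作为示意,下面用SciPy对一个假设形式(指数包络乘正弦项再加偏置)做拟合与外推,其中合成数据与初值p0均为演示用占位,并非论文实验设置。

```python
import numpy as np
from scipy.optimize import curve_fit

def sin_exp(t, a, b, c, d, e):
    # 假设的"正弦指数"形式:指数包络乘以正弦项,再加偏置
    return a * np.exp(b * t) * np.sin(c * t + d) + e

# 由逐帧跟踪得到的钢筋端点x坐标序列(此处用带噪合成数据占位)
t = np.linspace(0, 2, 40)
x = 5 * np.exp(0.3 * t) * np.sin(2.0 * t + 0.5) + 100 + np.random.normal(0, 0.2, t.size)

popt, _ = curve_fit(sin_exp, t, x, p0=[5, 0.3, 2.0, 0.5, 100], maxfev=10000)
print("外推预测的端点坐标:", sin_exp(2.5, *popt))  # 对未来时刻提前预测
```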

【7】 Network Augmentation for Tiny Deep Learning 标题:用于微深度学习的网络增强 链接:https://arxiv.org/abs/2110.08890

作者:Han Cai,Chuang Gan,Ji Lin,Song Han 机构:Massachusetts Institute of Technology, MIT-IBM Watson AI Lab 摘要:我们介绍了一种新的训练方法,即网络增广(Network Augmentation,NetAug),以提高微型神经网络的性能。现有的正则化技术(例如数据增强、dropout)通过添加噪声克服过拟合,在大型神经网络(例如ResNet50)上取得了很大的成功。然而,我们发现这些技术损害了微型神经网络的性能。我们认为,训练微小模型不同于大型模型:与其增广数据,不如增广模型,因为微小模型容量有限,往往存在欠拟合而不是过拟合的问题。为了缓解此问题,NetAug对网络进行增广(反向dropout),而不是把噪声注入数据集或网络。它把微小模型放入更大的模型中,鼓励它在作为独立模型工作之外,还作为更大模型的子模型工作以获得额外监督。在测试时,仅使用微小模型进行推理,因此推理开销为零。我们展示了NetAug在图像分类和目标检测上的有效性。NetAug持续改进微型模型的性能,在ImageNet上实现了高达2.1%的精度改进,在Cars数据集上实现了4.3%的精度改进。在Pascal VOC上,NetAug以相同的计算成本带来2.96%的mAP改进。 摘要:We introduce Network Augmentation (NetAug), a new training method for improving the performance of tiny neural networks. Existing regularization techniques (e.g., data augmentation, dropout) have shown much success on large neural networks (e.g., ResNet50) by adding noise to overcome over-fitting. However, we found these techniques hurt the performance of tiny neural networks. We argue that training tiny models are different from large models: rather than augmenting the data, we should augment the model, since tiny models tend to suffer from under-fitting rather than over-fitting due to limited capacity. To alleviate this issue, NetAug augments the network (reverse dropout) instead of inserting noise into the dataset or the network. It puts the tiny model into larger models and encourages it to work as a sub-model of larger models to get extra supervision, in addition to functioning as an independent model. At test time, only the tiny model is used for inference, incurring zero inference overhead. We demonstrate the effectiveness of NetAug on image classification and object detection. NetAug consistently improves the performance of tiny models, achieving up to 2.1% accuracy improvement on ImageNet, and 4.3% on Cars. On Pascal VOC, NetAug provides 2.96% mAP improvement with the same computational cost.
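
下面是一个概念性的PyTorch草图,示意NetAug“增广模型而非数据”的训练方式:小模型既作为独立模型接受监督,也作为带额外通道的大模型的子模型接受监督;推理时只用小模型分支,因此没有额外推理开销。结构与通道宽度均为演示用假设,并非论文实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentedNet(nn.Module):
    """把小模型嵌入更大的模型:增广分支只在训练时提供额外监督(概念示意)。"""
    def __init__(self, width=16, extra=16, num_classes=10):
        super().__init__()
        self.tiny = nn.Conv2d(3, width, 3, padding=1)   # 小模型自身的通道
        self.aug = nn.Conv2d(3, extra, 3, padding=1)    # 增广出来的额外通道
        self.head_tiny = nn.Linear(width, num_classes)
        self.head_aug = nn.Linear(width + extra, num_classes)

    def forward(self, x, augmented=False):
        f = F.adaptive_avg_pool2d(F.relu(self.tiny(x)), 1).flatten(1)
        if not augmented:                               # 推理时只用小模型
            return self.head_tiny(f)
        g = F.adaptive_avg_pool2d(F.relu(self.aug(x)), 1).flatten(1)
        return self.head_aug(torch.cat([f, g], dim=1))

model = AugmentedNet()
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x, augmented=True), y)
loss.backward()  # 小模型同时以独立模型和大模型子模型两种身份接受监督
```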

【8】 Online Continual Learning Via Candidates Voting 标题:通过考生投票实现在线持续学习 链接:https://arxiv.org/abs/2110.08855

作者:Jiangpeng He,Fengqing Zhu 机构:School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana USA 备注:Accepted paper at Winter Conference on Applications of Computer Vision (WACV 2022) 摘要:在线场景中的持续学习旨在从数据流中学习一系列新任务,每个任务的数据只使用一次进行训练,这比假设新任务数据全部可用的离线模式更现实。然而,在具有挑战性的类增量设置(模型在推理时要对迄今为止见过的所有类进行分类)下,该问题仍然没有得到充分探讨。特别是,随着任务数量增加或每个任务需要学习的类别增多,性能会明显吃力。此外,大多数现有方法需要存储原始数据作为知识重放的范例,这对于内存预算有限或涉及隐私的某些应用来说可能是不可行的。在这项工作中,我们介绍了一种有效且内存高效的类增量在线持续学习方法:从每个已学任务中进行候选选择,并结合先验信息,使用存储的特征嵌入而非原始数据作为范例。我们针对图像分类任务实现了所提出的方法,在包括CIFAR-10、CIFAR-100和CORE-50在内的多个在线持续学习基准数据集上取得了最佳结果,同时与现有工作相比所需的内存资源要少得多。 摘要:Continual learning in online scenario aims to learn a sequence of new tasks from data stream using each data only once for training, which is more realistic than in offline mode assuming data from new task are all available. However, this problem is still under-explored for the challenging class-incremental setting in which the model classifies all classes seen so far during inference. Particularly, performance struggles with increased number of tasks or additional classes to learn for each task. In addition, most existing methods require storing original data as exemplars for knowledge replay, which may not be feasible for certain applications with limited memory budget or privacy concerns. In this work, we introduce an effective and memory-efficient method for online continual learning under class-incremental setting through candidates selection from each learned task together with prior incorporation using stored feature embeddings instead of original data as exemplars. Our proposed method implemented for image classification task achieves the best results under different benchmark datasets for online continual learning including CIFAR-10, CIFAR-100 and CORE-50 while requiring much less memory resource compared with existing works.

【9】 Exploring Novel Pooling Strategies for Edge Preserved Feature Maps in Convolutional Neural Networks 标题:探索卷积神经网络中保边特征映射的新池化策略 链接:https://arxiv.org/abs/2110.08842

作者:Adithya Sineesh,Mahesh Raveendranatha Panicker 机构:National Institute of Technology Trichy, Tamil Nadu, India; Indian Institute of Technology Palakkad, Kerala, India 备注:29 pages, Submitted to Elsevier Pattern Recognition for review 摘要:随着抗混叠卷积神经网络(CNN)的引入,CNN中的池化方式重新受到关注。抗混叠CNN的基本构造块,是在池化操作之前应用高斯平滑,以减少混叠引起的失真,从而使CNN具有平移不变性。基于小波的方法也被提出,以期获得额外的噪声去除能力,甚至在分割任务上也给出了有趣的结果。然而,所有已提出的方法都在“高频分量即噪声”的假设下将高频分量完全去除。但是,去除高频分量的同时,特征图中的边缘也被平滑掉了。在这项工作中,我们对分类、分割和自动编码器任务中的保边池化方案进行了详尽的分析,并提出了两种新的池化方法:带注意力的拉普拉斯-高斯拼接(LGCA)池化,以及基于小波的近似-细节系数带注意力拼接(WADCA)池化。结果表明,所提出的池化方法在分类、分割和自动编码器上均优于传统池化和模糊池化(blur pooling)。 摘要:With the introduction of anti-aliased convolutional neural networks (CNN), there has been some resurgence in relooking the way pooling is done in CNNs. The fundamental building block of the anti-aliased CNN has been the application of Gaussian smoothing before the pooling operation to reduce the distortion due to aliasing thereby making CNNs shift invariant. Wavelet based approaches have also been proposed as a possibility of additional noise removal capability and gave interesting results for even segmentation tasks. However, all the approaches proposed completely remove the high frequency components under the assumption that they are noise. However, by removing high frequency components, the edges in the feature maps are also smoothed. In this work, an exhaustive analysis of the edge preserving pooling options for classification, segmentation and autoencoders are presented. Two novel pooling approaches are presented such as Laplacian-Gaussian Concatenation with Attention (LGCA) pooling and Wavelet based approximate-detailed coefficient concatenation with attention (WADCA) pooling. The results suggest that the proposed pooling approaches outperform the conventional pooling as well as blur pooling for classification, segmentation and autoencoders.
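
按摘要的描述,LGCA池化把高斯平滑(低频)分支与拉普拉斯(保边高频)分支拼接,并施加注意力后再下采样;下面是按此思路写出的概念性PyTorch草图,其中滤波核、注意力形式与下采样方式均为假设,并非论文实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LGCAPool(nn.Module):
    """概念示意:高斯分支与拉普拉斯分支拼接,经通道注意力加权后再下采样。"""
    def __init__(self, channels):
        super().__init__()
        g = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16
        l = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("gauss", g.expand(channels, 1, 3, 3).clone())
        self.register_buffer("lap", l.expand(channels, 1, 3, 3).clone())
        self.attn = nn.Sequential(nn.Linear(2 * channels, 2 * channels), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * channels, channels, 1)
        self.c = channels

    def forward(self, x):
        smooth = F.conv2d(x, self.gauss, padding=1, groups=self.c)  # 抗混叠的低频分支
        edge = F.conv2d(x, self.lap, padding=1, groups=self.c)      # 保边的高频分支
        feat = torch.cat([smooth, edge], dim=1)
        w = self.attn(feat.mean(dim=(2, 3)))[..., None, None]       # 通道注意力权重
        return F.avg_pool2d(self.proj(feat * w), 2)

print(LGCAPool(32)(torch.randn(2, 32, 16, 16)).shape)  # torch.Size([2, 32, 8, 8])
```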

【10】 Compression-aware Projection with Greedy Dimension Reduction for Convolutional Neural Network Activations 标题:卷积神经网络激活的贪婪降维压缩感知投影 链接:https://arxiv.org/abs/2110.08828

作者:Yu-Shan Tai,Chieh-Fang Teng,Cheng-Yang Chang,An-Yeu Wu 机构:Graduate Institute of Electrical Engineering, National Taiwan University, Taipei, Taiwan 备注:5 pages, 5 figures, submitted to 2022 ICASSP 摘要:卷积神经网络(CNN)在广泛的领域取得了显著的性能。然而,激活值的密集内存访问带来了相当大的能耗,阻碍了CNN在资源受限的边缘设备上的部署。现有的激活压缩工作提出对特征图进行变换以提高其可压缩性,从而实现降维。然而,在激进降维的情况下,这些方法会导致严重的精度下降。为了改善分类精度和压缩比之间的权衡,我们提出了一种压缩感知投影系统,该系统采用可学习投影来补偿重建损失。此外,通过同时考虑精度和比特数的减少,引入贪婪选择度量来优化逐层压缩比分配。我们的测试结果表明,所提出的方法在MobileNetV2/ResNet18/VGG16上有效地减少了2.91x~5.97x的内存访问,并且精度下降可以忽略不计。 摘要:Convolutional neural networks (CNNs) achieve remarkable performance in a wide range of fields. However, intensive memory access of activations introduces considerable energy consumption, impeding deployment of CNNs on resourceconstrained edge devices. Existing works in activation compression propose to transform feature maps for higher compressibility, thus enabling dimension reduction. Nevertheless, in the case of aggressive dimension reduction, these methods lead to severe accuracy drop. To improve the trade-off between classification accuracy and compression ratio, we propose a compression-aware projection system, which employs a learnable projection to compensate for the reconstruction loss. In addition, a greedy selection metric is introduced to optimize the layer-wise compression ratio allocation by considering both accuracy and #bits reduction simultaneously. Our test results show that the proposed methods effectively reduce 2.91x~5.97x memory access with negligible accuracy drop on MobileNetV2/ResNet18/VGG16.

【11】 PixelPyramids: Exact Inference Models from Lossless Image Pyramids 标题:像素金字塔:无损图像金字塔的精确推理模型 链接:https://arxiv.org/abs/2110.08787

作者:Shweta Mahajan,Stefan Roth 机构:Department of Computer Science, TU Darmstadt, hessian.AI 备注:To appear at ICCV 2021 摘要:自回归模型是一类精确推理方法,具有高度灵活的函数形式,可产生最先进的自然图像密度估计。然而,对各维度的顺序处理使得这些模型计算成本很高,并把其适用范围局限于低分辨率图像。在这项工作中,我们提出了PixelPyramids,这是一种块自回归方法,采用无损金字塔分解和特定尺度表示来编码图像像素的联合分布。关键的是,与完全自回归方法相比,它提供了更稀疏的依赖结构。我们的PixelPyramids为各种图像数据集(尤其是高分辨率数据)的密度估计提供了最先进的结果。对于CelebA-HQ 1024 x 1024,我们观察到,尽管采样速度甚至优于易于并行化的基于流的模型,密度估计(以比特/维为单位)仍改善至基线的约44%。 摘要:Autoregressive models are a class of exact inference approaches with highly flexible functional forms, yielding state-of-the-art density estimates for natural images. Yet, the sequential ordering on the dimensions makes these models computationally expensive and limits their applicability to low-resolution imagery. In this work, we propose Pixel-Pyramids, a block-autoregressive approach employing a lossless pyramid decomposition with scale-specific representations to encode the joint distribution of image pixels. Crucially, it affords a sparser dependency structure compared to fully autoregressive approaches. Our PixelPyramids yield state-of-the-art results for density estimation on various image datasets, especially for high-resolution data. For CelebA-HQ 1024 x 1024, we observe that the density estimates (in terms of bits/dim) are improved to ~44% of the baseline despite sampling speeds superior even to easily parallelizable flow-based models.

【12】 Fully-Connected Tensor Network Decomposition for Robust Tensor Completion Problem 标题:鲁棒张量完成问题的全连通张量网络分解 链接:https://arxiv.org/abs/2110.08754

作者:Yun-Yang Liu,Xi-Le Zhao,Guang-Jing Song,Yu-Bang Zheng,Ting-Zhu Huang 摘要:鲁棒张量补全(RTC)问题旨在从被稀疏张量污染的部分观测张量中重构低秩张量,正受到越来越多的关注。本文利用全连通张量网络(FCTN)分解优越的表达能力,针对RTC问题提出了一种基于FCTN的鲁棒凸优化模型(RC-FCTN)。然后,我们为RC-FCTN严格建立了精确恢复保证。为了求解约束优化模型RC-FCTN,我们开发了一种基于交替方向乘子法(ADMM)的算法,该算法具有全局收敛保证。此外,我们还为RTC问题提出了一个基于FCTN的鲁棒非凸优化模型(RNC-FCTN),并开发了一种基于邻近交替极小化(PAM)的算法来求解所提出的RNC-FCTN,同时从理论上推导了该基于PAM的算法的收敛性。在视频补全和视频背景减除等多个应用中的综合数值实验表明,所提出的方法优于几种最新的方法。 摘要:The robust tensor completion (RTC) problem, which aims to reconstruct a low-rank tensor from partially observed tensor contaminated by a sparse tensor, has received increasing attention. In this paper, by leveraging the superior expression of the fully-connected tensor network (FCTN) decomposition, we propose a $\textbf{FCTN}$-based $\textbf{r}$obust $\textbf{c}$onvex optimization model (RC-FCTN) for the RTC problem. Then, we rigorously establish the exact recovery guarantee for the RC-FCTN. For solving the constrained optimization model RC-FCTN, we develop an alternating direction method of multipliers (ADMM)-based algorithm, which enjoys the global convergence guarantee. Moreover, we suggest a $\textbf{FCTN}$-based $\textbf{r}$obust $\textbf{n}$on$\textbf{c}$onvex optimization model (RNC-FCTN) for the RTC problem. A proximal alternating minimization (PAM)-based algorithm is developed to solve the proposed RNC-FCTN. Meanwhile, we theoretically derive the convergence of the PAM-based algorithm. Comprehensive numerical experiments in several applications, such as video completion and video background subtraction, demonstrate that proposed methods are superior to several state-of-the-art methods.
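
摘要未给出模型的具体公式;作为示意,这类鲁棒张量补全的凸模型通常可以写成如下形式(低秩度量以论文定义的FCTN范数为准,记号均为假设):

```latex
\min_{\mathcal{L},\,\mathcal{S}}\;
\|\mathcal{L}\|_{\mathrm{FCTN}} + \lambda\,\|\mathcal{S}\|_{1}
\quad \text{s.t.} \quad
\mathcal{P}_{\Omega}(\mathcal{L}+\mathcal{S}) = \mathcal{P}_{\Omega}(\mathcal{O})
```

其中 $\mathcal{O}$ 为观测张量,$\Omega$ 为观测元素的下标集合,低秩部分 $\mathcal{L}$ 为待恢复张量,稀疏部分 $\mathcal{S}$ 吸收稀疏污染;ADMM通过交替更新 $\mathcal{L}$、$\mathcal{S}$ 与乘子来求解该约束问题。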

【13】 SIN:Superpixel Interpolation Network 标题:SIN:超像素插值网络 链接:https://arxiv.org/abs/2110.08702

作者:Qing Yuan,Songfeng Lu,Yan Huang,Wuxin Sha 机构:School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; School of Cyber Science & Engineering, Huazhong University of Science and Technology; Shenzhen Huazhong University of Science and Technology Research Institute 备注:15 pages, 8 figures, to be published in PRICAI-2021 摘要:超像素由于其代表性和计算效率,在计算机视觉任务中得到了广泛的应用。同时,深度学习和端到端框架在包括计算机视觉在内的各个领域都取得了巨大的进展。然而,现有的超像素算法无法以端到端的方式集成到后续任务中。传统算法和基于深度学习的算法是超像素分割的两大主流:前者不可微,后者需要一个不可微的后处理步骤来保证连通性,这都限制了超像素与下游任务的集成。在本文中,我们提出了一种基于深度学习的超像素分割算法SIN,该算法可以以端到端的方式与下游任务集成。由于视觉跟踪等下游任务需要实时速度,生成超像素的速度也很重要。为了消除后处理步骤,我们的算法从一开始就强制保证空间连通性。超像素由采样像素初始化,其他像素通过多个更新步骤分配给超像素。每个步骤包括一次水平插值和一次垂直插值,这是保证空间连通性的关键。我们利用全卷积网络的多层输出来预测插值的关联分数。实验结果表明,我们的方法运行速度约为80fps,与现有方法相比性能良好。此外,我们还设计了一个简单而有效的损失函数,大大减少了训练时间。基于超像素的下游任务的改进证明了算法的有效性。我们希望SIN能够以端到端的方式集成到下游任务中,并使基于超像素的社区受益。代码位于:https://github.com/yuanqqq/SIN 摘要:Superpixels have been widely used in computer vision tasks due to their representational and computational efficiency. Meanwhile, deep learning and end-to-end framework have made great progress in various fields including computer vision. However, existing superpixel algorithms cannot be integrated into subsequent tasks in an end-to-end way. Traditional algorithms and deep learning-based algorithms are two main streams in superpixel segmentation. The former is non-differentiable and the latter needs a non-differentiable post-processing step to enforce connectivity, which constraints the integration of superpixels and downstream tasks. In this paper, we propose a deep learning-based superpixel segmentation algorithm SIN which can be integrated with downstream tasks in an end-to-end way. Owing to some downstream tasks such as visual tracking require real-time speed, the speed of generating superpixels is also important. To remove the post-processing step, our algorithm enforces spatial connectivity from the start. Superpixels are initialized by sampled pixels and other pixels are assigned to superpixels through multiple updating steps. Each step consists of a horizontal and a vertical interpolation, which is the key to enforcing spatial connectivity. Multi-layer outputs of a fully convolutional network are utilized to predict association scores for interpolations. Experimental results show that our approach runs at about 80fps and performs favorably against state-of-the-art methods. Furthermore, we design a simple but effective loss function which reduces much training time. The improvements of superpixel-based tasks demonstrate the effectiveness of our algorithm. We hope SIN will be integrated into downstream tasks in an end-to-end way and benefit the superpixel-based community. Code is available at: https://github.com/yuanqqq/SIN

【14】 BNAS v2: Learning Architectures for Binary Networks with Empirical Improvements 标题:BNAs v2:具有经验改进的二进制网络学习体系结构 链接:https://arxiv.org/abs/2110.08562

作者:Dahyun Kim,Kunal Pratap Singh,Jonghyun Choi 机构:GIST, South Korea, Allen Institute for AI 备注:arXiv admin note: text overlap with arXiv:2002.06963 摘要:大多数二进制网络的主干架构都是众所周知的浮点(FP)架构,如ResNet系列。鉴于为FP网络设计的架构可能不是二进制网络的最佳架构,我们提出通过定义新的二进制架构搜索空间和新的搜索目标来搜索二进制网络架构(BNAS)。具体地说,在基于单元的搜索方法的基础上,我们定义了二进制层类型的新搜索空间,设计了一个新的单元模板,并重新发现了Zeroise层的实用性,提出将其作为实际的层来使用而不仅仅是占位符。新的搜索目标使早期搜索多样化,以学习性能更好的二进制体系结构。我们表明,尽管二进制网络中存在固有的量化误差,我们的方法仍能以稳定的训练曲线搜索体系结构。定量分析表明,我们搜索到的架构优于现有最先进二进制网络所使用的架构,并且优于或媲美那些采用架构改动之外的各种技术的最先进二进制网络。此外,我们还进一步对搜索到的架构提出了训练方案上的改进。对于我们搜索到的体系结构,配合新的训练方案,我们以明显的优势超越所有以前的方法,使二进制网络达到最先进的性能。 摘要:Backbone architectures of most binary networks are well-known floating point (FP) architectures such as the ResNet family. Questioning that the architectures designed for FP networks might not be the best for binary networks, we propose to search architectures for binary networks (BNAS) by defining a new search space for binary architectures and a novel search objective. Specifically, based on the cell based search method, we define the new search space of binary layer types, design a new cell template, and rediscover the utility of and propose to use the Zeroise layer instead of using it as a placeholder. The novel search objective diversifies early search to learn better performing binary architectures. We show that our method searches architectures with stable training curves despite the quantization error inherent in binary networks. Quantitative analyses demonstrate that our searched architectures outperform the architectures used in state-of-the-art binary networks and outperform or perform on par with state-of-the-art binary networks that employ various techniques other than architectural changes. In addition, we further propose improvements to the training scheme of our searched architectures. With the new training scheme for our searched architectures, we achieve the state-of-the-art performance by binary networks by outperforming all previous methods by non-trivial margins.

【15】 Grayscale Based Algorithm for Remote Sensing with Deep Learning 标题:基于灰度的深度学习遥感算法 链接:https://arxiv.org/abs/2110.08493

作者:Sai Ganesh CS,Aouthithiye Barathwaj SR Y,R. Azhagumurugan,R. Swethaa S 机构:Department of Electrical and Electronics Engineering & Department of Electronics and Communication Engineering, Sri Sai Ram Engineering College, Sai Leo Nagar, West Tambaram, Chennai, India 备注:14 pages, 4 figures, Journal Expert Systems with Applications 摘要:遥感是在不与目标进行物理接触的情况下对目标进行的图像采集。目前,遥感数据由于其缩短的图像采集周期而受到广泛的青睐。由于影响光在卫星采集过程中穿过不同介质传播的各种因素,对地面目标的遥感更具挑战性。若干基于卷积神经网络的算法正在遥感领域得到应用。监督学习是一种在训练前就按类别对数据进行标注的机器学习技术。为了更准确地检测和分类目标,我们采用了基于包围盒和锚框的YOLOv3算法。为了处理光在大气中传播的各种影响,引入了基于灰度的YOLOv3配置。为了更好地预测和应对瑞利散射效应,提出了基于RGB的灰度算法。采集的图像用基于灰度的YOLOv3算法进行分析和训练,以实现目标检测。结果表明,与传统的YOLOv3方法相比,基于灰度的方法能够更准确有效地感知目标。 摘要:Remote sensing is the image acquisition of a target without having physical contact with it. Nowadays remote sensing data is widely preferred due to its reduced image acquisition period. The remote sensing of ground targets is more challenging because of the various factors that affect the propagation of light through different mediums from a satellite acquisition. Several Convolutional Neural Network-based algorithms are being implemented in the field of remote sensing. Supervised learning is a machine learning technique where the data is labelled according to their classes prior to the training. In order to detect and classify the targets more accurately, YOLOv3, an algorithm based on bounding and anchor boxes is adopted. In order to handle the various effects of light travelling through the atmosphere, Grayscale based YOLOv3 configuration is introduced. For better prediction and for solving the Rayleigh scattering effect, RGB based grayscale algorithms are proposed. The acquired images are analysed and trained with the grayscale based YOLOv3 algorithm for target detection. The results show that the grayscale-based method can sense the target more accurately and effectively than the traditional YOLOv3 approach.
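
作为示意,下面用OpenCV给出灰度化预处理的最小草图:把彩色帧按加权方式转为灰度后再送入检测网络;复制成3通道只是为了兼容按3通道输入配置的YOLOv3,属于假设性的接入方式,并非论文的具体管线。

```python
import cv2
import numpy as np

def to_grayscale_3ch(bgr):
    """把彩色帧转为灰度并复制成3通道,便于直接送入按3通道配置的检测网络。"""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)  # 加权RGB转灰度
    return cv2.merge([gray, gray, gray])

frame = (np.random.rand(416, 416, 3) * 255).astype(np.uint8)  # 占位帧
print(to_grayscale_3ch(frame).shape)  # (416, 416, 3)
```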

【16】 A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models 标题:一个好的提示符抵得上数百万个参数吗?基于低资源提示的视觉语言模型学习 链接:https://arxiv.org/abs/2110.08484

作者:Woojeong Jin,Yu Cheng,Yelong Shen,Weizhu Chen,Xiang Ren 机构:University of Southern California, Microsoft Corporation 备注:Preprint 摘要:大型预训练视觉语言(VL)模型可以通过少量示例学习新任务,也可以在不进行微调的情况下推广到新任务。然而,这些巨大的VL模型由于其不切实际的巨大模型尺寸和缓慢的推理速度,很难部署到实际应用中。在这项工作中,我们提出了FewVLM,一种面向视觉语言任务的少样本提示学习器。我们使用前缀语言建模(PrefixLM)和掩码语言建模(MaskedLM)预训练了一个序列到序列Transformer模型,并引入简单的提示来提高VQA和图像字幕的零样本与少样本性能。在五个VQA和字幕数据集上的实验结果表明,FewVLM在零样本VQAv2上以18.2个百分点优于比我们大31倍的Frozen,并取得与大246$\times$的模型PICa相当的结果。我们观察到:(1)提示显著影响零样本性能,但对少样本性能影响不大;(2)MaskedLM有助于少样本VQA任务,而PrefixLM提升字幕性能;(3)当训练集规模较小时,性能提升显著。 摘要:Large pretrained vision-language (VL) models can learn a new task with a handful of examples or generalize to a new task without fine-tuning. However, these gigantic VL models are hard to deploy for real-world applications due to their impractically huge model size and slow inference speed. In this work, we propose FewVLM, a few-shot prompt-based learner on vision-language tasks. We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM), and introduce simple prompts to improve zero-shot and few-shot performance on VQA and image captioning. Experimental results on five VQA and captioning datasets show that FewVLM outperforms Frozen which is 31 times larger than ours by 18.2% point on zero-shot VQAv2 and achieves comparable results to a 246$\times$ larger model, PICa. We observe that (1) prompts significantly affect zero-shot performance but marginally affect few-shot performance, (2) MaskedLM helps few-shot VQA tasks while PrefixLM boosts captioning performance, and (3) performance significantly increases when training set size is small.

【17】 TorchEsegeta: Framework for Interpretability and Explainability of Image-based Deep Learning Models 标题:TorchEsegeta:基于图像的深度学习模型的可解释性和可解释性框架 链接:https://arxiv.org/abs/2110.08429

作者:Soumick Chatterjee,Arnab Das,Chirag Mandal,Budhaditya Mukhopadhyay,Manish Vipinraj,Aniruddh Shukla,Rajatha Nagaraja Rao,Chompunuch Sarasaen,Oliver Speck,Andreas Nürnberger 机构:Data and Knowledge Engineering Group, Otto von Guericke University Magdeburg, Germany, Biomedical Magnetic Resonance, Otto von Guericke University Magdeburg, Germany, Institute for Medical Engineering, Otto von Guericke University Magdeburg, Germany 摘要:临床医生通常非常怀疑在实践中应用自动图像处理方法,尤其是基于深度学习的方法。造成这种情况的一个主要原因是这些方法的黑箱性质,以及无法洞察自动推导出的决策这一固有问题。为了增加对这些方法的信任,本文通过描绘对算法决策影响最大的解剖区域,提出了有助于解释深度学习算法结果的方法。此外,本研究提出了一个统一的框架TorchEsegeta,用于将各种可解释性和可说明性技术应用于深度学习模型,并为临床医生生成可视化的解释与说明,以佐证其临床发现;这也将有助于建立对此类方法的信心。该框架建立在现有的可解释性和可说明性技术之上——这些技术目前主要关注分类模型——并将其扩展到分割任务;此外,这些方法还被适配到三维模型以进行体积分析。所提出的框架提供了使用不忠实度(infidelity)和敏感度(sensitivity)指标定量比较视觉解释的方法。数据科学家可以使用该框架对其模型进行事后解释与说明,开发更多可解释的工具,并向临床医生展示研究结果,以增强他们对此类模型的信心。所提出的框架基于一个用例场景进行评估:该场景中的血管分割模型是在人脑飞行时间(TOF)磁共振血管造影(MRA)图像上训练的。本文给出了不同模型与解释方法在定量和定性结果上的比较研究。此外,本文还对现有的若干可解释性和可说明性方法进行了广泛的概述。 摘要:Clinicians are often very sceptical about applying automatic image processing approaches, especially deep learning based methods, in practice. One main reason for this is the black-box nature of these approaches and the inherent problem of missing insights of the automatically derived decisions. In order to increase trust in these methods, this paper presents approaches that help to interpret and explain the results of deep learning algorithms by depicting the anatomical areas which influence the decision of the algorithm most. Moreover, this research presents a unified framework, TorchEsegeta, for applying various interpretability and explainability techniques for deep learning models and generate visual interpretations and explanations for clinicians to corroborate their clinical findings. In addition, this will aid in gaining confidence in such methods. The framework builds on existing interpretability and explainability techniques that are currently focusing on classification models, extending them to segmentation tasks. In addition, these methods have been adapted to 3D models for volumetric analysis. The proposed framework provides methods to quantitatively compare visual explanations using infidelity and sensitivity metrics. This framework can be used by data scientists to perform post-hoc interpretations and explanations of their models, develop more explainable tools and present the findings to clinicians to increase their faith in such models. The proposed framework was evaluated based on a use case scenario of vessel segmentation models trained on Time-of-flight (TOF) Magnetic Resonance Angiogram (MRA) images of the human brain. Quantitative and qualitative results of a comparative study of different models and interpretability methods are presented. Furthermore, this paper provides an extensive overview of several existing interpretability and explainability methods.

【18】 Solving Image PDEs with a Shallow Network 标题:用浅层网络求解图像偏微分方程 链接:https://arxiv.org/abs/2110.08327

作者:Pascal Tom Getreuer,Peyman Milanfar,Xiyang Luo 备注:21 pages, 22 figures, references arXiv:1802.06130, arXiv:1711.10700, arXiv:1606.01299 摘要:偏微分方程(PDE)通常被用作物理过程的模型,但在基于PDE的图像处理中也非常有趣。然而,当涉及到它们在成像中的应用时,用于求解偏微分方程的传统数值方法往往需要非常精细的网格分辨率以实现稳定性,因此具有不实际的高计算成本。这项工作将BLADE(Best Linear Adaptive Enhancement,最佳线性自适应增强)这一浅层可学习滤波框架应用于偏微分方程求解,结果表明,该方法高效、准确,在粗网格分辨率下比经典方法更可靠。因此,该模型可以灵活地用于成像中的各种问题。 摘要:Partial differential equations (PDEs) are typically used as models of physical processes but are also of great interest in PDE-based image processing. However, when it comes to their use in imaging, conventional numerical methods for solving PDEs tend to require very fine grid resolution for stability, and as a result have impractically high computational cost. This work applies BLADE (Best Linear Adaptive Enhancement), a shallow learnable filtering framework, to PDE solving, and shows that the resulting approach is efficient and accurate, operating more reliably at coarse grid resolutions than classical methods. As such, the model can be flexibly used for a wide variety of problems in imaging.
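
下面是一个概念性的PyTorch草图,示意“浅层可学习滤波求解PDE”的思想:用一组可学习滤波器与逐像素的软选择来近似一个时间步的更新,训练目标可取细网格参考解的对应时间步。滤波器数量与选择机制均为假设,并非BLADE的官方实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowPDEStep(nn.Module):
    """可学习的浅层滤波步:用一组5x5滤波器近似PDE的一个时间步(概念示意)。"""
    def __init__(self, n_filters=8):
        super().__init__()
        self.filters = nn.Parameter(torch.randn(n_filters, 1, 5, 5) * 0.01)
        self.select = nn.Conv2d(1, n_filters, 3, padding=1)  # 逐像素软选择滤波器

    def forward(self, u):
        responses = F.conv2d(u, self.filters, padding=2)     # 各滤波器的响应
        weights = F.softmax(self.select(u), dim=1)           # 逐像素的选择权重
        return u + (weights * responses).sum(dim=1, keepdim=True)

u0 = torch.rand(1, 1, 64, 64)  # 初始场/图像
u1 = ShallowPDEStep()(u0)      # 训练时可让u1逼近细网格参考解的下一时间步
print(u1.shape)
```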

其他(23篇)

【1】 Improving Robustness using Generated Data 标题:使用生成的数据提高稳健性 链接:https://arxiv.org/abs/2110.09468

作者:Sven Gowal,Sylvestre-Alvise Rebuffi,Olivia Wiles,Florian Stimberg,Dan Andrei Calian,Timothy Mann 机构:DeepMind, London 备注:Accepted at NeurIPS 2021 摘要:最近的研究表明,稳健的训练需要比标准分类所需的数据集大得多的数据集。在CIFAR-10和CIFAR-100上,这转化为仅根据原始训练集的数据训练的模型,与使用从“8000万微小图像”数据集(TI-80M)提取的附加数据训练的模型之间,存在相当大的鲁棒精度差距。在本文中,我们探讨了如何利用仅在原始训练集上训练的生成模型来人为地增加原始训练集的大小,并提高对抗$\ell_p$范数有界扰动的鲁棒性。我们确定了加入额外生成数据可以提高鲁棒性的充分条件,并证明与使用额外真实数据训练的模型相比,鲁棒精度差距可以被显著减小。令人惊讶的是,我们甚至表明,添加非真实的随机数据(由高斯采样生成)也可以提高鲁棒性。我们分别针对大小为$\epsilon=8/255$和$\epsilon=128/255$的$\ell_\infty$和$\ell_2$范数有界扰动,在CIFAR-10、CIFAR-100、SVHN和TinyImageNet上对我们的方法进行了评估。与以前最先进的方法相比,我们在鲁棒精度方面取得了很大的绝对改进。针对大小为$\epsilon=8/255$的$\ell_\infty$范数有界扰动,我们的模型在CIFAR-10和CIFAR-100上分别达到66.10%和33.49%的鲁棒精度(在最先进水平的基础上提高了+8.96%和+3.29%)。对于大小为$\epsilon=128/255$的$\ell_2$范数有界扰动,我们的模型在CIFAR-10上达到78.31%(+3.81%)。这些结果超过了以前大多数使用外部数据的工作。 摘要:Recent work argues that robust training requires substantially larger datasets than those required for standard classification. On CIFAR-10 and CIFAR-100, this translates into a sizable robust-accuracy gap between models trained solely on data from the original training set and those trained with additional data extracted from the "80 Million Tiny Images" dataset (TI-80M). In this paper, we explore how generative models trained solely on the original training set can be leveraged to artificially increase the size of the original training set and improve adversarial robustness to $\ell_p$ norm-bounded perturbations. We identify the sufficient conditions under which incorporating additional generated data can improve robustness, and demonstrate that it is possible to significantly reduce the robust-accuracy gap to models trained with additional real data. Surprisingly, we even show that even the addition of non-realistic random data (generated by Gaussian sampling) can improve robustness. We evaluate our approach on CIFAR-10, CIFAR-100, SVHN and TinyImageNet against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $\epsilon = 8/255$ and $\epsilon = 128/255$, respectively. We show large absolute improvements in robust accuracy compared to previous state-of-the-art methods. Against $\ell_\infty$ norm-bounded perturbations of size $\epsilon = 8/255$, our models achieve 66.10% and 33.49% robust accuracy on CIFAR-10 and CIFAR-100, respectively (improving upon the state-of-the-art by +8.96% and +3.29%). Against $\ell_2$ norm-bounded perturbations of size $\epsilon = 128/255$, our model achieves 78.31% on CIFAR-10 (+3.81%). These results beat most prior works that use external data.
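
作为示意,下面给出“真实样本+生成样本混合的对抗训练”一步的极简PyTorch草图:用标准PGD($\ell_\infty$)生成对抗样本,并按固定比例混合真实数据与生成模型采样的数据;混合比例、模型和数据均为演示用占位,并非论文的训练配置。

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # 标准PGD(L-infinity)攻击,用于构造对抗训练样本
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta.detach()).clamp(0, 1)

def mixed_batch(real_x, real_y, gen_x, gen_y, gen_ratio=0.5):
    # 按固定比例混合真实样本与生成模型采样的样本(比例为假设值)
    n_gen = int(len(real_x) * gen_ratio)
    x = torch.cat([real_x[: len(real_x) - n_gen], gen_x[:n_gen]])
    y = torch.cat([real_y[: len(real_y) - n_gen], gen_y[:n_gen]])
    return x, y

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))  # 占位模型
x, y = mixed_batch(torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,)),
                   torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,)))
loss = F.cross_entropy(model(pgd_attack(model, x, y)), y)  # 在混合batch上做对抗训练
loss.backward()
```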

【2】 Comparing Deep Neural Nets with UMAP Tour 标题:深度神经网络与UMAP Tour的比较 链接:https://arxiv.org/abs/2110.09431

作者:Mingwei Li,Carlos Scheidegger 机构:Department of Computer Science, The University of Arizona, Tucson, AZ 摘要:神经网络应该是人类可以理解的。特别是,人们对在层中学习的概念以及层之间的相似性越来越感兴趣。在这项工作中,构建了一个工具UMAP Tour,用于使用对齐良好的实例级表示直观地检查和比较真实世界神经网络模型的内部行为。可视化中使用的方法还意味着神经网络层之间的一种新的相似性度量。使用可视化工具和相似性度量,我们发现在最先进的模型中学习到的概念以及它们之间的差异,例如GoogLeNet和ResNet。 摘要:Neural networks should be interpretable to humans. In particular, there is a growing interest in concepts learned in a layer and similarity between layers. In this work, a tool, UMAP Tour, is built to visually inspect and compare internal behavior of real-world neural network models using well-aligned, instance-level representations. The method used in the visualization also implies a new similarity measure between neural network layers. Using the visual tool and the similarity measure, we find concepts learned in state-of-the-art models and dissimilarities between them, such as GoogLeNet and ResNet.

【3】 Distinguishing Natural and Computer-Generated Images using Multi-Colorspace fused EfficientNet 标题:利用多色空间融合高效网区分自然图像和计算机生成图像 链接:https://arxiv.org/abs/2110.09428

作者:Manjary P Gangan,Anoop K,Lajish V L 备注:13 pages 摘要:以往区分自然图像与照片级真实感计算机生成图像的工作,一次只处理自然图像与计算机图形、或自然图像与GAN图像二者之一的问题。但是在真实的图像取证场景中,考虑所有类别的图像生成方式是非常必要的,因为在大多数情况下图像的生成方式是未知的。据我们所知,我们首次将区分自然图像与照片级真实感计算机生成图像的问题作为三类分类任务来处理,对自然图像、计算机图形和GAN图像进行分类。针对该任务,我们提出了一种多色彩空间融合EfficientNet模型:并联融合三个遵循迁移学习方法的EfficientNet网络,每个网络分别在RGB、LCH和HSV色彩空间中运行,这些色彩空间是在分析了各种色彩空间变换在该图像取证问题中的效果后选定的。我们的模型在准确性、对后处理的鲁棒性以及对其他数据集的泛化能力方面优于基线。我们进行了心理物理学实验,以了解人类能够多准确地区分自然图像、计算机图形和GAN图像;我们观察到人类在对这些图像(尤其是计算机生成的图像)进行分类时存在困难,这表明该任务需要计算算法。我们还通过视觉解释分析模型的行为,以了解有助于模型决策的显著区域,并与人类参与者以区域标记形式提供的手动解释进行比较;我们可以在两种解释中观察到相似之处,这表明我们的模型能够做出有意义的决策。 摘要:The problem of distinguishing natural images from photo-realistic computer-generated ones either addresses natural images versus computer graphics or natural images versus GAN images, at a time. But in a real-world image forensic scenario, it is highly essential to consider all categories of image generation, since in most cases image generation is unknown. We, for the first time, to our best knowledge, approach the problem of distinguishing natural images from photo-realistic computer-generated images as a three-class classification task classifying natural, computer graphics, and GAN images. For the task, we propose a Multi-Colorspace fused EfficientNet model by parallelly fusing three EfficientNet networks that follow transfer learning methodology where each network operates in different colorspaces, RGB, LCH, and HSV, chosen after analyzing the efficacy of various colorspace transformations in this image forensics problem. Our model outperforms the baselines in terms of accuracy, robustness towards post-processing, and generalizability towards other datasets. We conduct psychophysics experiments to understand how accurately humans can distinguish natural, computer graphics, and GAN images where we could observe that humans find difficulty in classifying these images, particularly the computer-generated images, indicating the necessity of computational algorithms for the task. We also analyze the behavior of our model through visual explanations to understand salient regions that contribute to the model's decision making and compare with manual explanations provided by human participants in the form of region markings, where we could observe similarities in both the explanations indicating the powerful nature of our model to take the decisions meaningfully.
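
下面是一个概念性的PyTorch草图,示意三分支色彩空间并联融合的结构:三个主干分别处理RGB、LCH、HSV输入,特征拼接后做三分类。这里用小型卷积块代替EfficientNet主干,色彩空间转换需在数据管线中另行完成,均为演示用假设。

```python
import torch
import torch.nn as nn

def branch(out_dim=128):
    # 占位主干;实际可替换为三个做迁移学习的EfficientNet
    return nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, out_dim))

class MultiColorspaceNet(nn.Module):
    """三个分支分别处理RGB/LCH/HSV输入,特征拼接后分类(概念示意)。"""
    def __init__(self, num_classes=3):
        super().__init__()
        self.rgb, self.lch, self.hsv = branch(), branch(), branch()
        self.head = nn.Linear(3 * 128, num_classes)  # 三类:自然/计算机图形/GAN

    def forward(self, x_rgb, x_lch, x_hsv):
        f = torch.cat([self.rgb(x_rgb), self.lch(x_lch), self.hsv(x_hsv)], dim=1)
        return self.head(f)

x = torch.rand(2, 3, 224, 224)
print(MultiColorspaceNet()(x, x, x).shape)  # 实际使用时应先做相应的色彩空间转换
```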

【4】 FacialGAN: Style Transfer and Attribute Manipulation on Synthetic Faces 标题:FacialGAN:合成面上的样式传递和属性操作 链接:https://arxiv.org/abs/2110.09425

作者:Ricard Durall,Jireh Jam,Dominik Strassel,Moi Hoon Yap,Janis Keuper 机构:Fraunhofer ITWM, Kaiserslautern, Germany, IWR, University of Heidelberg, Heidelberg, Germany, Manchester Metropolitan University, Manchester, United Kingdom, IMLA, Offenburg University, Offenburg, Germany 摘要:人脸图像处理是一项生成任务,其中输出的人脸在人脸属性和样式方面向预定目标方向移动。最近的工作在风格迁移和属性转换等多种编辑技巧上取得了巨大成功。然而,当前的方法要么专注于纯粹的风格迁移,要么专注于预定义属性集的转换,交互性受到限制。为了解决这个问题,我们提出了FacialGAN,这是一个能够同时进行丰富风格迁移和交互式面部属性操作的新框架。在保持源图像身份的同时,我们将目标图像的各种样式迁移到源图像。然后,我们融入分割掩模的几何信息,以提供对人脸属性的细粒度操作。最后,引入多目标学习策略来优化每个特定任务的损失。以CelebAMask-HQ为语义掩码标签,在CelebA-HQ数据集上进行的实验表明,我们的模型能够在风格迁移、属性操作、多样性和人脸验证方面产生令人信服的视觉效果。为了可复现性,我们提供了一个交互式开源工具来执行面部操作,以及该模型的Pytorch实现。 摘要:Facial image manipulation is a generation task where the output face is shifted towards an intended target direction in terms of facial attribute and styles. Recent works have achieved great success in various editing techniques such as style transfer and attribute translation. However, current approaches are either focusing on pure style transfer, or on the translation of predefined sets of attributes with restricted interactivity. To address this issue, we propose FacialGAN, a novel framework enabling simultaneous rich style transfers and interactive facial attributes manipulation. While preserving the identity of a source image, we transfer the diverse styles of a target image to the source image. We then incorporate the geometry information of a segmentation mask to provide a fine-grained manipulation of facial attributes. Finally, a multi-objective learning strategy is introduced to optimize the loss of each specific tasks. Experiments on the CelebA-HQ dataset, with CelebAMask-HQ as semantic mask labels, show our model's capacity in producing visually compelling results in style transfer, attribute manipulation, diversity and face verification. For reproducibility, we provide an interactive open-source tool to perform facial manipulations, and the Pytorch implementation of the model.

【5】 NeuralBlox: Real-Time Neural Representation Fusion for Robust Volumetric Mapping 标题:NeuralBlox:用于鲁棒体积映射的实时神经表示融合 链接:https://arxiv.org/abs/2110.09415

作者:Stefan Lionar,Lukas Schmid,Cesar Cadena,Roland Siegwart,Andrei Cramariuc 机构:Autonomous Systems Lab, ETH Z¨urich, Switzerland 备注:3DV 2021. Equal contribution between the first two authors. Code: this https URL 摘要:我们提出了一种新的三维映射方法,利用神经隐式表示的最新进展进行三维重建。大多数现有的最先进的神经隐式表示方法仅限于对象级重建,并且不能在给定新数据的情况下增量执行更新。在这项工作中,我们提出了一种融合策略和训练管道,以增量方式构建和更新神经隐式表示,从而能够从连续的局部观测中重建大型场景。通过将任意大小的场景表示为潜在代码的网格并直接在潜在空间中执行更新,我们表明即使在CPU上也可以实时获得增量构建的占用地图。与传统方法(如截断符号距离场(TSDF))相比,我们的地图表示方法在给定噪声输入的情况下,在生成更好的场景完整性方面具有更高的鲁棒性。我们展示了我们的方法在实际数据集上的性能,这些数据集具有不同程度的附加姿态噪声。 摘要:We present a novel 3D mapping method leveraging the recent progress in neural implicit representation for 3D reconstruction. Most existing state-of-the-art neural implicit representation methods are limited to object-level reconstructions and can not incrementally perform updates given new data. In this work, we propose a fusion strategy and training pipeline to incrementally build and update neural implicit representations that enable the reconstruction of large scenes from sequential partial observations. By representing an arbitrarily sized scene as a grid of latent codes and performing updates directly in latent space, we show that incrementally built occupancy maps can be obtained in real-time even on a CPU. Compared to traditional approaches such as Truncated Signed Distance Fields (TSDFs), our map representation is significantly more robust in yielding a better scene completeness given noisy inputs. We demonstrate the performance of our approach in thorough experimental validation on real-world datasets with varying degrees of added pose noise.

【6】 Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes 标题:适用于不同大小半规则网格的网格卷积自动编码器 链接:https://arxiv.org/abs/2110.09401

作者:Sara Hahner,Jochen Garcke 机构:Fraunhofer Center for Machine Learning and SCAI; University of Bonn 摘要:由于低维嵌入可用于可视化底层动力学,自动编码器可以加速变形三维曲面网格的分析。但是,最先进的网格卷积自动编码器要求自动编码器处理的所有输入网格具有固定的连通性。这是由于其使用了谱卷积层或依赖于网格的池化操作。因此,可研究的数据集类型受到限制,所学的知识也不能迁移到表现出类似行为的其他数据集。为了解决这个问题,我们把曲面的离散化转化为具有局部规则连通性、且网格划分具有层次结构的半规则网格。这允许我们将相同的空间卷积滤波器应用于局部邻域,并定义可应用于每个半规则网格的池化算子。我们将同一个网格自动编码器应用于不同的数据集,我们的重建误差比最先进的模型(这些模型必须针对每种网格分别训练)的误差低50%以上。此外,我们还用在不同类别网格上训练的自动编码器,对未见过的网格序列的底层动力学进行了可视化。 摘要:The analysis of deforming 3D surface meshes is accelerated by autoencoders since the low-dimensional embeddings can be used to visualize underlying dynamics. But, state-of-the-art mesh convolutional autoencoders require a fixed connectivity of all input meshes handled by the autoencoder. This is due to either the use of spectral convolutional layers or mesh dependent pooling operations. Therefore, the types of datasets that one can study are limited and the learned knowledge cannot be transferred to other datasets that exhibit similar behavior. To address this, we transform the discretization of the surfaces to semi-regular meshes that have a locally regular connectivity and whose meshing is hierarchical. This allows us to apply the same spatial convolutional filters to the local neighborhoods and to define a pooling operator that can be applied to every semi-regular mesh. We apply the same mesh autoencoder to different datasets and our reconstruction error is more than 50% lower than the error from state-of-the-art models, which have to be trained for every mesh separately. Additionally, we visualize the underlying dynamics of unseen mesh sequences with an autoencoder trained on different classes of meshes.

【7】 Neuro-Symbolic Forward Reasoning 标题:神经符号正向推理 链接:https://arxiv.org/abs/2110.09383

作者:Hikaru Shindo,Devendra Singh Dhami,Kristian Kersting 机构:AI and Machine Learning Group, Dept. of Computer Science, TU Darmstadt, Germany, Centre for Cognitive Science, TU Darmstadt, Germany, Hessian Center for AI (hessian.AI), Darmstadt, Germany 备注:Preprint 摘要:推理是人类智能的重要组成部分,因此一直是人工智能研究的一个长期目标。随着深度学习最近的成功,将推理与深度学习系统相结合,即神经符号人工智能,已成为一个主要的兴趣领域。我们提出了神经符号前向推理器(NSFR),这是一种利用一阶逻辑的可微前向链推理来处理推理任务的新方法。其关键思想是将可微前向链推理与以对象为中心的(深度)学习相结合。可微前向链推理平滑地计算逻辑蕴涵,即以可微的方式从给定的事实和规则中推导出新的事实。以对象为中心的学习方法则把原始输入分解为以对象为单位的表示。因此,这使我们能够提供一个一致的框架,从原始输入执行前向链推理。NSFR将原始输入分解为以对象为中心的表示,将其转换为概率化的基原子(ground atoms),最后使用加权规则进行可微前向链推理。我们在以对象为中心的推理数据集(2D Kandinsky图案和3D CLEVR-Hans)以及各种任务上的综合实验评估,表明了我们方法的有效性和优势。 摘要:Reasoning is an essential part of human intelligence and thus has been a long-standing goal in artificial intelligence research. With the recent success of deep learning, incorporating reasoning with deep learning systems, i.e., neuro-symbolic AI has become a major field of interest. We propose the Neuro-Symbolic Forward Reasoner (NSFR), a new approach for reasoning tasks taking advantage of differentiable forward-chaining using first-order logic. The key idea is to combine differentiable forward-chaining reasoning with object-centric (deep) learning. Differentiable forward-chaining reasoning computes logical entailments smoothly, i.e., it deduces new facts from given facts and rules in a differentiable manner. The object-centric learning approach factorizes raw inputs into representations in terms of objects. Thus, it allows us to provide a consistent framework to perform the forward-chaining inference from raw inputs. NSFR factorizes the raw inputs into the object-centric representations, converts them into probabilistic ground atoms, and finally performs differentiable forward-chaining inference using weighted rules for inference. Our comprehensive experimental evaluations on object-centric reasoning data sets, 2D Kandinsky patterns and 3D CLEVR-Hans, and a variety of tasks show the effectiveness and advantage of our approach.
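
可微前向链推理的要点是把逻辑运算软化为[0,1]上的可微运算;下面用几行PyTorch示意一步推理:软合取取乘积、软析取取最大值,规则权重因此可以通过梯度学习。具体的软化算子与记号以论文为准,此处仅为常见选法的示意。

```python
import torch

# 事实的真值用[0,1]上的概率表示;规则 head <- body1, body2 的软合取取乘积
facts = {"red(obj1)": torch.tensor(0.9), "sphere(obj1)": torch.tensor(0.8)}
rule_weight = torch.tensor(0.95, requires_grad=True)  # 可学习的规则权重

body = facts["red(obj1)"] * facts["sphere(obj1)"]     # 软合取
new_truth = rule_weight * body                        # 加权规则推出新原子的真值
facts["target(obj1)"] = torch.maximum(torch.tensor(0.0), new_truth)  # 与旧真值做软析取

facts["target(obj1)"].backward()  # 推理全程可微,可端到端训练
print(rule_weight.grad)           # 0.9 * 0.8 = 0.72
```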

【8】 Mixed Reality using Illumination-aware Gradient Mixing in Surgical Telepresence: Enhanced Multi-layer Visualization 标题:手术临场感中使用照明感知梯度混合的混合现实:增强的多层可视化 链接:https://arxiv.org/abs/2110.09318

作者:Nirakar Puri,Abeer Alsadoon,P. W. C. Prasad,Nada Alsalami,Tarik A. Rashid 备注:None 摘要:背景与目的:使用增强感知的手术临场感已经得到应用,但混合现实仍处于研究阶段,仅停留在理论层面。这项工作的目的是提出一种解决方案:当输入源视频和目标视频中的照明强度变化时,通过生成全局一致的视频来改善最终合成视频中的可视化效果。方法:所提出的系统使用增强的多层可视化,并借助光照感知视频合成算法进行光照感知的梯度混合。粒子群优化算法用于从前景和背景区域以及图像像素相关性中寻找最佳样本对,以估计alpha蒙版;它有助于获得未知区域中未知像素的原始颜色和深度。结果:我们的结果显示,在肠道、颌骨和乳房各10个样本上,通过降低为未知区域选择最佳样本对的均方误差,提高了准确性;与最先进的系统相比,该误差降低了16.48%。由此,可见性精度从89.4%提高到97.7%,即使在光照不同的情况下也有助于获得清晰的手部视野。结论:光照效应和alpha像素相关性提高了可视化精度,在合成两段光照效果强烈且相反的视频时,能产生全局一致的合成结果并保持时间一致性。此外,本文还为未知区域选择最佳采样对以获得原始颜色和深度提供了解决方案。 摘要:Background and aim: Surgical telepresence using augmented perception has been applied, but mixed reality is still being researched and is only theoretical. The aim of this work is to propose a solution to improve the visualization in the final merged video by producing globally consistent videos when the intensity of illumination in the input source and target video varies. Methodology: The proposed system uses an enhanced multi-layer visualization with illumination-aware gradient mixing using Illumination Aware Video Composition algorithm. Particle Swarm Optimization Algorithm is used to find the best sample pair from foreground and background region and image pixel correlation to estimate the alpha matte. Particle Swarm Optimization algorithm helps to get the original colour and depth of the unknown pixel in the unknown region. Result: Our results showed improved accuracy caused by reducing the Mean squared Error for selecting the best sample pair for unknown region in 10 samples each for bowel, jaw and breast. The amount of this reduction is 16.48% compared to the state-of-the-art system. As a result, the visibility accuracy is improved from 89.4 to 97.7% which helped to keep the hand clearly visible even under differing light. Conclusion: Illumination effect and alpha pixel correlation improves the visualization accuracy and produces a globally consistent composition results and maintains the temporal coherency when compositing two videos with high and inverse illumination effect. In addition, this paper provides a solution for selecting the best sampling pair for the unknown region to obtain the original colour and depth.

【9】 SafeAccess+: An Intelligent System to make Smart Home Safer and Americans with Disability Act Compliant 标题:SafeAccess+:一种智能系统,使智能家庭更安全,并符合美国残疾人法案 链接:https://arxiv.org/abs/2110.09273

作者:Shahinur Alam 机构:Department of Electrical and Computer Engineering, The University of Memphis, Engineering Science Building, Memphis, TN 摘要:智能家居正变得无处不在,但它们并不符合美国残疾人法案(ADA)。配备符合ADA标准的设备和服务的智能家居,对于残疾人(如视觉障碍和行动受限者)提高独立性、安全性和生活质量至关重要。尽管智能家居技术取得了诸多进步,但仍存在一些基本的设计和实现问题。例如,当有人敲门或按门铃时,残疾人往往会感到不安。在本文中,我们提出了一个名为“SafeAccess+”的智能系统,以构建更安全且符合ADA标准的场所(如智能家居、办公室)。SafeAccess+的主要功能是:1) 监控场所内外并识别进入人员;2) 为用户提供相关信息,以评估即将到来的威胁(如入室盗窃、抢劫)和正在进行的犯罪;3) 允许用户为朋友/家人授予安全进入住所的权限。我们解决了若干技术与研究挑战:开发检测和识别人员/活动的模型、生成图像描述、设计符合ADA标准的端到端系统。此外,我们还设计了一款智能门原型,展示概念验证。该场所预计将配备置于战略位置的摄像机,便于全天候监控,以识别进入人员并生成图像描述。系统根据图像描述生成预结构化消息,以评估传入的威胁并立即通知用户。通过严格的定量评估,确保了模型的完整性和泛化性。系统的用户满意度和可靠性使用PYTHEIA量表进行测量,并被评为优秀(内部一致性Cronbach's alpha为0.784,重测信度为0.939)。 摘要:Smart homes are becoming ubiquitous, but they are not Americans with Disability Act (ADA) compliant. Smart homes equipped with ADA compliant appliances and services are critical for people with disabilities (i.e., visual impairments and limited mobility) to improve independence, safety, and quality of life. Despite all advancements in smart home technologies, some fundamental design and implementation issues remain. For example, people with disabilities often feel insecure to respond when someone knocks on the door or rings the doorbell. In this paper, we present an intelligent system called "SafeAccess+" to build safer and ADA compliant premises (e.g. smart homes, offices). The key functionalities of the SafeAccess+ are: 1) Monitoring the inside/outside of premises and identifying incoming people; 2) Providing users relevant information to assess incoming threats (e.g., burglary, robbery) and ongoing crimes 3) Allowing users to grant safe access to homes for friends/family members. We have addressed several technical and research challenges: - developing models to detect and recognize person/activity, generating image descriptions, designing ADA compliant end-end system. In addition, we have designed a prototype smart door showcasing the proof-of-concept. The premises are expected to be equipped with cameras placed in strategic locations that facilitate monitoring the premise 24/7 to identify incoming persons and to generate image descriptions. The system generates a pre-structured message from the image description to assess incoming threats and immediately notify the users. The completeness and generalization of models have been ensured through a rigorous quantitative evaluation. The users' satisfaction and reliability of the system has been measured using PYTHEIA scale and was rated excellent (Internal Consistency-Cronbach's alpha is 0.784, Test-retest reliability is 0.939)

【10】 SynCoLFinGer: Synthetic Contactless Fingerprint Generator 标题:SynCoLFinGer:合成非接触式指纹发生器 链接:https://arxiv.org/abs/2110.09144

作者:Jannis Priesnitz,Christian Rathgeb,Nicolas Buchmann,Christoph Busch 机构:∗Hochschule Darmstadt, Germany, †Freie Universit¨at Berlin, Germany 摘要:我们提出了第一种合成非接触式指纹图像的方法,称为SynCoLFinGer。为此,对非接触式指纹图像中与捕获、对象特征和环境影响有关的组成部分进行建模,并使用SFinGe算法将其应用于合成生成的脊线图案。该方法能够生成对应于单个手指的不同合成样本,并且可以参数化生成不同质量级别的非接触指纹图像。通过使用自适应NFIQ 2.0算法评估生物特征样本质量,并使用最先进的非接触式指纹识别系统评估生物特征实用程序,确认合成生成的非接触式指纹与真实指纹的相似性。 摘要:We present the first method for synthetic generation of contactless fingerprint images, referred to as SynCoLFinGer. To this end, the constituent components of contactless fingerprint images regarding capturing, subject characteristics, and environmental influences are modeled and applied to a synthetically generated ridge pattern using the SFinGe algorithm. The proposed method is able to generate different synthetic samples corresponding to a single finger and it can be parameterized to generate contactless fingerprint images of various quality levels. The resemblance of the synthetically generated contactless fingerprints to real fingerprints is confirmed by evaluating biometric sample quality using an adapted NFIQ 2.0 algorithm and biometric utility using a state-of-the-art contactless fingerprint recognition system.

【11】 Differentiable Rendering with Perturbed Optimizers 标题:带扰动优化器的可微渲染 链接:https://arxiv.org/abs/2110.09107

作者:Quentin Le Lidec,Ivan Laptev,Cordelia Schmid,Justin Carpentier 机构:PSL Research University 摘要:从二维图像投影中推理三维场景是计算机视觉的核心问题之一。这一逆问题、不适定问题的求解通常涉及寻找最能解释观测图像数据的模型。值得注意的是,图像既取决于被观察场景的特性,也取决于图像形成的过程。因此,如果要使用优化技术来解释图像,那么设计把三维场景投影到图像中的可微函数(也称为可微渲染)至关重要。以前的可微渲染方法通常用平滑近似代替不可微操作,从而影响后续的3D估计。在本文中,我们采用一种更一般的方法,从随机优化及相关的扰动优化器概念的视角来研究可微渲染器。特别是,我们的工作强调了一些著名的可微渲染器公式和随机平滑优化器之间的联系,并引入了可微扰动渲染器。我们还提出了一种方差减少机制,以减轻扰动优化器固有的计算负担,并引入了一种自适应方案来自动调整渲染过程的平滑参数。我们将我们的方法应用于三维场景重建,并在6D姿势估计和三维网格重建任务中展示了它的优势。通过提供可用作强监督信号的信息梯度,我们展示了与使用平滑梯度近似的最新替代方案相比,扰动渲染器能获得更精确的解。 摘要:Reasoning about 3D scenes from their 2D image projections is one of the core problems in computer vision. Solutions to this inverse and ill-posed problem typically involve a search for models that best explain observed image data. Notably, images depend both on the properties of observed scenes and on the process of image formation. Hence, if optimization techniques should be used to explain images, it is crucial to design differentiable functions for the projection of 3D scenes into images, also known as differentiable rendering. Previous approaches to differentiable rendering typically replace non-differentiable operations by smooth approximations, impacting the subsequent 3D estimation. In this paper, we take a more general approach and study differentiable renderers through the prism of randomized optimization and the related notion of perturbed optimizers. In particular, our work highlights the link between some well-known differentiable renderer formulations and randomly smoothed optimizers, and introduces differentiable perturbed renderers. We also propose a variance reduction mechanism to alleviate the computational burden inherent to perturbed optimizers and introduce an adaptive scheme to automatically adjust the smoothing parameters of the rendering process. We apply our method to 3D scene reconstruction and demonstrate its advantages on the tasks of 6D pose estimation and 3D mesh reconstruction. By providing informative gradients that can be used as a strong supervisory signal, we demonstrate the benefits of perturbed renderers to obtain more accurate solutions when compared to the state-of-the-art alternatives using smooth gradient approximations.
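
扰动优化器的核心构件是对离散的argmax做随机平滑:前向用蒙特卡洛估计期望解,反向用噪声给出雅可比估计(此类方法的常见形式为 $J_{ij}=\mathbb{E}[\text{onehot}_i Z_j]/\sigma$)。下面给出一个与渲染无关的一维极简草图,仅示意该机制,并非论文的逐字实现。

```python
import torch

class PerturbedArgmax(torch.autograd.Function):
    """用高斯扰动平滑argmax,得到可反向传播的离散操作(随机平滑示意)。"""
    @staticmethod
    def forward(ctx, theta, sigma, n):
        noise = torch.randn(n, *theta.shape)
        onehot = torch.nn.functional.one_hot(
            (theta + sigma * noise).argmax(-1), theta.shape[-1]).float()
        ctx.save_for_backward(onehot, noise)
        ctx.sigma = sigma
        return onehot.mean(0)  # E[argmax(theta + sigma * Z)]

    @staticmethod
    def backward(ctx, grad_out):
        onehot, noise = ctx.saved_tensors
        # 雅可比的蒙特卡洛估计:J_ij = E[onehot_i * noise_j] / sigma
        jac = torch.einsum('ni,nj->ij', onehot, noise) / (noise.shape[0] * ctx.sigma)
        return jac.t() @ grad_out, None, None

theta = torch.tensor([1.0, 1.2, 0.8], requires_grad=True)
y = PerturbedArgmax.apply(theta, 0.5, 2000)
((y - torch.tensor([0., 0., 1.])) ** 2).sum().backward()
print(theta.grad)  # 离散的argmax经平滑后得到非零梯度
```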

【12】 Localization with Sampling-Argmax 标题:带采样的定位-Argmax 链接:https://arxiv.org/abs/2110.08825

作者:Jiefeng Li,Tong Chen,Ruiqi Shi,Yujing Lou,Yong-Lu Li,Cewu Lu 机构:Shanghai Jiao Tong University 备注:NeurIPS 2021 摘要:Soft-argmax operation is commonly adopted in detection-based methods to localize the target position in a differentiable manner. However, training the neural network with soft-argmax makes the shape of the probability map unconstrained. Consequently, the model lacks pixel-wise supervision through the map during training, leading to performance degradation. In this work, we propose sampling-argmax, a differentiable training method that imposes implicit constraints to the shape of the probability map by minimizing the expectation of the localization error. To approximate the expectation, we introduce a continuous formulation of the output distribution and develop a differentiable sampling process. The expectation can be approximated by calculating the average error of all samples drawn from the output distribution. We show that sampling-argmax can seamlessly replace the conventional soft-argmax operation on various localization tasks. Comprehensive experiments demonstrate the effectiveness and flexibility of the proposed method. Code is available at https://github.com/Jeff-sjtu/sampling-argmax
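
用一维热图可以直观对比soft-argmax与sampling-argmax的目标:前者先取期望坐标再算误差,对概率图形状没有约束;后者直接最小化定位误差的期望,从而对概率图形状施加隐式约束。下面的草图用离散分布的精确期望来示意该目标(论文实际采用连续化的输出分布与可微采样来近似期望)。

```python
import torch

heatmap = torch.randn(64, requires_grad=True)  # 一维示意;二维情形同理
positions = torch.linspace(0, 1, 64)
gt = torch.tensor(0.3)

p = torch.softmax(heatmap, dim=0)

# soft-argmax:先取期望坐标,再对坐标算误差
loss_soft = (p @ positions - gt).abs()

# sampling-argmax的目标:直接最小化定位误差的期望(此处为离散分布下的精确期望)
loss_sampling = (p * (positions - gt).abs()).sum()
loss_sampling.backward()  # 对概率图逐位置施加了监督
```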

【13】 Revealing Disocclusions in Temporal View Synthesis through Infilling Vector Prediction 标题:利用填充矢量预测揭示时态视图合成中的错位 链接:https://arxiv.org/abs/2110.08805

作者:Vijayalakshmi Kanchana,Nagabhushan Somraj,Suraj Yadwad,Rajiv Soundararajan 机构:Indian Institute of Science, Bengaluru, India 备注:WACV 2022. this https URL 摘要:We consider the problem of temporal view synthesis, where the goal is to predict a future video frame from the past frames using knowledge of the depth and relative camera motion. In contrast to revealing the disoccluded regions through intensity based infilling, we study the idea of an infilling vector to infill by pointing to a non-disoccluded region in the synthesized view. To exploit the structure of disocclusions created by camera motion during their infilling, we rely on two important cues, temporal correlation of infilling directions and depth. We design a learning framework to predict the infilling vector by computing a temporal prior that reflects past infilling directions and a normalized depth map as input to the network. We conduct extensive experiments on a large scale dataset we build for evaluating temporal view synthesis in addition to the SceneNet RGB-D dataset. Our experiments demonstrate that our infilling vector prediction approach achieves superior quantitative and qualitative infilling performance compared to other approaches in literature.

【14】 Nonlinear Transform Induced Tensor Nuclear Norm for Tensor Completion 标题:张量完备的非线性变换诱导张量核范数 链接:https://arxiv.org/abs/2110.08774

作者:Ben-Zheng Li,Xi-Le Zhao,Teng-Yu Ji,Xiong-Jun Zhang,Ting-Zhu Huang 备注:Nonlinear transform, tensor nuclear norm, proximal alternating minimization, tensor completion 摘要:The linear transform-based tensor nuclear norm (TNN) methods have recently obtained promising results for tensor completion. The main idea of this type of methods is exploiting the low-rank structure of frontal slices of the targeted tensor under the linear transform along the third mode. However, the low-rankness of frontal slices is not significant under linear transforms family. To better pursue the low-rank approximation, we propose a nonlinear transform-based TNN (NTTNN). More concretely, the proposed nonlinear transform is a composite transform consisting of the linear semi-orthogonal transform along the third mode and the element-wise nonlinear transform on frontal slices of the tensor under the linear semi-orthogonal transform, which are indispensable and complementary in the composite transform to fully exploit the underlying low-rankness. Based on the suggested low-rankness metric, i.e., NTTNN, we propose a low-rank tensor completion (LRTC) model. To tackle the resulting nonlinear and nonconvex optimization model, we elaborately design the proximal alternating minimization (PAM) algorithm and establish the theoretical convergence guarantee of the PAM algorithm. Extensive experimental results on hyperspectral images, multispectral images, and videos show that our method outperforms linear transform-based state-of-the-art LRTC methods qualitatively and quantitatively.
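
按摘要的描述,所提出的复合变换可以示意性地写成“沿第三模态的线性半正交变换 + 正面切片上的逐元素非线性”的组合(记号为假设,具体定义以论文为准):

```latex
\mathcal{T}(\mathcal{X}) \;=\; \phi\big(\mathcal{X} \times_3 \mathbf{U}\big),
\qquad \mathbf{U}^{\top}\mathbf{U} = \mathbf{I}
```

其中 $\times_3$ 为模-3乘积,$\mathbf{U}$ 为半正交矩阵,$\phi(\cdot)$ 为作用在正面切片上的逐元素非线性;NTTNN随后在 $\mathcal{T}(\mathcal{X})$ 的正面切片上度量低秩性。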

【15】 AE-StyleGAN: Improved Training of Style-Based Auto-Encoders 标题:AE-StyleGan:改进的基于风格的自动编码器训练 链接:https://arxiv.org/abs/2110.08718

作者:Ligong Han,Sri Harsha Musunuri,Martin Renqiang Min,Ruijiang Gao,Yu Tian,Dimitris Metaxas 机构:Rutgers University ,NEC Labs America ,The University of Texas at Austin 备注:Accepted at WACV-22 摘要:StyleGANs have shown impressive results on data generation and manipulation in recent years, thanks to its disentangled style latent space. A lot of efforts have been made in inverting a pretrained generator, where an encoder is trained ad hoc after the generator is trained in a two-stage fashion. In this paper, we focus on style-based generators asking a scientific question: Does forcing such a generator to reconstruct real data lead to more disentangled latent space and make the inversion process from image to latent space easy? We describe a new methodology to train a style-based autoencoder where the encoder and generator are optimized end-to-end. We show that our proposed model consistently outperforms baselines in terms of image inversion and generation quality. Supplementary, code, and pretrained models are available on the project website.

【16】 Automated Remote Sensing Forest Inventory Using Satelite Imagery 标题:利用卫星影像进行森林资源自动遥感调查 链接:https://arxiv.org/abs/2110.08590

作者:Abduragim Shtanchaev,Artur Bille,Olga Sutyrina,Sara Elelimy 机构:Skolkovo Institute of Science and Technology, Bolshoy Boulevard , bld. , Moscow, Russia, Ulm University, Helmholtzstraße , Ulm, Germany, Corresponding Author 备注:15 pages, 11 figures, 71th International Astronautical Congress (IAC) - The CyberSpace Edition 摘要:For many countries like Russia, Canada, or the USA, a robust and detailed tree species inventory is essential to manage their forests sustainably. Since one can not apply unmanned aerial vehicle (UAV) imagery-based approaches to large-scale forest inventory applications, the utilization of machine learning algorithms on satellite imagery is a rising topic of research. Although satellite imagery quality is relatively low, additional spectral channels provide a sufficient amount of information for tree crown classification tasks. Assuming that tree crowns are detected already, we use embeddings of tree crowns generated by Autoencoders as a data set to train classical Machine Learning algorithms. We compare our Autoencoder (AE) based approach to traditional convolutional neural networks (CNN) end-to-end classifiers.
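
摘要所述流程可以浓缩为“自编码器嵌入 + 经典机器学习分类器”两步;下面是第二步的scikit-learn极简草图,其中嵌入与标签均用随机数占位,嵌入维度与分类器选择均为演示用假设。

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 假设autoencoder已把每个树冠图像块编码为64维嵌入(此处用随机数占位)
embeddings = np.random.rand(1000, 64)
species = np.random.randint(0, 5, 1000)  # 5个树种标签(占位)

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, species, test_size=0.2)
clf = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)
print("测试精度:", clf.score(X_te, y_te))
```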

【17】 Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation Link: https://arxiv.org/abs/2110.08571

Authors: Yang Wu, Shirui Feng, Guanbin Li, Liang Lin Affiliations: School of Computer Science and Engineering, Sun Yat-sen University Abstract: An embodied task such as embodied question answering (EmbodiedQA) requires an agent to explore the environment and collect clues to answer a given question related to specific objects in the scene. The solution to such a task usually includes two stages: a navigator and a visual Q&A module. In this paper, we focus on navigation and address the problem that existing navigation algorithms lack experience and common sense, which essentially results in failing to find the target when the robot is spawned in unknown environments. Inspired by the human ability to think twice before moving and to conceive several feasible paths towards a goal in unfamiliar scenes, we present a route planning method named the Path Estimation and Memory Recalling (PEMR) framework. PEMR includes a "looking ahead" process, i.e., a visual feature extractor module that estimates feasible paths for gathering 3D navigational information, mimicking the human sense of direction. PEMR contains another, "looking behind", process: a memory recall mechanism that aims at fully leveraging the past experience collected by the feature extractor. Last but not least, to encourage the navigator to learn more accurate prior expert experience, we improve the original benchmark dataset and provide a family of evaluation metrics for diagnosing both the navigation and question answering modules. We show strong experimental results of PEMR on the EmbodiedQA navigation task.
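
One way to read the "looking behind" step is as attention of the current observation over stored past features; the sketch below reflects that reading only, not the authors' actual mechanism, and every name in it is made up.

```python
import torch
import torch.nn.functional as F

def memory_recall(query, memory):
    """Attend over stored past features (one reading of 'looking behind').

    query:  (d,) feature of the current observation
    memory: (n, d) features gathered earlier along the route
    Returns a context vector summarizing relevant past experience.
    """
    scores = memory @ query / query.shape[0] ** 0.5  # scaled dot products
    weights = F.softmax(scores, dim=0)               # relevance of each memory
    return weights @ memory                          # weighted recall

memory = torch.randn(50, 128)  # 50 remembered steps, 128-dim features
query = torch.randn(128)
print(memory_recall(query, memory).shape)  # torch.Size([128])
```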

【18】 Starkit: RoboCup Humanoid KidSize 2021 Worldwide Champion Team Paper Link: https://arxiv.org/abs/2110.08377

Authors: Egor Davydenko, Ivan Khokhlov, Vladimir Litvinenko, Ilya Ryakin, Ilya Osokin, Azer Babaev Affiliations: Team Starkit, Moscow Institute of Physics and Technology, Russia Note: 15 pages, 10 figures Abstract: This article is devoted to the features that were under development between RoboCup 2019 Sydney and RoboCup 2021 Worldwide. These features include vision-related matters, such as detection and localization, as well as mechanical and algorithmic novelties. Since the competition was held virtually, simulation-specific features are also considered. We give an overview of the approaches that were tried out, along with an analysis of their preconditions and perspectives and an evaluation of their performance.

【19】 Counting Objects by Diffused Index: geometry-free and training-free approach Link: https://arxiv.org/abs/2110.08365

Authors: Mengyi Tang, Maryam Yashtini, Sung Ha Kang Abstract: Counting objects is a fundamental but challenging problem. In this paper, we propose diffusion-based, geometry-free, and learning-free methodologies to count the number of objects in images. The main idea is to represent each object by a unique index value, regardless of its intensity or size, and to simply count the number of index values. First, we place different vectors, referred to as seed vectors, uniformly throughout the mask image; the mask image contains boundary information for the objects to be counted. Second, the seeds are diffused using an edge-weighted harmonic variational optimization model within each object. We propose an efficient algorithm based on an operator splitting approach and an alternating direction minimization method, and give a theoretical analysis of this algorithm. An optimal solution of the model is obtained when the distributed seeds are completely diffused, such that there is a unique intensity within each object, which we refer to as an index. For computational efficiency, we stop the diffusion process before full convergence and propose to cluster the diffused index values. We refer to this approach as Counting Objects by Diffused Index (CODI). We explore scalar and multi-dimensional seed vectors. For scalar seeds, we count by Gaussian fitting to the histogram, while for vector seeds we exploit a high-dimensional clustering method for the final counting step. The proposed method is flexible even if the boundary of an object is neither clear nor fully enclosed. We present counting results in various applications such as biological cells, agriculture, concert crowds, and transportation, together with comparisons against existing methods.
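
For intuition, here is a heavily simplified scalar-seed variant in NumPy: seeds are diffused by lazy neighbour averaging restricted to the mask (rather than the paper's edge-weighted harmonic model with operator splitting), and distinct converged values are counted directly instead of by Gaussian fitting.

```python
import numpy as np

def codi_scalar(mask, iters=2000, seed=0):
    """Count objects in a binary mask via diffused scalar seeds (sketch).

    Random seed values are repeatedly averaged with themselves and their
    in-mask 4-neighbours, so every connected object settles to a single
    constant 'index' value; counting distinct values counts the objects.
    """
    rng = np.random.default_rng(seed)
    m = mask.astype(float)
    u = rng.random(mask.shape) * m
    for _ in range(iters):
        num, den = u * m, m.copy()              # include the pixel itself
        for ax in (0, 1):
            for sh in (1, -1):
                num += np.roll(u * m, sh, axis=ax)
                den += np.roll(m, sh, axis=ax)
        u = np.where(m > 0, num / np.maximum(den, 1.0), 0.0)
    return len(np.unique(np.round(u[m > 0], 3)))

mask = np.zeros((40, 40))
mask[5:12, 5:12] = 1      # object 1
mask[25:35, 20:30] = 1    # object 2
print(codi_scalar(mask))  # should print 2, one index per object
```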

【20】 Body Part Regression for CT Images Link: https://arxiv.org/abs/2110.09148

Authors: Sarah Schuhegger Abstract: One of the greatest challenges in the medical imaging domain is to successfully transfer deep learning models into clinical practice. Since models are often trained on a specific body region, a robust transfer into the clinic necessitates selecting images whose body regions fit the algorithm, to avoid false-positive predictions in unknown regions. Due to the insufficient and inaccurate nature of manually defined imaging meta-data, automated body part recognition is a key ingredient for the broad and reliable adoption of medical deep learning models. While some approaches to this task have been presented in the past, building and evaluating robust algorithms for fine-grained body part recognition remains challenging. So far, no easy-to-use method exists to determine the scanned body range of medical computed tomography (CT) volumes. In this thesis, a self-supervised body part regression model for CT volumes is developed and trained on a heterogeneous collection of CT studies. Furthermore, it is demonstrated how the algorithm can contribute to the robust and reliable transfer of medical models into the clinic. Finally, easy application of the developed method is ensured by integrating it into the medical platform toolkit Kaapana and providing it as a Python package at https://github.com/MIC-DKFZ/BodyPartRegression .
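
The self-supervised training signal can be sketched as a slice-ordering objective: predicted per-slice scores should increase monotonically along the scan axis. The code below is only our guess at the spirit of such a loss, with a toy scorer; it does not reproduce the released package's actual objective or API.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 1))  # toy slice scorer

def order_loss(slices, spacing):
    """Sketch of a self-supervised slice-ordering objective.

    Each axial slice gets one scalar score; consecutive slices of the same
    volume should differ by an amount proportional to their physical
    distance, so the score becomes a monotone coordinate along the body axis.
    """
    scores = net(slices).squeeze(1)           # one scalar per slice
    diffs = scores[1:] - scores[:-1]          # gaps between neighbours
    target = torch.full_like(diffs, spacing)  # desired monotone increase
    return nn.functional.mse_loss(diffs, target)

volume = torch.rand(12, 1, 32, 32)  # 12 consecutive axial slices
print(order_loss(volume, spacing=0.1))
```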

【21】 Salt and pepper noise removal method based on stationary Framelet transform with non-convex sparsity regularization Link: https://arxiv.org/abs/2110.09113

Authors: Yingpin Chen, Lingzhi Wang, Huiying Huang, Jianhua Song, Chaoqun Yu, Yanping Xu Affiliations: School of Mathematical Sciences, University of Electronic Science and Technology; School of Physics and Information Engineering, Minnan Normal University Abstract: Salt and pepper noise removal is a common inverse problem in image processing that aims to restore image information with high quality. Traditional salt and pepper denoising methods have two limitations. First, the noise characteristics are often not described accurately; for example, the noise location information is often ignored, and the sparsity of the salt and pepper noise is often described by the L1 norm, which cannot characterize the sparse variables clearly. Second, conventional methods separate the contaminated image into a recovered image and a noise part, thus recovering an image with unsatisfactory smooth and detail parts. In this study, we introduce a noise detection strategy to determine the position of the noise, and a non-convex sparsity regularization given by the Lp quasi-norm is employed to describe the sparsity of the noise, thereby addressing the first limitation. A morphological component analysis framework with the stationary framelet transform is adopted to decompose the processed image into cartoon, texture, and noise parts, resolving the second limitation. In this framework, stationary framelet regularizations with different parameters control the restoration of the cartoon and texture parts, so that the two parts are recovered separately and mutual interference is avoided. The alternating direction method of multipliers (ADMM) is then employed to solve the proposed model. Finally, experiments are conducted to verify the proposed method and compare it with several state-of-the-art denoising methods. The experimental results show that the proposed method can remove salt and pepper noise while preserving the details of the processed image.
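
Two ingredients from the abstract can be sketched directly: the noise-location detection (extreme gray levels as candidates) and an approximate proximal step for the Lp quasi-norm via iteratively reweighted soft-thresholding. This is a fragment under those assumptions, not the full framelet/MCA/ADMM pipeline.

```python
import numpy as np

def detect_salt_pepper(img):
    """Candidate noise mask: pixels stuck at the extreme gray levels."""
    return (img == 0) | (img == 255)

def prox_lp(v, lam, p=0.5, iters=10):
    """Approximate prox of lam * |x|^p (0 < p < 1), element-wise, via
    iteratively reweighted soft-thresholding."""
    x = v.copy()
    for _ in range(iters):
        w = lam * p * (np.abs(x) + 1e-8) ** (p - 1)   # local L1 weight
        x = np.sign(v) * np.maximum(np.abs(v) - w, 0.0)
    return x

noisy = np.random.randint(0, 256, (8, 8))
noisy[0, 0], noisy[1, 1] = 0, 255                # inject salt and pepper
print(detect_salt_pepper(noisy).sum())           # number of candidate pixels
print(prox_lp(np.array([-2.0, 0.1, 3.0]), lam=0.5))  # small entry is zeroed
```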

【22】 Rheumatoid Arthritis: Automated Scoring of Radiographic Joint Damage Link: https://arxiv.org/abs/2110.08812

Authors: Yan Ming Tan, Raphael Quek Hao Chong, Carol Anne Hargreaves Affiliations: Department of Statistics and Data Science, National University of Singapore; Department of Electrical & Computer Engineering, National University of Singapore, Singapore Abstract: Rheumatoid arthritis is an autoimmune disease that causes joint damage due to inflammation of the soft tissue lining the joints, known as the synovium. It is vital to identify joint damage as soon as possible in order to provide the necessary treatment early and prevent further damage to the bone structures. Radiographs are often used to assess the extent of joint damage, but scoring joint damage from a radiograph currently takes expertise, effort, and time. Joint damage associated with rheumatoid arthritis is also not quantified in clinical practice, where subjective descriptors are used instead. In this work, we describe a pipeline of deep learning models that automatically identifies and scores rheumatoid arthritic joint damage from a radiographic image. Our automatic tool was shown to produce scores with extremely high balanced accuracy within a couple of minutes, and utilizing it would remove the subjectivity of scores between human reviewers.
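
The two-stage pipeline structure (detect joints, then score each crop) can be sketched with placeholders; the detector below is a stub returning fixed boxes and the scorer is a toy network, so nothing here reflects the paper's actual models.

```python
import torch
import torch.nn as nn

# Placeholder stages; the paper chains dedicated deep models for each step.
detect_joints = lambda xray: [(10, 10, 42, 42), (60, 20, 92, 52)]  # stub boxes
scorer = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 1))        # toy scorer

def score_radiograph(xray):
    """Detect joints on a radiograph, then score damage for each joint crop."""
    scores = []
    for x0, y0, x1, y1 in detect_joints(xray):
        crop = xray[:, :, y0:y1, x0:x1]       # fixed 32x32 joint patch here
        scores.append(scorer(crop).item())    # per-joint damage score
    return scores

xray = torch.rand(1, 1, 128, 128)
print(score_radiograph(xray))
```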

【23】 Deep Image Debanding Link: https://arxiv.org/abs/2110.08569

Authors: Raymond Zhou, Shahrukh Athar, Zhongling Wang, Zhou Wang Affiliations: Department of Electrical & Computer Engineering, University of Waterloo, Canada Note: 5 pages, 4 figures, 5 tables Abstract: Banding, or false contouring, is an annoying visual artifact whose impact is even more pronounced in ultra-high-definition, high-dynamic-range, and wide-colour-gamut visual content, which is becoming increasingly popular. Since users associate a heightened expectation of quality with such content, and banding leads to a deteriorated visual quality-of-experience, banding removal, or debanding, has taken on paramount importance. Existing debanding approaches are mostly knowledge-driven. Despite the widespread success of deep learning in other areas of image processing and computer vision, data-driven debanding approaches are surprisingly absent. In this work, we make one of the first attempts to develop a deep-learning-based banding artifact removal method for images, which we name the deep debanding network (deepDeband). For its training, we construct a large-scale dataset of 51,490 pairs of corresponding pristine and banded image patches. Performance evaluation shows that deepDeband greatly reduces banding artifacts in images, outperforming existing methods both quantitatively and visually.
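
Training on pristine/banded patch pairs is standard supervised image-to-image learning; below is a minimal sketch with a stand-in CNN (deepDeband's actual architecture is not given in the abstract, so the network here is purely illustrative).

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 3, 3, padding=1))  # stand-in debanding CNN
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

def train_step(banded, pristine):
    """One supervised step: map a banded patch to its pristine counterpart."""
    loss = nn.functional.l1_loss(net(banded), pristine)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Stand-in pair; the paper assembles 51,490 real patch pairs for training.
banded, pristine = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)
print(train_step(banded, pristine))
```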

Machine translation, for reference only.
