
Computer Vision and Pattern Recognition arXiv Digest [12.20]

Author: 公众号-arXiv每日学术速递 (arXiv Daily Digest)
Published: 2021-12-24 08:48:18

cs.CV: 70 papers today.

Transformer (5 papers)

[1] Efficient Visual Tracking with Exemplar Transformers
Link: https://arxiv.org/abs/2112.09686

Authors: Philippe Blatter, Menelaos Kanakis, Martin Danelljan, Luc Van Gool
Affiliation: ETH Zürich
Note: Main paper: 8 pages, 5 figures, 5 tables; supplementary: 4 pages, 3 figures, 3 tables
Abstract: The design of more complex and powerful neural network models has significantly advanced the state-of-the-art in visual object tracking. These advances can be attributed to deeper networks, or to the introduction of new building blocks, such as transformers. However, in the pursuit of increased tracking performance, efficient tracking architectures have received surprisingly little attention. In this paper, we introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking. E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU. This is up to 8 times faster than other transformer-based models, making it the only real-time transformer-based tracker. When compared to lightweight trackers that can operate in real-time on standard CPUs, E.T.Track consistently outperforms all other methods on the LaSOT, OTB-100, NFS, TrackingNet and VOT-ST2020 datasets. The code will soon be released on https://github.com/visionml/pytracking.

[2] Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers
Link: https://arxiv.org/abs/2112.09685

Authors: Yusra Alkendi, Rana Azzam, Abdulla Ayyad, Sajid Javed, Lakmal Seneviratne, Yahya Zweiri
Abstract: Neuromorphic vision is a bio-inspired technology that has triggered a paradigm shift in the computer-vision community and is serving as a key enabler for a multitude of applications. This technology has offered significant advantages including reduced power consumption, reduced processing needs, and communication speed-ups. However, neuromorphic cameras suffer from significant amounts of measurement noise. This noise deteriorates the performance of neuromorphic event-based perception and navigation algorithms. In this paper, we propose a novel noise filtration algorithm to eliminate events which do not represent real log-intensity variations in the observed scene. We employ a Graph Neural Network (GNN)-driven transformer algorithm, called GNN-Transformer, to classify every active event pixel in the raw stream into real log-intensity variation or noise. Within the GNN, a message-passing framework, called EventConv, is carried out to reflect the spatiotemporal correlation among the events, while preserving their asynchronous nature. We also introduce the Known-object Ground-Truth Labeling (KoGTL) approach for generating approximate ground truth labels of event streams under various illumination conditions. KoGTL is used to generate labeled datasets, from experiments recorded in challenging lighting conditions. These datasets are used to train and extensively test our proposed algorithm. When tested on unseen datasets, the proposed algorithm outperforms existing methods by 12% in terms of filtration accuracy. Additional tests are also conducted on publicly available datasets to demonstrate the generalization capabilities of the proposed algorithm in the presence of illumination variations and different motion dynamics. Compared to existing solutions, qualitative results verified the superior capability of the proposed algorithm to eliminate noise while preserving meaningful scene events.

[3] SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers
Link: https://arxiv.org/abs/2112.09426

Authors: Lin Liu, Shanxin Yuan, Jianzhuang Liu, Xin Guo, Youliang Yan, Qi Tian
Affiliations: EEIS Department, University of Science and Technology of China; Huawei Noah's Ark Lab; Huawei Cloud BU
Abstract: We propose a novel zero-shot multi-frame image restoration method for removing unwanted obstruction elements (such as rain, snow, and moire patterns) that vary in successive frames. It has three stages: transformer pre-training, zero-shot restoration, and hard patch refinement. Using the pre-trained transformers, our model is able to tell the motion difference between the true image information and the obstructing elements. For zero-shot image restoration, we design a novel model, termed SiamTrans, which is constructed from Siamese transformers, encoders, and decoders. Each transformer has a temporal attention layer and several self-attention layers, to capture both temporal and spatial information of multiple frames. Only pre-trained (self-supervised) on the denoising task, SiamTrans is tested on three different low-level vision tasks (deraining, demoireing, and desnowing). Compared with related methods, ours achieves the best performance, even outperforming those with supervised learning.

[4] Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction
Link: https://arxiv.org/abs/2112.09385

Authors: Guangyan Chen, Meiling Wang, Yufeng Yue, Qingxiang Zhang, Li Yuan
Affiliations: Beijing Institute of Technology; National University of Singapore
Note: 10 pages, 7 figures
Abstract: Recent Transformer-based methods have achieved advanced performance in point cloud registration by utilizing the advantages of the Transformer in order-invariance and modeling dependency to aggregate information. However, they still suffer from indistinct feature extraction, sensitivity to noise, and outliers. The reasons are: (1) the adoption of CNNs fails to model global relations due to their local receptive fields, resulting in extracted features susceptible to noise; (2) the shallow-wide architecture of Transformers and the lack of positional encoding lead to indistinct feature extraction due to inefficient information interaction; (3) the omission of geometrical compatibility leads to inaccurate classification between inliers and outliers. To address the above limitations, a novel full Transformer network for point cloud registration is proposed, named the Deep Interaction Transformer (DIT), which incorporates: (1) a Point Cloud Structure Extractor (PSE) to model global relations and retrieve structural information with Transformer encoders; (2) a deep-narrow Point Feature Transformer (PFT) to facilitate deep information interaction across two point clouds with positional encoding, such that Transformers can establish comprehensive associations and directly learn the relative position between points; (3) a Geometric Matching-based Correspondence Confidence Evaluation (GMCCE) method to measure spatial consistency and estimate inlier confidence by designing a triangulated descriptor. Extensive experiments on clean, noisy, and partially overlapping point cloud registration demonstrate that our method outperforms state-of-the-art methods.

[5] Towards End-to-End Image Compression and Analysis with Transformers
Link: https://arxiv.org/abs/2112.09300

Authors: Yuanchao Bai, Xu Yang, Xianming Liu, Junjun Jiang, Yaowei Wang, Xiangyang Ji, Wen Gao
Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory; Tsinghua University; Peking University
Note: Accepted by AAAI 2022; code: this https URL
Abstract: We propose an end-to-end image compression and analysis model with Transformers, targeting the cloud-based image classification application. Instead of placing an existing Transformer-based image classification model directly after an image codec, we aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer. Specifically, we first replace the patchify stem (i.e., image splitting and embedding) of the ViT model with a lightweight image encoder modelled by a convolutional neural network. The compressed features generated by the image encoder are injected with convolutional inductive bias and are fed to the Transformer for image classification, bypassing image reconstruction. Meanwhile, we propose a feature aggregation module to fuse the compressed features with selected intermediate features of the Transformer, and feed the aggregated features to a deconvolutional neural network for image reconstruction. The aggregated features can obtain the long-term information from the self-attention mechanism of the Transformer and improve the compression performance. The rate-distortion-accuracy optimization problem is finally solved by a two-step training strategy. Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.

Detection (2 papers)

[1] AFDetV2: Rethinking the Necessity of the Second Stage for Object Detection from Point Clouds
Link: https://arxiv.org/abs/2112.09205

Authors: Yihan Hu, Zhuangzhuang Ding, Runzhou Ge, Wenxin Shao, Li Huang, Kun Li, Qiang Liu
Affiliation: Horizon Robotics
Note: AAAI 2022; 1st place solution for the Real-time 3D Detection track and the Most Efficient Model of the Waymo Open Dataset Challenges 2021 (this http URL)
Abstract: There have been two streams in 3D detection from point clouds: single-stage methods and two-stage methods. While the former is more computationally efficient, the latter usually provides better detection accuracy. By carefully examining the two-stage approaches, we have found that if appropriately designed, the first stage can produce accurate box regression. In this scenario, the second stage mainly rescores the boxes such that the boxes with better localization get selected. From this observation, we have devised a single-stage anchor-free network that can fulfill these requirements. This network, named AFDetV2, extends the previous work by incorporating a self-calibrated convolution block in the backbone, a keypoint auxiliary supervision, and an IoU prediction branch in the multi-task head. As a result, the detection accuracy is drastically boosted in the single stage. To evaluate our approach, we have conducted extensive experiments on the Waymo Open Dataset and the nuScenes Dataset. We have observed that our AFDetV2 achieves state-of-the-art results on these two datasets, superior to all the prior arts, including both single-stage and two-stage 3D detectors. AFDetV2 won 1st place in the Real-Time 3D Detection track of the Waymo Open Dataset Challenge 2021. In addition, a variant of our model, AFDetV2-Base, was entitled the "Most Efficient Model" by the challenge sponsor, showing superior computational efficiency. To demonstrate the generality of this single-stage method, we have also applied it to the first stage of two-stage networks. Without exception, the results show that with the strengthened backbone and the rescoring approach, the second-stage refinement is no longer needed.
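For a concrete picture of how an IoU prediction branch can replace second-stage rescoring, below is a minimal sketch of one common rescoring rule applied before NMS, so that better-localized boxes survive suppression. The geometric-interpolation form and the exponent value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rescore_boxes(cls_scores, iou_preds, beta=0.68):
    """Fuse classification confidence with predicted localization IoU.

    cls_scores: (N,) classification confidences in [0, 1]
    iou_preds:  (N,) predicted IoU between each box and the ground truth
    beta:       trade-off exponent (hypothetical value; the paper tunes its own)
    """
    iou_preds = np.clip(iou_preds, 0.0, 1.0)
    # Geometric interpolation: boxes with good predicted localization are
    # promoted, so NMS keeps the better-localized candidates.
    return cls_scores ** (1.0 - beta) * iou_preds ** beta
```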

[2] ALEBk: Feasibility Study of Attention Level Estimation via Blink Detection applied to e-Learning
Link: https://arxiv.org/abs/2112.09165

Authors: Roberto Daza, Daniel DeAlcala, Aythami Morales, Ruben Tolosana, Ruth Cobos, Julian Fierrez
Affiliation: School of Engineering, Autonomous University of Madrid
Note: Preprint of the paper presented at the Workshop on Artificial Intelligence for Education (AI4EDU) of AAAI 2022
Abstract: This work presents a feasibility study of remote attention level estimation based on eye blink frequency. We first propose an eye blink detection system based on Convolutional Neural Networks (CNNs), very competitive with respect to related works. Using this detector, we experimentally evaluate the relationship between the eye blink rate and the attention level of students captured during online sessions. The experimental framework is carried out using a public multimodal database for eye blink detection and attention level estimation called mEBAL, which comprises data from 38 students and multiple acquisition sensors, in particular: i) an electroencephalogram (EEG) band which provides the time signals coming from the student's cognitive information, and ii) RGB and NIR cameras to capture the students' face gestures. The results achieved suggest an inverse correlation between the eye blink frequency and the attention level. This relation is used in our proposed method, called ALEBk, for estimating the attention level as the inverse of the eye blink frequency. Our results open a new research line to introduce this technology for attention level estimation on future e-learning platforms, among other applications of this kind of behavioral biometrics based on face analysis.
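The core scoring rule, attention as the inverse of blink frequency, is simple enough to show directly. A minimal sketch, where the window length, the smoothing constant, and the units are assumptions:

```python
def attention_level(blink_timestamps, window_sec=60.0, eps=1e-6):
    """ALEBk-style score: attention modeled as the inverse of the
    eye-blink frequency observed in a time window (hypothetical units)."""
    blinks_per_sec = len(blink_timestamps) / window_sec
    return 1.0 / (blinks_per_sec + eps)  # eps avoids division by zero
```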

Classification | Recognition (7 papers)

[1] Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition
Link: https://arxiv.org/abs/2112.09690

Authors: Yinghao Xu, Fangyun Wei, Xiao Sun, Ceyuan Yang, Yujun Shen, Bo Dai, Bolei Zhou, Stephen Lin
Affiliations: The Chinese University of Hong Kong; S-Lab, Nanyang Technological University; Microsoft Research Asia
Note: Project webpage: this https URL
Abstract: Semi-supervised action recognition is a challenging but important task due to the high cost of data annotation. A common approach to this problem is to assign unlabeled data pseudo-labels, which are then used as additional supervision in training. Typically in recent work, the pseudo-labels are obtained by training a model on the labeled data, and then using confident predictions from the model to teach itself. In this work, we propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL). Concretely, we introduce a lightweight auxiliary network in addition to the primary backbone, and ask them to predict pseudo-labels for each other. We observe that, due to their different structural biases, these two models tend to learn complementary representations from the same video clips. Each model can thus benefit from its counterpart by utilizing cross-model predictions as supervision. Experiments on different data partition protocols demonstrate the significant improvement of our framework over existing alternatives. For example, CMPL achieves 17.6% and 25.1% Top-1 accuracy on Kinetics-400 and UCF-101 using only the RGB modality and 1% labeled data, outperforming our baseline model, FixMatch, by 9.0% and 10.3%, respectively.
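A rough sketch of one cross-supervision training step implied by the abstract: each network produces pseudo-labels for the other, gated by a FixMatch-style confidence threshold. All names and the threshold value are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def cmpl_step(primary, auxiliary, weak_clip, strong_clip, tau=0.95):
    """One unlabeled-batch step of Cross-Model Pseudo-Labeling (sketch).

    Each network is supervised by the confident predictions of the *other*
    network, so the two structural biases can complement each other.
    tau is a FixMatch-style confidence threshold (assumed value).
    """
    with torch.no_grad():
        p_primary = F.softmax(primary(weak_clip), dim=1)
        p_auxiliary = F.softmax(auxiliary(weak_clip), dim=1)

    def pseudo_ce(teacher_probs, student_logits):
        conf, label = teacher_probs.max(dim=1)
        mask = (conf >= tau).float()  # keep only confident pseudo-labels
        loss = F.cross_entropy(student_logits, label, reduction="none")
        return (mask * loss).mean()

    # Cross supervision: the auxiliary net teaches the primary and vice versa.
    loss_primary = pseudo_ce(p_auxiliary, primary(strong_clip))
    loss_auxiliary = pseudo_ce(p_primary, auxiliary(strong_clip))
    return loss_primary + loss_auxiliary
```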

[2] Pixel Distillation: A New Knowledge Distillation Scheme for Low-Resolution Image Recognition
Link: https://arxiv.org/abs/2112.09532

Authors: Guangyu Guo, Longfei Han, Junwei Han, Dingwen Zhang
Affiliations: The Brain and Artificial Intelligence Laboratory, Northwestern Polytechnical University, Xi'an, China; Beijing Technology and Business University, Beijing, China
Abstract: The great success of deep learning is mainly due to large-scale network architectures and high-quality training data. However, it is still challenging to deploy recent deep models on portable devices with limited memory and imaging ability. Some existing works have sought to compress the model via knowledge distillation. Unfortunately, these methods cannot deal with images of reduced quality, such as low-resolution (LR) images. To this end, we make a pioneering effort to distill helpful knowledge from a heavy network model learned from high-resolution (HR) images to a compact network model that will handle LR images, thus advancing the current knowledge distillation technique with the novel pixel distillation. To achieve this goal, we propose a Teacher-Assistant-Student (TAS) framework, which disentangles knowledge distillation into a model compression stage and a high-resolution representation transfer stage. By equipping a novel Feature Super Resolution (FSR) module, our approach can learn a lightweight network model that achieves similar accuracy to the heavy teacher model but with much fewer parameters, faster inference speed, and lower-resolution inputs. Comprehensive experiments on three widely-used benchmarks, i.e., CUB-200-2011, PASCAL VOC 2007, and ImageNetSub, demonstrate the effectiveness of our approach.
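Pixel distillation builds on generic knowledge distillation. As background, here is a minimal sketch of the standard temperature-scaled distillation loss (Hinton et al.); the paper's FSR module and pixel-level specifics are not reproduced, and T and alpha are assumed values.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Temperature-scaled knowledge distillation (Hinton et al.), shown as
    the generic ingredient; the paper's pixel/FSR specifics differ."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients back to the hard-loss magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```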

[3] Distillation of Human-Object Interaction Contexts for Action Recognition
Link: https://arxiv.org/abs/2112.09448

Authors: Muna Almushyti, Frederick W. Li
Affiliation: Durham University, UK
Abstract: Modeling spatial-temporal relations is imperative for recognizing human actions, especially when a human is interacting with objects, while multiple objects appear around the human differently over time. Most existing action recognition models focus on learning the overall visual cues of a scene but disregard informative fine-grained features, which can be captured by learning human-object relationships and interactions. In this paper, we learn human-object relationships by exploiting the interaction of their local and global contexts. We hence propose the Global-Local Interaction Distillation Network (GLIDN), learning human and object interactions through space and time via knowledge distillation for fine-grained scene understanding. GLIDN encodes humans and objects into graph nodes and learns local and global relations via a graph attention network. The local context graphs learn the relation between humans and objects at the frame level by capturing their co-occurrence at a specific time step. The global relation graph is constructed based on the video-level human and object interactions, identifying their long-term relations throughout a video sequence. More importantly, we investigate how knowledge from these graphs can be distilled to their counterparts for improving human-object interaction (HOI) recognition. We evaluate our model by conducting comprehensive experiments on two datasets, Charades and CAD-120. We have achieved better results than the baselines and counterpart approaches.

[4] Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation
Link: https://arxiv.org/abs/2112.09445

Authors: Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez
Affiliations: Meta Reality Labs; UC Berkeley
Note: 19 pages, 6 figures
Abstract: Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision that provides finer descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use the InfoNCE loss to train a model to predict the pairing between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. The inefficiency can be partially attributed to the fact that the image-text pairs are noisy. To address this, we propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning. Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs. Compared with the InfoNCE loss, label smoothing, and knowledge distillation, OTTER consistently outperforms these baselines in zero-shot evaluation on Google Open Images (19,958 classes) and multi-labeled ImageNet 10K (10,032 classes) from Tencent ML-Images. Over 42 evaluations on 7 different dataset/architecture settings x 6 metrics, OTTER outperforms (32) or ties (2) all baselines in 34 of them.
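The core ingredient, entropic optimal transport over a batch similarity matrix, can be computed with a few Sinkhorn iterations. A minimal sketch follows; the regularization strength and iteration count are assumptions, and OTTER's actual online variant may differ.

```python
import torch

def sinkhorn(sim, eps=0.05, n_iters=5):
    """Entropic optimal transport over a batch similarity matrix (sketch).

    sim: (B, B) image-text cosine similarities within a batch.
    Returns a transport plan (row-normalized) usable as soft targets for
    the contrastive loss. eps and n_iters are assumed hyper-parameters.
    """
    K = torch.exp(sim / eps)  # Gibbs kernel
    u = torch.ones(K.size(0), device=K.device)
    v = torch.ones(K.size(1), device=K.device)
    for _ in range(n_iters):  # alternating scaling to match the marginals
        u = 1.0 / (K @ v)
        v = 1.0 / (K.t() @ u)
    plan = u[:, None] * K * v[None, :]
    return plan / plan.sum(dim=1, keepdim=True)  # rows as soft labels
```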

[5] Self-attention based anchor proposal for skeleton-based action recognition
Link: https://arxiv.org/abs/2112.09413

Authors: Ruijie Hou, Zhao Wang
Affiliations: Zhejiang University; Bournemouth University
Abstract: Skeleton sequences are widely used for action recognition tasks due to their lightweight and compact characteristics. Recent graph convolutional network (GCN) approaches have achieved great success in skeleton-based action recognition thanks to their ability to model non-Euclidean data. GCNs can exploit short-range joint dependencies but fail to directly model the distant joint relations that are vital for distinguishing various actions. Thus, many GCN approaches employ hierarchical mechanisms to aggregate wider-range neighborhood information. We propose a novel self-attention based skeleton-anchor proposal (SAP) module to comprehensively model the internal relations of a human body for motion feature learning. The proposed SAP module explores the inherent relationships within the human body using a triplet representation that encodes high-order angle information, rather than the fixed pair-wise bone connections used in existing hierarchical GCN approaches. A self-attention based anchor selection method is designed in the SAP module for extracting the root point of the encoded angular information. By coupling the proposed SAP module with popular spatial-temporal graph neural networks, e.g. MSG3D, it achieves new state-of-the-art accuracy on challenging benchmark datasets. Further ablation studies have shown the effectiveness of our proposed SAP module, which noticeably improves the performance of many popular skeleton-based action recognition methods.

[6] Unified 2D and 3D Pre-training for Medical Image classification and Segmentation
Link: https://arxiv.org/abs/2112.09356

Authors: Yutong Xie, Jianpeng Zhang, Yong Xia, Qi Wu
Affiliations: Australian Institute for Machine Learning, The University of Adelaide, Australia; School of Computer Science and Engineering, Northwestern Polytechnical University, China
Note: 13 pages
Abstract: Self-supervised learning (SSL) opens up huge opportunities for better utilizing unlabeled data. It is essential for medical image analysis, which is generally known for its lack of annotations. However, when we attempt to use as many unlabeled medical images as possible in SSL, breaking the dimension barrier (i.e., making it possible to jointly use both 2D and 3D images) becomes a must. In this paper, we propose a Universal Self-Supervised Transformer (USST) framework based on the student-teacher paradigm, aiming to leverage a huge amount of unlabeled medical data of multiple dimensions to learn rich representations. To achieve this, we design a Pyramid Transformer U-Net (PTU) as the backbone, which is composed of switchable patch embedding (SPE) layers and Transformer layers. The SPE layer switches to either 2D or 3D patch embedding depending on the input dimension. After that, the images are converted to a sequence regardless of their original dimensions. The Transformer layer then models the long-term dependencies in a sequence-to-sequence manner, thus enabling USST to learn representations from both 2D and 3D images. USST has two obvious merits compared to current dimension-specific SSL: (1) more effective - it can learn representations from more and diverse data; and (2) more versatile - it can be transferred to various downstream tasks. The results show that USST provides promising results on six 2D/3D medical image classification and segmentation tasks, substantially outperforming supervised ImageNet pre-training and advanced SSL counterparts.

[7] An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification
Link: https://arxiv.org/abs/2112.09172

Authors: Lam Pham, Dat Ngo, Phu X. Nguyen, Truong Hoang, Alexander Schindler
Affiliations: Austrian Institute of Technology; School of Computer Science and Electronic Engineering, University of Essex; Department of Computer Fundamentals, FPT University
Abstract: This paper presents a task of audio-visual scene classification (SC) in which input videos are classified into one of five real-life crowded scenes: 'Riot', 'Noise-Street', 'Firework-Event', 'Music-Event', and 'Sport-Atmosphere'. To this end, we first collect an audio-visual dataset (videos) of these five crowded contexts from Youtube (in-the-wild scenes). Then, a wide range of deep learning frameworks are proposed to deploy either audio or visual input data independently. Finally, results obtained from the high-performing deep learning frameworks are fused to achieve the best accuracy score. Our experimental results indicate that audio and visual input factors independently contribute to the SC task's performance. Significantly, an ensemble of deep learning frameworks exploring either audio or visual input data can achieve the best accuracy of 95.7%.
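The final ensembling step can be as simple as a weighted average of the per-class probabilities from the audio and visual models. A minimal sketch; the uniform weighting is an assumption, and the paper fuses several high-performing frameworks rather than exactly two.

```python
import numpy as np

def late_fusion(audio_probs, visual_probs, w=0.5):
    """Weighted average of per-class probabilities from an audio model and a
    visual model (one simple ensembling choice).

    audio_probs, visual_probs: (num_clips, num_classes) softmax outputs.
    """
    fused = w * audio_probs + (1.0 - w) * visual_probs
    return fused.argmax(axis=-1)  # predicted scene class per clip
```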

Segmentation | Semantics (6 papers)

[1] Local contrastive loss with pseudo-label based self-training for semi-supervised medical image segmentation
Link: https://arxiv.org/abs/2112.09645

Authors: Krishna Chaitanya, Ertunc Erdil, Neerav Karani, Ender Konukoglu
Note: 13 pages, 4 figures, 7 tables; under review at a journal
Abstract: Supervised deep learning-based methods yield accurate results for medical image segmentation. However, they require large labeled datasets, and obtaining them is a laborious task that requires clinical expertise. Semi/self-supervised learning-based approaches address this limitation by exploiting unlabeled data along with limited annotated data. Recent self-supervised learning methods use contrastive loss to learn good global-level representations from unlabeled images and achieve high performance in classification tasks on popular natural image datasets like ImageNet. In pixel-level prediction tasks such as segmentation, it is crucial to also learn good local-level representations along with global representations to achieve better accuracy. However, the impact of existing local contrastive loss-based methods remains limited for learning good local representations, because similar and dissimilar local regions are defined based on random augmentations and spatial proximity, not based on the semantic label of local regions, due to the lack of large-scale expert annotations in the semi/self-supervised setting. In this paper, we propose a local contrastive loss to learn good pixel-level features useful for segmentation by exploiting semantic label information obtained from pseudo-labels of unlabeled images alongside limited annotated images. In particular, we define the proposed loss to encourage similar representations for pixels that have the same pseudo-label/label while being dissimilar to the representations of pixels with different pseudo-labels/labels in the dataset. We perform pseudo-label based self-training and train the network by jointly optimizing the proposed contrastive loss on both labeled and unlabeled sets and the segmentation loss on only the limited labeled set. We evaluate our method on three public cardiac and prostate datasets, and obtain high segmentation performance.
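A minimal sketch of a pixel-level contrastive loss driven by (pseudo-)labels, in the spirit of the abstract: sampled pixel features with the same label attract, others repel. The sampling strategy and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(feats, labels, temperature=0.1):
    """Supervised-contrastive loss over sampled pixel features (sketch).

    feats:  (N, C) features of N sampled pixels
    labels: (N,) pseudo-labels / labels of those pixels
    """
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t() / temperature  # (N, N) pairwise similarities
    mask_pos = (labels[:, None] == labels[None, :]).float()
    mask_pos.fill_diagonal_(0)  # exclude self-pairs from positives

    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # stability
    exp = torch.exp(logits)
    not_self = 1 - torch.eye(len(feats), device=feats.device)
    log_prob = logits - torch.log((exp * not_self).sum(1) + 1e-8)
    pos_count = mask_pos.sum(1).clamp(min=1)
    # Average log-probability of positive pairs, negated
    return -(mask_pos * log_prob).sum(1).div(pos_count).mean()
```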

[2] Methods for segmenting cracks in 3d images of concrete: A comparison based on semi-synthetic images
Link: https://arxiv.org/abs/2112.09493

Authors: Tin Barisin, Christian Jung, Franziska Müsebeck, Claudia Redenbach, Katja Schladitz
Affiliations: Fraunhofer-Institut für Techno- und Wirtschaftsmathematik, Kaiserslautern, Germany; Technische Universität Kaiserslautern, Kaiserslautern, Germany
Abstract: Concrete is the standard construction material for buildings, bridges, and roads. As safety plays a central role in the design, monitoring, and maintenance of such constructions, it is important to understand the cracking behavior of concrete. Computed tomography captures the microstructure of building materials and allows to study crack initiation and propagation. Manual segmentation of crack surfaces in large 3d images is not feasible. In this paper, automatic crack segmentation methods for 3d images are reviewed and compared. Classical image processing methods (edge detection filters, template matching, minimal path and region growing algorithms) and learning methods (convolutional neural networks, random forests) are considered and tested on semi-synthetic 3d images. Their performance strongly depends on parameter selection, which should be adapted to the grayvalue distribution of the images and the geometric properties of the concrete. In general, the learning methods perform best, in particular for thin cracks and low grayvalue contrast.

[3] Weakly Supervised Semantic Segmentation via Alternative Self-Dual Teaching
Link: https://arxiv.org/abs/2112.09459

Authors: Dingwen Zhang, Wenyuan Zeng, Guangyu Guo, Chaowei Fang, Lechao Cheng, Junwei Han
Affiliations: The Brain and Artificial Intelligence Laboratory, Northwestern Polytechnical University, Xi'an, China; Xidian University, Xi'an, China; Zhejiang Lab, Hangzhou, China
Abstract: Current weakly supervised semantic segmentation (WSSS) frameworks usually contain a separated mask-refinement model and a main semantic region mining model. These approaches contain redundant feature extraction backbones and biased learning objectives, making them computationally complex yet sub-optimal for addressing the WSSS task. To solve this problem, this paper establishes a compact learning framework that embeds the classification and mask-refinement components into a unified deep model. With the shared feature extraction backbone, our model is able to facilitate knowledge sharing between the two components while preserving a low computational complexity. To encourage high-quality knowledge interaction, we propose a novel alternative self-dual teaching (ASDT) mechanism. Unlike the conventional distillation strategy, the knowledge of the two teacher branches in our model is alternatively distilled to the student branch by Pulse Width Modulation (PWM), which generates a PW wave-like selection signal to guide the knowledge distillation process. In this way, the student branch can help prevent the model from falling into local-minimum solutions caused by the imperfect knowledge provided by either teacher branch. Comprehensive experiments on PASCAL VOC 2012 and COCO-Stuff 10K demonstrate the effectiveness of the proposed alternative self-dual teaching mechanism as well as the new state-of-the-art performance of our approach.

[4] Semantic-Based Few-Shot Learning by Interactive Psychometric Testing
Link: https://arxiv.org/abs/2112.09201

Authors: Lu Yin, Vlado Menkovski, Yulong Pei, Mykola Pechenizkiy
Affiliation: Eindhoven University of Technology, Eindhoven, Netherlands
Note: Accepted by the AAAI 2022 Workshop on Interactive Machine Learning (IML@AAAI22)
Abstract: Few-shot classification tasks aim to classify images in a query set based on only a few labeled examples in a support set. Most studies usually assume that each image in a task has a single and unique class association. Under these assumptions, these algorithms may not be able to identify the proper class assignment when there is no exact match between support and query classes; for example, classifying a tiger given a few images of lions, bikes, and apples. However, in a more general setting, we could consider the higher-level concept of large carnivores to match the tiger to the lion for semantic classification. Existing studies rarely consider this situation due to the incompatibility of label-based supervision with complex conception relationships. In this work, we advance few-shot learning towards this more challenging scenario, semantic-based few-shot learning, and propose a method to address the paradigm by capturing the inner semantic relationships using interactive psychometric learning. We evaluate our method on the CIFAR-100 dataset. The results show the merits of our proposed method.

[5] FastSurferVINN: Building Resolution-Independence into Deep Learning Segmentation Methods -- A Solution for HighRes Brain MRI
Link: https://arxiv.org/abs/2112.09654

Authors: Leonie Henschel, David Kügler, Martin Reuter
Affiliations: German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany; A.A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Boston, MA, USA; Department of Radiology, Harvard Medical School, Boston, MA, USA
Note: Under review at NeuroImage
Abstract: Leading neuroimaging studies have pushed 3T MRI acquisition resolutions below 1.0 mm for improved structure definition and morphometry. Yet, only a few, time-intensive automated image analysis pipelines have been validated for high-resolution (HiRes) settings. Efficient deep learning approaches, on the other hand, rarely support more than one fixed resolution (usually 1.0 mm). Furthermore, the lack of a standard submillimeter resolution as well as the limited availability of diverse HiRes data with sufficient coverage of scanner, age, diseases, or genetic variance poses additional, unsolved challenges for training HiRes networks. Incorporating resolution-independence into deep learning-based segmentation, i.e., the ability to segment images at their native resolution across a range of different voxel sizes, promises to overcome these challenges, yet no such approach currently exists. We now fill this gap by introducing a Voxelsize Independent Neural Network (VINN) for resolution-independent segmentation tasks and present FastSurferVINN, which (i) establishes and implements resolution-independence for deep learning as the first method simultaneously supporting 0.7-1.0 mm whole brain segmentation, (ii) significantly outperforms state-of-the-art methods across resolutions, and (iii) mitigates the data imbalance problem present in HiRes datasets. Overall, internal resolution-independence mutually benefits both HiRes and 1.0 mm MRI segmentation. With our rigorously validated FastSurferVINN we distribute a rapid tool for morphometric neuroimage analysis. The VINN architecture, furthermore, represents an efficient resolution-independent segmentation method for wider application.

[6] ASC-Net: Unsupervised Medical Anomaly Segmentation Using an Adversarial-based Selective Cutting Network
Link: https://arxiv.org/abs/2112.09135

Authors: Raunak Dey, Wenbo Sun, Haibo Xu, Yi Hong
Affiliations: Department of Computer Science, University of Georgia; Department of Radiology, Zhongnan Hospital of Wuhan University; Department of Computer Science and Engineering, Shanghai Jiao Tong University
Note: Currently in submission to the Medical Image Analysis journal; extension of DOI 10.1007/978-3-030-87240-3_23 with more details, experiments, and in-depth analysis (arXiv admin note: substantial text overlap with arXiv:2103.03664)
Abstract: In this paper we consider the problem of unsupervised anomaly segmentation in medical images, which has attracted increasing attention in recent years due to the expensive pixel-level annotations from experts and the existence of a large amount of unannotated normal and abnormal image scans. We introduce a segmentation network that utilizes adversarial learning to partition an image into two cuts, with one of them falling into a reference distribution provided by the user. This Adversarial-based Selective Cutting network (ASC-Net) bridges the two domains of cluster-based deep segmentation and adversarial-based anomaly/novelty detection algorithms. Our ASC-Net learns from normal and abnormal medical scans to segment anomalies in medical scans without any masks for supervision. We evaluate this unsupervised anomaly segmentation model on three public datasets, i.e., BraTS 2019 for brain tumor segmentation, LiTS for liver lesion segmentation, and MS-SEG 2015 for brain lesion segmentation, and also on a private dataset for brain tumor segmentation. Compared to existing methods, our model demonstrates tremendous performance gains in unsupervised anomaly segmentation tasks. Although there is still room to further improve performance compared to supervised learning algorithms, the promising experimental results and interesting observations shed light on building an unsupervised learning algorithm for medical anomaly identification using user-defined knowledge.

Zero/Few-Shot | Transfer | Domain Adaptation (2 papers)

[1] Domain Adaptation on Point Clouds via Geometry-Aware Implicits
Link: https://arxiv.org/abs/2112.09343

Authors: Yuefan Shen, Yanchao Yang, Mi Yan, He Wang, Youyi Zheng, Leonidas Guibas
Affiliations: Zhejiang University; Stanford University; Peking University
Abstract: As a popular geometric representation, point clouds have attracted much attention in 3D vision, leading to many applications in autonomous driving and robotics. One important yet unsolved issue for learning on point clouds is that point clouds of the same object can have significant geometric variations if generated using different procedures or captured using different sensors. These inconsistencies induce domain gaps such that neural networks trained on one domain may fail to generalize on others. A typical technique to reduce the domain gap is to perform adversarial training so that point clouds in the feature space can align. However, adversarial training easily falls into degenerate local minima, resulting in negative adaptation gains. Here we propose a simple yet effective method for unsupervised domain adaptation on point clouds by employing a self-supervised task of learning geometry-aware implicits, which plays two critical roles in one shot. First, the geometric information in the point clouds is preserved through the implicit representations for downstream tasks. More importantly, the domain-specific variations can be effectively learned away in the implicit space. We also propose an adaptive strategy to compute unsigned distance fields for arbitrary point clouds due to the lack of shape models in practice. When combined with a task loss, the proposed method outperforms state-of-the-art unsupervised domain adaptation methods that rely on adversarial domain alignment and more complicated self-supervised tasks. Our method is evaluated on both the PointDA-10 and GraspNet datasets. The code and trained models will be publicly available.
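The unsigned distance field mentioned above can be approximated for an arbitrary point cloud by nearest-neighbor distances. A minimal sketch, intended as a simple stand-in for the paper's adaptive strategy:

```python
import torch

def unsigned_distance(queries, points):
    """Unsigned distance field values at arbitrary query locations,
    approximated by the distance to the closest point-cloud sample.

    queries: (M, 3) query positions; points: (N, 3) point cloud.
    """
    d = torch.cdist(queries, points)  # (M, N) pairwise Euclidean distances
    return d.min(dim=1).values        # distance to the nearest surface sample
```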

[2] Sim2Real Docs: Domain Randomization for Documents in Natural Scenes using Ray-traced Rendering
Link: https://arxiv.org/abs/2112.09220

Authors: Nikhil Maddikunta, Huijun Zhao, Sumit Keswani, Alfy Samuel, Fu-Ming Guo, Nishan Srishankar, Vishwa Pardeshi, Austin Huang
Affiliation: Fidelity Investments, Artificial Intelligence Center of Excellence
Note: Accepted to the NeurIPS 2021 Data-Centric AI (DCAI) Workshop
Abstract: In the past, computer vision systems for digitized documents could rely on systematically captured, high-quality scans. Today, transactions involving digital documents are more likely to start as mobile phone photo uploads taken by non-professionals. As such, computer vision for document automation must now account for documents captured in natural scene contexts. An additional challenge is that task objectives for document processing can be highly use-case specific, which makes publicly-available datasets limited in their utility, while manual data labeling is also costly and poorly translates between use cases. To address these issues we created Sim2Real Docs - a framework for synthesizing datasets and performing domain randomization of documents in natural scenes. Sim2Real Docs enables programmatic 3D rendering of documents using Blender, an open source tool for 3D modeling and ray-traced rendering. By using rendering that simulates physical interactions of light, geometry, camera, and background, we synthesize datasets of documents in a natural scene context. Each render is paired with use-case specific ground truth data specifying latent characteristics of interest, producing unlimited fit-for-task training data. The role of machine learning models is then to solve the inverse problem posed by the rendering pipeline. Such models can be further iterated upon with real-world data by either fine tuning or making adjustments to domain randomization parameters.
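For orientation, this kind of Blender-side domain randomization can be driven by a few lines of bpy scripting. A heavily simplified sketch, assuming a scene that already contains a textured document plane plus the default 'Camera' and 'Light' objects; Sim2Real Docs' actual API is not shown here.

```python
import random
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'  # ray-traced rendering, as the paper uses

# Randomize camera pose and light intensity for one synthetic sample
cam = bpy.data.objects['Camera']
cam.location = (random.uniform(-0.2, 0.2),
                random.uniform(-0.2, 0.2),
                random.uniform(0.8, 1.5))  # assumed ranges, in meters

light = bpy.data.objects['Light']
light.data.energy = random.uniform(100, 1000)  # assumed wattage range

scene.render.filepath = '/tmp/doc_0001.png'
bpy.ops.render.render(write_still=True)
```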

Semi-/Weakly-/Unsupervised | Active Learning | Uncertainty (1 paper)

[1] Watermarking Images in Self-Supervised Latent Spaces
Link: https://arxiv.org/abs/2112.09581

Authors: Pierre Fernandez, Alexandre Sablayrolles, Teddy Furon, Hervé Jégou, Matthijs Douze
Affiliations: Facebook AI; Inria Rennes
Abstract: We revisit watermarking techniques based on pre-trained deep networks, in the light of self-supervised approaches. We present a way to embed both marks and binary messages into their latent spaces, leveraging data augmentation at marking time. Our method can operate at any resolution and creates watermarks robust to a broad range of transformations (rotations, crops, JPEG, contrast, etc.). It significantly outperforms previous zero-bit methods, and its performance on multi-bit watermarking is on par with state-of-the-art encoder-decoder architectures trained end-to-end for watermarking. Our implementation and models will be made publicly available.
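One way to read a multi-bit message out of a self-supervised latent vector is to take the signs of its projections onto secret carrier directions; embedding would then perturb the image (under data augmentation) until the projections carry the desired signs. The sketch below illustrates this decoding rule as an assumption consistent with the abstract, not the paper's exact construction.

```python
import torch

def decode_message(feature, carriers):
    """Read a k-bit message from a latent vector (hypothetical rule).

    feature:  (d,) feature of the image from the pre-trained network
    carriers: (k, d) secret random directions, one per message bit
    """
    proj = carriers @ feature   # (k,) projections onto the carriers
    return (proj > 0).long()    # bit = 1 where the projection is positive
```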

Temporal | Action Recognition | Pose | Video | Motion Estimation (6 papers)

[1] Deep Learning for Spatiotemporal Modeling of Urbanization
Link: https://arxiv.org/abs/2112.09668

Authors: Tang Li, Jing Gao, Xi Peng
Affiliation: University of Delaware, Newark, DE
Note: Accepted by the NeurIPS 2021 Machine Learning in Public Health (MLPH) Workshop, where it received the Best Paper Award
Abstract: Urbanization has a strong impact on the health and wellbeing of populations across the world. Predictive spatial modeling of urbanization therefore can be a useful tool for effective public health planning. Many spatial urbanization models have been developed using classic machine learning and numerical modeling techniques. However, deep learning with its proven capacity to capture complex spatiotemporal phenomena has not been applied to urbanization modeling. Here we explore the capacity of deep spatial learning for the predictive modeling of urbanization. We treat numerical geospatial data as images with pixels and channels, and enrich the dataset by augmentation, in order to leverage the high capacity of deep learning. Our resulting model can generate end-to-end multi-variable urbanization predictions, and outperforms a state-of-the-art classic machine learning urbanization model in preliminary comparisons.

[2] Video-Based Reconstruction of the Trajectories Performed by Skiers
Link: https://arxiv.org/abs/2112.09647

Authors: Matteo Dunnhofer, Alberto Zurini, Maurizio Dunnhofer, Christian Micheloni
Affiliations: Machine Learning and Perception Lab, University of Udine, Udine, Italy; EYOF Organizing Committee, Amaro, Italy
Abstract: Trajectories are fundamental in different skiing disciplines. Tools enabling the analysis of such curves can enhance training activity and enrich broadcasting content. However, the solutions currently available are based on geo-localized sensors and surface models. In this short paper, we propose a video-based approach to reconstruct the sequence of points traversed by an athlete during their performance. Our prototype is constituted by a pipeline of deep learning-based algorithms to reconstruct the athlete's motion and to visualize it according to the camera perspective. This is achieved for different skiing disciplines in the wild, without any camera calibration. We tested our solution on broadcast and smartphone-captured videos of alpine skiing and ski jumping professional competitions. The qualitative results achieved show the potential of our solution.

[3] Towards Deep Learning-based 6D Bin Pose Estimation in 3D Scans
Link: https://arxiv.org/abs/2112.09598

Authors: Lukáš Gajdošech, Viktor Kocur, Martin Stuchlík, Lukáš Hudec, Martin Madaras
Affiliations: Skeletex Research, Slovakia; Physics and Informatics, Comenius University Bratislava, Slovakia; Slovak Technical University Bratislava, Slovakia; Brno University of Technology, Czech Republic
Note: Accepted to VISAPP 2022
Abstract: An automated robotic system needs to be as robust as possible and fail-safe in general while having relatively high precision and repeatability. Although deep learning-based methods are becoming the research standard for 3D scan and image processing tasks, the industry standard for processing this data is still analytically-based. Our paper claims that analytical methods are less robust and harder to test, update, and maintain. This paper focuses on a specific task: 6D pose estimation of a bin in 3D scans. Therefore, we present a high-quality dataset composed of synthetic data and real scans captured by a structured-light scanner with precise annotations. Additionally, we propose two different methods for 6D bin pose estimation: an analytical method as the industrial standard and a baseline data-driven method. Both approaches are cross-evaluated, and our experiments show that augmenting the training on real scans with synthetic data improves our proposed data-driven neural model. This position paper is preliminary, as the proposed methods are trained and evaluated on a relatively small initial dataset which we plan to extend in the future.

[4] Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Link: https://arxiv.org/abs/2112.09583

Authors: Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C. H. Hoi
Affiliations: Salesforce Research; The Australian National University
Abstract: Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a new visually-grounded pre-training task, prompting entity modeling (PEM), which aims to learn fine-grained region-entity alignment. To achieve this, we first introduce an entity prompter module, which is trained with VTC to produce the similarity between a video crop and text prompts instantiated with entity names. The PEM task then asks the model to predict the entity pseudo-labels (i.e., normalized similarity scores) for randomly-selected video crops. The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Our code and pre-trained models will be released.
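The instance-level VTC loss is a CLIP-style symmetric InfoNCE over paired unimodal embeddings. A minimal sketch; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive(video_emb, text_emb, temperature=0.07):
    """Instance-level video-text contrastive (VTC) loss (sketch).

    video_emb, text_emb: (B, D) paired unimodal embeddings; the matched
    pairs sit on the diagonal of the similarity matrix.
    """
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = v @ t.t() / temperature
    targets = torch.arange(len(v), device=v.device)
    # Symmetric InfoNCE: video-to-text and text-to-video directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```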

【5】 Enhanced Frame and Event-Based Simulator and Event-Based Video Interpolation Network 标题:增强的基于帧和事件的模拟器和基于事件的视频插值网络 链接:https://arxiv.org/abs/2112.09379

作者:Adam Radomski,Andreas Georgiou,Thomas Debrunner,Chenghan Li,Luca Longinotti,Minwon Seo,Moosung Kwak,Chang-Woo Shin,Paul K. J. Park,Hyunsurk Eric Ryu,Kynan Eng 机构:iniVation AG, Zurich, Switzerland, Samsung Electronics, -, Samsungjeonja-ro, Hwaseong-si, Gyeonggi-do ,-, South Korea, Seoul National University, Seoul, South Korea 备注:10 pages, 19 figures 摘要:快速的神经形态事件视觉传感器(Dynamic Vision Sensor,DVS)可以与较慢的传统基于帧的传感器相结合,实现比依赖固定运动近似(如光流)的传统方法更高质量的帧间插值。在这项工作中,我们提出了一个新的、先进的事件模拟器,可以生成由配备任意数量传感器(位于固定偏移处)的相机装置所记录的逼真场景。它包括一个新的可配置的基于帧的图像传感器模型,具有逼真的图像质量退化效果,以及一个具有更精确特性的扩展DVS模型。我们使用该模拟器训练一种新的重建模型,用于高帧率视频的端到端重建。与此前发表的方法不同,我们的方法不要求帧相机和DVS相机具有相同的光学器件、位置或相机分辨率,也不局限于与传感器保持固定距离的物体。我们表明,由我们的模拟器生成的数据可用于训练新模型,在公共数据集上得到与最新技术质量相当或更好的重建图像。我们还展示了其在真实传感器所记录数据上的泛化能力。 摘要:Fast neuromorphic event-based vision sensors (Dynamic Vision Sensor, DVS) can be combined with slower conventional frame-based sensors to enable higher-quality inter-frame interpolation than traditional methods relying on fixed motion approximations using e.g. optical flow. In this work we present a new, advanced event simulator that can produce realistic scenes recorded by a camera rig with an arbitrary number of sensors located at fixed offsets. It includes a new configurable frame-based image sensor model with realistic image quality reduction effects, and an extended DVS model with more accurate characteristics. We use our simulator to train a novel reconstruction model designed for end-to-end reconstruction of high-fps video. Unlike previously published methods, our method does not require the frame and DVS cameras to have the same optics, positions, or camera resolutions. It is also not limited to objects a fixed distance from the sensor. We show that data generated by our simulator can be used to train our new model, leading to reconstructed images on public datasets of equivalent or better quality than the state of the art. We also show our sensor generalizing to data recorded by real sensors.
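
上文DVS模型的基础是经典的事件相机成像原理:当某像素的对数亮度相对其上次触发事件时的参考值变化超过对比度阈值,即产生一个带极性的事件。下面用NumPy给出这一基本原理的简化示意(阈值为假设值;真实模拟器还包含噪声、带宽限制与帧间时间插值等更精细的特性):

```python
import numpy as np

def generate_events(frame, last_trigger, t, threshold=0.2):
    """从一帧新图像生成DVS事件的简化示意。

    frame: 当前帧线性亮度 (H, W);last_trigger: 每个像素上次
    触发事件时的对数亮度(原地更新);返回 (y, x, t, 极性) 列表。
    """
    log_i = np.log(frame + 1e-6)
    diff = log_i - last_trigger
    events = []
    for polarity, mask in ((+1, diff >= threshold), (-1, diff <= -threshold)):
        ys, xs = np.nonzero(mask)
        events += [(y, x, t, polarity) for y, x in zip(ys, xs)]
        last_trigger[mask] = log_i[mask]   # 触发后重置该像素的参考值
    return events

ref = np.log(np.full((4, 4), 0.5) + 1e-6)                # 初始参考帧
evts = generate_events(np.full((4, 4), 0.8), ref, t=1)   # 亮度上升, 产生正极性事件
```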

【6】 End-to-End Rate-Distortion Optimized Learned Hierarchical Bi-Directional Video Compression 标题:端到端率失真优化学习型分层双向视频压缩 链接:https://arxiv.org/abs/2112.09529

作者:M. Akın Yılmaz,A. Murat Tekalp 机构:Authors are with the Department of Electrical and Electronics Engineering, Koç University 备注:Accepted for publication in IEEE Transactions on Image Processing on 15 Dec. 2021 摘要:传统的视频压缩(VC)方法基于运动补偿变换编码,由于端到端优化问题的组合性质,运动估计、模式与量化参数选择以及熵编码等步骤只能单独优化。学习型VC则允许对非线性变换、运动和熵模型同时进行端到端率失真(R-D)优化训练。学习型VC的大多数工作基于在连续帧对上平均的R-D损失,对顺序视频编解码器进行端到端优化。众所周知,在传统VC中,分层双向编码优于顺序压缩,因为它能够同时使用过去和未来的参考帧。本文提出了一种学习型分层双向视频编解码器(LHBDC),结合了分层运动补偿预测和端到端优化的优点。实验结果表明,我们在PSNR和MS-SSIM两项指标上都取得了迄今为止学习型VC方案所报告的最佳R-D结果。与传统视频编解码器相比,我们端到端优化编解码器的R-D性能在PSNR和MS-SSIM上优于x265和SVT-HEVC编码器("veryslow"预设),在MS-SSIM上优于HM 16.23参考软件。我们给出的消融实验表明,学习掩蔽、流场下采样和时间流矢量预测等新工具带来了性能提升。复现我们结果的模型和说明见 https://github.com/makinyilmaz/LHBDC/ 摘要:Conventional video compression (VC) methods are based on motion compensated transform coding, and the steps of motion estimation, mode and quantization parameter selection, and entropy coding are optimized individually due to the combinatorial nature of the end-to-end optimization problem. Learned VC allows end-to-end rate-distortion (R-D) optimized training of nonlinear transform, motion and entropy model simultaneously. Most works on learned VC consider end-to-end optimization of a sequential video codec based on R-D loss averaged over pairs of successive frames. It is well-known in conventional VC that hierarchical, bi-directional coding outperforms sequential compression because of its ability to use both past and future reference frames. This paper proposes a learned hierarchical bi-directional video codec (LHBDC) that combines the benefits of hierarchical motion-compensated prediction and end-to-end optimization. Experimental results show that we achieve the best R-D results that are reported for learned VC schemes to date in both PSNR and MS-SSIM. Compared to conventional video codecs, the R-D performance of our end-to-end optimized codec outperforms those of both x265 and SVT-HEVC encoders ("veryslow" preset) in PSNR and MS-SSIM as well as HM 16.23 reference software in MS-SSIM. We present ablation studies showing performance gains due to proposed novel tools such as learned masking, flow-field subsampling, and temporal flow vector prediction. The models and instructions to reproduce our results can be found in https://github.com/makinyilmaz/LHBDC/
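
端到端率失真优化的核心目标通常写作 L = R + λD:码率项R由熵模型估计,失真项D在像素域度量。下面用PyTorch给出该损失的一个最小示意(熵模型输出与λ取值均为假设的占位,失真这里用MSE;论文还报告了面向MS-SSIM优化的结果):

```python
import torch

def rd_loss(x, x_hat, likelihoods, lmbda=0.01):
    """率失真损失 L = R + lambda * D 的示意实现。

    likelihoods: 熵模型给出的潜变量概率,码率按 -log2(p) 估计并
    折算为每像素比特数(bpp);lmbda 为假设的率失真权衡系数。
    """
    num_pixels = x.numel() / x.size(1)                  # B*H*W
    rate = -torch.log2(likelihoods).sum() / num_pixels  # bpp 估计
    distortion = torch.mean((x - x_hat) ** 2)           # MSE 失真
    return rate + lmbda * distortion

x = torch.rand(1, 3, 64, 64)
p = torch.rand(1, 192, 4, 4).clamp(0.01, 1.0)  # 占位的概率张量
print(rd_loss(x, x + 0.01 * torch.randn_like(x), p))
```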

医学相关(4篇)

【1】 CPPE-5: Medical Personal Protective Equipment Dataset 标题:CPPE-5:医用个人防护设备数据集 链接:https://arxiv.org/abs/2112.09569

作者:Rishit Dagli,Ali Mustufa Shaikh 机构:High School, Narayana Junior College, Mumbai, India, Student Community Lead, Postman Inc 备注:16 pages, 6 tables, 6 figures. Code and models are available at this https URL 摘要:我们提出了一个新的具有挑战性的数据集CPPE-5(医用个人防护装备),其目标是支持对医用个人防护装备进行细分子类(subordinate categorization)研究,这在其他关注宽泛类别的流行数据集(如PASCAL VOC、ImageNet、Microsoft COCO、OpenImages等)中是不可能的。为了使在此数据集上训练的模型易于在复杂场景的实际应用中使用,我们的数据集主要包含展示复杂场景的图像,每个场景中的多个对象都处于其自然上下文中。该数据集的图像收集侧重于:获取尽可能多的非标志性(non-iconic)图像,并确保所有图像都是真实生活图像,这一点不同于该领域的其他现有数据集。我们的数据集包括5个对象类别(防护服、面罩、手套、口罩和护目镜),每张图像都标注了一组边界框和正标签。我们对该数据集进行了详细分析,并将其与其他流行的宽类别数据集以及关注个人防护装备的数据集进行了比较;我们还发现,目前尚不存在此类公开可用的数据集。最后,我们在基线模型和最先进模型上分析了边界框结果的性能并比较了模型复杂度。我们的代码、数据和训练好的模型可在 https://git.io/cppe5-dataset 获取。 摘要:We present a new challenging dataset, CPPE-5 (Medical Personal Protective Equipment), with the goal to allow the study of subordinate categorization of medical personal protective equipments, which is not possible with other popular data sets that focus on broad level categories (such as PASCAL VOC, ImageNet, Microsoft COCO, OpenImages, etc). To make it easy for models trained on this dataset to be used in practical scenarios in complex scenes, our dataset mainly contains images that show complex scenes with several objects in each scene in their natural context. The image collection for this dataset focusing on: obtaining as many non-iconic images as possible and making sure all the images are real-life images unlike other existing datasets in this area. Our dataset includes 5 object categories (coveralls, face shield, gloves, mask, and goggles) and each image is annotated with a set of bounding boxes and positive labels. We present a detailed analysis of the dataset in comparison to other popular broad category datasets as well as datasets focusing on personal protective equipments, we also find that at present there exist no such publicly available datasets. Finally we also analyze performance and compare model complexities on baseline and state-of-the-art models for bounding box results. Our code, data, and trained models are available at https://git.io/cppe5-dataset.

【2】 Towards Launching AI Algorithms for Cellular Pathology into Clinical & Pharmaceutical Orbits 标题:将细胞病理学的人工智能算法推向临床和药物轨道 链接:https://arxiv.org/abs/2112.09496

作者:Amina Asif,Kashif Rajpoot,David Snead,Fayyaz Minhas,Nasir Rajpoot 机构:†Tissue Image Analytics Centre, Department of Computer Science, University of Warwick, UK, ‡Department of Computer Science, University of Birmingham, UK, §Department of Pathology, University Hospitals Coventry & Warwickshire, UK, ¶The Alan Turing Institute, UK 摘要:计算病理学(CPath)是一个新兴领域,通过计算算法对组织切片的数字化高分辨率图像进行处理和分析,研究组织病理学。最近CPath基于深度学习的发展成功地利用了组织学图像中大量原始像素数据来预测诊断、预测、治疗敏感性和患者分层领域的目标参数——预示着组织学和肿瘤学都将迎来一个新的数据驱动AI时代。数据作为燃料,人工智能作为引擎,CPath算法准备好起飞并最终发射到临床和制药轨道。在本文中,我们讨论了CPath的局限性和相关的挑战,使读者能够区分希望和炒作,并为未来的研究提供方向,以克服这一新兴领域面临的一些主要挑战,使其能够发射到两个轨道。 摘要:Computational Pathology (CPath) is an emerging field concerned with the study of tissue pathology via computational algorithms for the processing and analysis of digitized high-resolution images of tissue slides. Recent deep learning based developments in CPath have successfully leveraged sheer volume of raw pixel data in histology images for predicting target parameters in the domains of diagnostics, prognostics, treatment sensitivity and patient stratification -- heralding the promise of a new data-driven AI era for both histopathology and oncology. With data serving as the fuel and AI as the engine, CPath algorithms are poised to be ready for takeoff and eventual launch into clinical and pharmaceutical orbits. In this paper, we discuss CPath limitations and associated challenges to enable the readers distinguish hope from hype and provide directions for future research to overcome some of the major challenges faced by this budding field to enable its launch into the two orbits.

【3】 A Deep-Learning Framework for Improving COVID-19 CT Image Quality and Diagnostic Accuracy 标题:一种提高冠状病毒CT图像质量和诊断准确性的深度学习框架 链接:https://arxiv.org/abs/2112.09216

作者:Garvit Goel,Jingyuan Qi,Wu-chun Feng,Guohua Cao 机构:Dept. of ECE, Virginia Tech, Dept. of CS, ECE, and BEAM 备注:10 pages 摘要:我们提出了一个用于COVID-19快速准确CT检测的深度学习计算框架(DL-FACT)。该基于CT的DL框架旨在通过基于DL的CT图像增强与分类方法,提高COVID-19(及其变种)的检测速度和准确性。图像增强网络改编自DDnet(DenseNet与反卷积网络的简称)。为了证明其速度和准确性,我们在多个来源的COVID-19 CT图像上评估了DL-FACT。结果表明,DL-FACT可将周转时间从数天显著缩短到数分钟,并将COVID-19检测准确率提高至91%。DL-FACT可作为医疗专业人员诊断和监测COVID-19的软件工具。 摘要:We present a deep-learning based computing framework for fast-and-accurate CT (DL-FACT) testing of COVID-19. Our CT-based DL framework was developed to improve the testing speed and accuracy of COVID-19 (plus its variants) via a DL-based approach for CT image enhancement and classification. The image enhancement network is adapted from DDnet, short for DenseNet and Deconvolution based network. To demonstrate its speed and accuracy, we evaluated DL-FACT across several sources of COVID-19 CT images. Our results show that DL-FACT can significantly shorten the turnaround time from days to minutes and improve the COVID-19 testing accuracy up to 91%. DL-FACT could be used as a software tool for medical professionals in diagnosing and monitoring COVID-19.

【4】 Coherence Learning using Keypoint-based Pooling Network for Accurately Assessing Radiographic Knee Osteoarthritis 标题:基于关键点池化网络的一致性学习在膝关节骨关节炎X线影像精确评估中的应用 链接:https://arxiv.org/abs/2112.09177

作者:Kang Zheng,Yirui Wang,Chen-I Hsieh,Le Lu,Jing Xiao,Chang-Fu Kuo,Shun Miao 机构: PAII Inc., Bethesda, MD, USA, Chang Gung Memorial Hospital, Linkou, Taiwan, ROC, Ping An Technology, Shenzhen, China 备注:extension of RSNA 2020 report "Consistent and Coherent Computer-Aided Knee Osteoarthritis Assessment from Plain Radiographs" 摘要:膝关节骨关节炎(OA)是一种常见的退行性关节疾病,影响着全世界大量老年人。对膝关节OA严重程度进行准确的影像学评估在慢性病患者管理中起着关键作用。目前临床采用的膝关节OA分级系统依赖观察者的主观判断,存在评分者间分歧。在这项工作中,我们提出了一种计算机辅助诊断方法,可同时对复合和细粒度OA等级提供更准确、更一致的评估。我们提出了一种新的半监督学习方法,通过从未标记数据中学习,利用复合与细粒度OA等级中潜在的一致性。通过用预训练高斯混合模型的对数概率表示等级一致性,我们构造了一个非一致性损失,以将未标记数据纳入训练。所提方法还描述了一个基于关键点的池化网络:深层图像特征从疾病相关的目标关键点(沿膝关节提取)处池化,以提供更对齐且更具病理信息量的特征表示,用于准确的OA等级评估。该方法在公开的骨关节炎倡议(OAI)数据上进行了全面评估,这是一项针对4796名受试者的多中心十年观察研究。实验结果表明,我们的方法相比此前强大的基于全图的深度分类网络基线(如ResNet-50)有显著提升。 摘要:Knee osteoarthritis (OA) is a common degenerate joint disorder that affects a large population of elderly people worldwide. Accurate radiographic assessment of knee OA severity plays a critical role in chronic patient management. Current clinically-adopted knee OA grading systems are observer subjective and suffer from inter-rater disagreements. In this work, we propose a computer-aided diagnosis approach to provide more accurate and consistent assessments of both composite and fine-grained OA grades simultaneously. A novel semi-supervised learning method is presented to exploit the underlying coherence in the composite and fine-grained OA grades by learning from unlabeled data. By representing the grade coherence using the log-probability of a pre-trained Gaussian Mixture Model, we formulate an incoherence loss to incorporate unlabeled data in training. The proposed method also describes a keypoint-based pooling network, where deep image features are pooled from the disease-targeted keypoints (extracted along the knee joint) to provide more aligned and pathologically informative feature representations, for accurate OA grade assessments. The proposed method is comprehensively evaluated on the public Osteoarthritis Initiative (OAI) data, a multi-center ten-year observational study on 4,796 subjects. Experimental results demonstrate that our method leads to significant improvements over previous strong whole image-based deep classification network baselines (like ResNet-50).
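
上文"用预训练高斯混合模型(GMM)的对数概率刻画等级一致性"的思路可以示意如下:先在有标记样本的(复合等级, 细粒度等级)组合上拟合GMM,再把未标记样本预测组合的负对数似然作为非一致性损失。以下为scikit-learn示意,组件数与数据均为假设的占位:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# 占位:真实场景中应为有标记数据上的(复合等级, 细粒度等级)预测对
labeled_grades = np.random.rand(500, 2)
gmm = GaussianMixture(n_components=5, random_state=0).fit(labeled_grades)

def incoherence_loss(pred_grades):
    """负对数似然越大,说明等级组合越不符合已学到的一致性模式。"""
    return float(np.mean(-gmm.score_samples(pred_grades)))

print(incoherence_loss(np.random.rand(32, 2)))  # 未标记样本上的损失
```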

GAN|对抗|攻击|生成相关(5篇)

【1】 Information-theoretic stochastic contrastive conditional GAN: InfoSCC-GAN 标题:信息论随机对比条件GAN:InfoSCC-GAN 链接:https://arxiv.org/abs/2112.09653

作者:Vitaliy Kinakh,Mariia Drozdova,Guillaume Quétant,Tobias Golling,Slava Voloshynovskiy 机构:Department of Computer Science, Department of Particle Physics, University of Geneva, Switzerland 摘要:条件生成是生成问题的一个子类,其生成输出受属性信息约束。本文提出了一种具有可探索潜在空间的随机对比条件生成对抗网络(InfoSCC-GAN)。InfoSCC-GAN架构基于建立在InfoNCE范式上的无监督对比编码器、一个属性分类器和一个EigenGAN生成器。我们提出了一种新的训练方法:利用预训练的对比编码器和预训练的分类器,每$n$次迭代使用外部或内部属性对生成器进行一次正则化。所提出的InfoSCC-GAN基于互信息最大化的信息论表述推导而来,即最大化输入数据与潜在空间表示之间、以及潜在空间与生成数据之间的互信息。由此,我们展示了训练目标函数与上述信息论表述之间的联系。实验结果表明,在AFHQ和CelebA数据集的图像生成上,InfoSCC-GAN优于原始(vanilla)EigenGAN。此外,我们通过消融实验考察了判别器结构和损失函数的影响。最后,我们证明:得益于EigenGAN生成器,与普通确定性GAN不同,该框架具备随机生成能力;同时与现有框架不同,其编码器、分类器和生成器可独立训练。代码、实验结果和演示可在线访问 https://github.com/vkinakh/InfoSCC-GAN. 摘要:Conditional generation is a subclass of generative problems where the output of the generation is conditioned by the attribute information. In this paper, we present a stochastic contrastive conditional generative adversarial network (InfoSCC-GAN) with an explorable latent space. The InfoSCC-GAN architecture is based on an unsupervised contrastive encoder built on the InfoNCE paradigm, an attribute classifier and an EigenGAN generator. We propose a novel training method, based on generator regularization using external or internal attributes every $n$-th iteration, using a pre-trained contrastive encoder and a pre-trained classifier. The proposed InfoSCC-GAN is derived based on an information-theoretic formulation of mutual information maximization between input data and latent space representation as well as latent space and generated data. Thus, we demonstrate a link between the training objective functions and the above information-theoretic formulation. The experimental results show that InfoSCC-GAN outperforms the "vanilla" EigenGAN in the image generation on AFHQ and CelebA datasets. In addition, we investigate the impact of discriminator architectures and loss functions by performing ablation studies. Finally, we demonstrate that thanks to the EigenGAN generator, the proposed framework enjoys a stochastic generation in contrast to vanilla deterministic GANs yet with the independent training of encoder, classifier, and generator in contrast to existing frameworks. Code, experimental results, and demos are available online at https://github.com/vkinakh/InfoSCC-GAN.
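
摘要中"每$n$次迭代进行一次生成器正则化"的训练调度可用如下骨架说明(PyTorch;其中网络结构、对抗损失与属性损失均为演示用的占位,并非论文实现):

```python
import torch
import torch.nn as nn

generator = nn.Linear(16, 32)            # 占位生成器
classifier = nn.Linear(32, 10)           # 占位的"预训练"属性分类器
opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
ce = nn.CrossEntropyLoss()
n, reg_weight = 5, 0.1                   # 假设的正则化周期与权重

for it in range(100):
    z = torch.randn(8, 16)
    fake = generator(z)
    g_loss = fake.pow(2).mean()          # 占位的对抗损失
    if it % n == 0:                      # 每 n 次迭代施加一次属性正则化
        attrs = torch.randint(0, 10, (8,))
        g_loss = g_loss + reg_weight * ce(classifier(fake), attrs)
    opt.zero_grad()
    g_loss.backward()
    opt.step()
```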

【2】 Dynamics-aware Adversarial Attack of 3D Sparse Convolution Network 标题:三维稀疏卷积网络的动态感知对抗攻击 链接:https://arxiv.org/abs/2112.09428

作者:An Tao,Yueqi Duan,He Wang,Ziyi Wu,Pengliang Ji,Haowen Sun,Jie Zhou,Jiwen Lu 机构:Tsinghua University, Peking University, University of Toronto, Beihang University 摘要:本文研究了深层神经网络中的动态感知对抗攻击问题。大多数现有的对抗性攻击算法都是在一个基本假设下设计的——网络体系结构在整个攻击过程中都是固定的。然而,该假设不适用于许多最近提出的网络,例如3D稀疏卷积网络,其包含依赖输入的执行以提高计算效率。这导致了严重的梯度滞后问题,使得在当前步骤学习的攻击由于随后的架构更改而无效。为了解决这个问题,我们提出了一种超前梯度法(LGM),并展示了滞后梯度的显著影响。更具体地说,我们重新制定梯度以了解网络体系结构的潜在动态变化,这样,当网络体系结构动态变化时,学习到的攻击比不知道动态的方法更好地“引导”下一步。在各种数据集上的大量实验表明,我们的LGM在语义分割和分类方面取得了令人印象深刻的性能。与动态无意识方法相比,LGM在ScanNet和S3DIS数据集上的mIoU平均降低约20%。LGM的性能也优于最近的点云攻击。 摘要:In this paper, we investigate the dynamics-aware adversarial attack problem in deep neural networks. Most existing adversarial attack algorithms are designed under a basic assumption -- the network architecture is fixed throughout the attack process. However, this assumption does not hold for many recently proposed networks, e.g. 3D sparse convolution network, which contains input-dependent execution to improve computational efficiency. It results in a serious issue of lagged gradient, making the learned attack at the current step ineffective due to the architecture changes afterward. To address this issue, we propose a Leaded Gradient Method (LGM) and show the significant effects of the lagged gradient. More specifically, we re-formulate the gradients to be aware of the potential dynamic changes of network architectures, so that the learned attack better "leads" the next step than the dynamics-unaware methods when network architecture changes dynamically. Extensive experiments on various datasets show that our LGM achieves impressive performance on semantic segmentation and classification. Compared with the dynamic-unaware methods, LGM achieves about 20% lower mIoU averagely on the ScanNet and S3DIS datasets. LGM also outperforms the recent point cloud attacks.

【3】 SuperStyleNet: Deep Image Synthesis with Superpixel Based Style Encoder 标题:SuperStyleNet:基于超像素风格编码器的深度图像合成 链接:https://arxiv.org/abs/2112.09367

作者:Jonghyun Kim,Gen Li,Cheolkon Jung,Joongkyu Kim 机构: Department of Electrical and Computer, Sungkyunkwan University, Suwon, South Korea, School of Informatics, University of Edinburgh, Edinburgh, UK, School of Electronic Engineering, Xidian University, Xian, Shaanxi, China 备注:Accepted to BMVC 2021. Codes are available at this https URL 摘要:现有的图像合成方法利用基于卷积和池化层堆叠的风格编码器从输入图像生成风格编码。然而,编码得到的向量不一定包含相应图像的局部信息,因为小尺度目标往往在这种降尺度过程中被"冲刷"掉。在本文中,我们提出了一种基于超像素风格编码器的深度图像合成方法,称为SuperStyleNet。首先,我们基于超像素直接从原始图像中提取风格编码,以兼顾局部目标。其次,我们基于图分析恢复向量化风格编码中的空间关系。由此,所提网络通过将风格编码映射到语义标签来实现高质量的图像合成。实验结果表明,该方法在视觉质量和定量指标上均优于现有方法。此外,我们还通过调整风格编码实现了精细的空间风格编辑。 摘要:Existing methods for image synthesis utilized a style encoder based on stacks of convolutions and pooling layers to generate style codes from input images. However, the encoded vectors do not necessarily contain local information of the corresponding images since small-scale objects are tended to "wash away" through such downscaling procedures. In this paper, we propose deep image synthesis with superpixel based style encoder, named as SuperStyleNet. First, we directly extract the style codes from the original image based on superpixels to consider local objects. Second, we recover spatial relationships in vectorized style codes based on graphical analysis. Thus, the proposed network achieves high-quality image synthesis by mapping the style codes into semantic labels. Experimental results show that the proposed method outperforms state-of-the-art ones in terms of visual quality and quantitative measurements. Furthermore, we achieve elaborate spatial style editing by adjusting style codes.
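
"基于超像素直接从原始图像提取局部风格编码"的思路,可以用SLIC超像素加逐区域特征平均来示意(scikit-image;分割数等参数为假设,并以区域平均颜色代替网络学到的风格特征):

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_style_codes(image, n_segments=100):
    """对每个SLIC超像素计算平均颜色,作为局部风格编码的简化示意。

    image: (H, W, 3) 的 float RGB 图像;返回 (K, 3) 编码矩阵。
    """
    labels = slic(image, n_segments=n_segments, start_label=0)
    codes = []
    for k in range(labels.max() + 1):
        mask = labels == k
        if mask.any():
            codes.append(image[mask].mean(axis=0))  # 区域内取均值
    return np.stack(codes)

codes = superpixel_style_codes(np.random.rand(64, 64, 3))
```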

【4】 All You Need is RAW: Defending Against Adversarial Attacks with Camera Image Pipelines 标题:您所需要的只是RAW:使用相机图像管线防御对抗攻击 链接:https://arxiv.org/abs/2112.09219

作者:Yuxuan Zhang,Bo Dong,Felix Heide 机构:Princeton University 摘要:现有的用于计算机视觉任务的神经网络容易受到对抗攻击:向输入图像添加不可察觉的扰动,就可能使这些方法对原本能正确预测的图像做出错误预测。已有多种防御方法提出图像到图像的映射方案,或在训练过程中加入这些扰动,或在预处理去噪步骤中将其消除。在这样做时,现有方法往往忽略了一点:当今数据集中的自然RGB图像并非直接采集所得,而是由RAW颜色滤波阵列采集数据恢复而来,且这些数据在采集过程中会受到各种退化的影响。在这项工作中,我们利用这种RAW数据分布作为对抗防御的经验先验。具体来说,我们提出了一种与模型无关的对抗防御方法,该方法将输入RGB图像映射到拜耳RAW空间,再通过学习得到的相机图像信号处理(ISP)管线映射回输出RGB,以消除潜在的对抗模式。该方法作为一个现成的预处理模块,与特定于模型的对抗训练方法不同,无需对抗图像即可训练。因此,该方法无需额外再训练即可推广到未见过的任务。在大规模数据集(如ImageNet、COCO)上针对不同视觉任务(如分类、语义分割、目标检测)的实验验证了该方法在各任务领域均显著优于现有方法。 摘要:Existing neural networks for computer vision tasks are vulnerable to adversarial attacks: adding imperceptible perturbations to the input images can fool these methods to make a false prediction on an image that was correctly predicted without the perturbation. Various defense methods have proposed image-to-image mapping methods, either including these perturbations in the training process or removing them in a preprocessing denoising step. In doing so, existing methods often ignore that the natural RGB images in today's datasets are not captured but, in fact, recovered from RAW color filter array captures that are subject to various degradations in the capture. In this work, we exploit this RAW data distribution as an empirical prior for adversarial defense. Specifically, we proposed a model-agnostic adversarial defensive method, which maps the input RGB images to Bayer RAW space and back to output RGB using a learned camera image signal processing (ISP) pipeline to eliminate potential adversarial patterns. The proposed method acts as an off-the-shelf preprocessing module and, unlike model-specific adversarial training methods, does not require adversarial images to train. As a result, the method generalizes to unseen tasks without additional retraining. Experiments on large-scale datasets (e.g., ImageNet, COCO) for different vision tasks (e.g., classification, semantic segmentation, object detection) validate that the method significantly outperforms existing methods across task domains.

【5】 TAFIM: Targeted Adversarial Attacks against Facial Image Manipulations 标题:TAFIM:针对人脸图像操纵的定向对抗攻击 链接:https://arxiv.org/abs/2112.09151

作者:Shivangi Aneja,Lev Markhasin,Matthias Niessner 机构:Matthias Nießner, Technical University of Munich, Sony Europe RDC Stuttgart 备注:Paper Video: this https URL Project Page: this https URL 摘要:人脸图像操纵方法尽管在计算机图形学中有许多有益的应用,但也可能因侵犯个人隐私或传播虚假信息而引起担忧。在这项工作中,我们提出一种主动防御,从源头上防止人脸操纵的发生。为此,我们引入了一种新的数据驱动方法,生成嵌入在原始图像中的图像特定扰动。其关键思想是:这些受保护的图像会使操纵模型生成预定义的操纵目标(在本例中为颜色均匀的输出图像)而非实际的操纵结果,从而阻止人脸操纵。与为每张图像单独优化噪声模式的传统对抗攻击相比,我们的通用模型只需一次前向传递,因此运行速度快几个数量级,并且便于集成到图像处理流程中,即使在智能手机等资源受限的设备上也是如此。此外,我们建议利用可微的压缩近似,使生成的扰动对常见的图像压缩具有鲁棒性。我们进一步证明,生成的扰动可以同时防御多种操纵方法。 摘要:Face image manipulation methods, despite having many beneficial applications in computer graphics, can also raise concerns by affecting an individual's privacy or spreading disinformation. In this work, we propose a proactive defense to prevent face manipulation from happening in the first place. To this end, we introduce a novel data-driven approach that produces image-specific perturbations which are embedded in the original images. The key idea is that these protected images prevent face manipulation by causing the manipulation model to produce a predefined manipulation target (uniformly colored output image in our case) instead of the actual manipulation. Compared to traditional adversarial attacks that optimize noise patterns for each image individually, our generalized model only needs a single forward pass, thus running orders of magnitude faster and allowing for easy integration in image processing stacks, even on resource-constrained devices like smartphones. In addition, we propose to leverage a differentiable compression approximation, hence making generated perturbations robust to common image compression. We further show that a generated perturbation can simultaneously prevent against multiple manipulation methods.

自动驾驶|车辆|车道检测等(1篇)

【1】 Human-Vehicle Cooperative Visual Perception for Shared Autonomous Driving 标题:共享自动驾驶的人车协同视觉感知 链接:https://arxiv.org/abs/2112.09298

作者:Yiyue Zhao,Cailin Lei,Yu Shen,Yuchuan Du,Qijun Chen 机构:Tongji University 摘要:随着环境感知等关键技术的发展,自动驾驶汽车的自动化水平不断提高。然而,在达到高度自动驾驶之前,人工驾驶仍需参与驾驶过程,以确保人车共驾的安全。现有的人车协同驾驶研究主要集中在汽车工程和驾驶员行为方面,在视觉感知领域的研究较少。由于在复杂道路交通冲突场景中表现不佳,协同视觉感知需要进一步研究。此外,自动驾驶感知系统无法正确理解人工驾驶的特点。基于上述背景,本文针对复杂道路交通场景,直接提出了一种基于迁移学习方法和图像融合算法的人车协同视觉感知方法,以增强共享自动驾驶的视觉感知能力。基于迁移学习的目标检测mAP达到75.52%,为视觉融合奠定了坚实基础。融合实验进一步表明,人车协同视觉感知能够反映最危险的区域,并更精确地预测冲突对象的轨迹。本研究开创了一种面向共享自动驾驶的协同视觉感知解决方案,并在真实世界的复杂交通冲突场景中进行了实验,能够更好地支持后续的规划与控制,提高自动驾驶车辆的安全性。 摘要:With the development of key technologies like environment perception, the automation level of autonomous vehicles has been increasing. However, before reaching highly autonomous driving, manual driving still needs to participate in the driving process to ensure the safety of human-vehicle shared driving. The existing human-vehicle cooperative driving focuses on auto engineering and drivers' behaviors, with few research studies in the field of visual perception. Due to the bad performance in the complex road traffic conflict scenarios, cooperative visual perception needs to be studied further. In addition, the autonomous driving perception system cannot correctly understand the characteristics of manual driving. Based on the background above, this paper directly proposes a human-vehicle cooperative visual perception method to enhance the visual perception ability of shared autonomous driving based on the transfer learning method and the image fusion algorithm for the complex road traffic scenarios. Based on transfer learning, the mAP of object detection reaches 75.52% and lays a solid foundation for visual fusion. And the fusion experiment further reveals that human-vehicle cooperative visual perception reflects the riskiest zone and predicts the conflict object's trajectory more precisely. This study pioneers a cooperative visual perception solution for shared autonomous driving and experiments in real-world complex traffic conflict scenarios, which can better support the following planning and controlling and improve the safety of autonomous vehicles.

Attention注意力(1篇)

【1】 Towards More Effective PRM-based Crowd Counting via A Multi-resolution Fusion and Attention Network 标题:通过多分辨率融合和注意力网络实现更有效的基于PRM的人群计数 链接:https://arxiv.org/abs/2112.09664

作者:Usman Sajid,Guanghui Wang 机构:Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS, USA, Department of Computer Science, Ryerson University, Toronto, ON, Canada M,B ,K 摘要:本文致力于改进近期基于即插即用图像块重缩放模块(PRM)的人群计数方法。为了充分发挥PRM的潜力,并在人群变化、大视角、极端遮挡和背景区域混乱等具有挑战性的图像上获得更可靠、更准确的结果,我们提出了一种新的基于PRM的多分辨率、多任务人群计数网络,更有效地利用了PRM模块。该模型由三个深层分支组成,每个分支生成不同分辨率的特征图。这些分支相互进行特征级融合,以构建用于最终人群估计的关键集体知识。此外,早期特征图经过视觉注意力处理,以加强后期通道对前景区域的理解。通过在四个基准数据集上进行大量数值和可视化评估,这些深层分支与PRM模块和早期注意力块(early-attended blocks)的集成被证明比原始的基于PRM的方案更有效。所提方法在RMSE评估指标上带来了12.6%的显著改进,并且在跨数据集评估中优于最先进的方法。 摘要:The paper focuses on improving the recent plug-and-play patch rescaling module (PRM) based approaches for crowd counting. In order to make full use of the PRM potential and obtain more reliable and accurate results for challenging images with crowd-variation, large perspective, extreme occlusions, and cluttered background regions, we propose a new PRM based multi-resolution and multi-task crowd counting network by exploiting the PRM module with more effectiveness and potency. The proposed model consists of three deep-layered branches with each branch generating feature maps of different resolutions. These branches perform a feature-level fusion across each other to build the vital collective knowledge to be used for the final crowd estimate. Additionally, early-stage feature maps undergo visual attention to strengthen the later-stage channels understanding of the foreground regions. The integration of these deep branches with the PRM module and the early-attended blocks proves to be more effective than the original PRM based schemes through extensive numerical and visual evaluations on four benchmark datasets. The proposed approach yields a significant improvement by a margin of 12.6% in terms of the RMSE evaluation criterion. It also outperforms state-of-the-art methods in cross-dataset evaluations.
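
摘要中"三个分支在不同分辨率提取特征并做特征级融合"的结构可用如下PyTorch模块示意(通道数、分支深度与上采样方式均为假设的简化):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResFusion(nn.Module):
    """三分支多分辨率特征融合的简化示意,输出人群密度图。"""
    def __init__(self, c=32):
        super().__init__()
        self.b1 = nn.Conv2d(3, c, 3, padding=1)             # 1x 分辨率
        self.b2 = nn.Conv2d(3, c, 3, stride=2, padding=1)   # 1/2 分辨率
        self.b3 = nn.Conv2d(3, c, 3, stride=4, padding=1)   # 1/4 分辨率
        self.head = nn.Conv2d(3 * c, 1, 1)

    def forward(self, x):
        f1 = self.b1(x)
        size = f1.shape[-2:]
        # 低分辨率分支上采样回原尺寸后再做特征级融合
        f2 = F.interpolate(self.b2(x), size=size, mode='bilinear',
                           align_corners=False)
        f3 = F.interpolate(self.b3(x), size=size, mode='bilinear',
                           align_corners=False)
        return self.head(torch.cat([f1, f2, f3], dim=1))

density = MultiResFusion()(torch.randn(1, 3, 128, 128))
```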

人脸|人群计数(2篇)

【1】 Cinderella's shoe won't fit Soundarya: An audit of facial processing tools on Indian faces 标题:灰姑娘的鞋子不适合Soundarya:印度人脸部处理工具的审计 链接:https://arxiv.org/abs/2112.09326

作者:Gaurav Jain,Smriti Parsheera 机构: School of Public Policy, Indian Institute of Technology 备注:17 pages, 2 figures and 7 tables 摘要:印度越来越多地采用面部处理系统,其中充满了对隐私、透明度、问责制和程序保障缺失的担忧。与此同时,我们对这些技术在印度13.4亿多人口的不同面部特征、个体特性和肤色上的表现知之甚少。在本文中,我们在一个印度人脸数据集上测试了四种商用面部处理工具的人脸检测和人脸分析功能。这些工具在人脸检测以及性别和年龄分类功能中显示出不同的错误率。印度女性面孔的性别分类错误率始终高于男性,女性错误率最高达14.68%。在某些情况下,这一错误率远高于此前研究中其他国籍女性的结果。年龄分类错误也很高:尽管考虑了相对真实年龄正负10年的可接受误差幅度,年龄预测失败率仍在14.3%到42.2%之间。这些发现表明,面部处理工具的准确性有限,对某些人口群体尤其如此,在采用此类系统之前需要进行更审慎的思考。 摘要:The increasing adoption of facial processing systems in India is fraught with concerns of privacy, transparency, accountability, and missing procedural safeguards. At the same time, we also know very little about how these technologies perform on the diverse features, characteristics, and skin tones of India's 1.34 billion-plus population. In this paper, we test the face detection and facial analysis functions of four commercial facial processing tools on a dataset of Indian faces. The tools display varying error rates in the face detection and gender and age classification functions. The gender classification error rate for Indian female faces is consistently higher compared to that of males -- the highest female error rate being 14.68%. In some cases, this error rate is much higher than that shown by previous studies for females of other nationalities. Age classification errors are also high. Despite taking into account an acceptable error margin of plus or minus 10 years from a person's actual age, age prediction failures are in the range of 14.3% to 42.2%. These findings point to the limited accuracy of facial processing tools, particularly for certain demographic groups, and the need for more critical thinking before adopting such systems.
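
摘要中"正负10年可接受误差下的年龄预测失败率"这一指标的计算方式可示意如下(NumPy;数据为占位示例):

```python
import numpy as np

def age_failure_rate(pred_ages, true_ages, margin=10):
    """预测年龄与真实年龄之差超过 ±margin 岁即记为一次失败。"""
    diff = np.abs(np.asarray(pred_ages) - np.asarray(true_ages))
    return float((diff > margin).mean())

# 占位示例:真实审计中应使用工具返回的年龄与标注年龄
print(age_failure_rate([25, 40, 62], [30, 55, 60]))  # -> 0.333...
```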

【2】 PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision 标题:PeopleSansPeople:以人为中心的计算机视觉合成数据生成器 链接:https://arxiv.org/abs/2112.09290

作者:Salehe Erfanian Ebadi,You-Cyuan Jhang,Alex Zook,Saurav Dhakad,Adam Crespi,Pete Parisi,Steven Borkman,Jonathan Hogins,Sujoy Ganguly 机构:Unity Technologies 备注:PeopleSansPeople template Unity environment, benchmark binaries, and source code is available at: this https URL 摘要:近年来,在大规模标注数据集的帮助下,行人检测和人体姿态估计取得了长足进步。然而,这些数据集对人类活动、姿态或场景多样性没有任何保证或分析。此外,隐私、法律、安全和伦理问题可能会限制收集更多人类数据的能力。一种新兴的、可缓解上述部分问题的真实数据替代方案是合成数据。然而,构建合成数据生成器极具挑战性,这阻碍了研究人员对其有用性的探索。因此,我们发布了一个以人为中心的合成数据生成器PeopleSansPeople,其中包含可直接用于仿真的3D人体资源、参数化的照明和相机系统,并能生成2D和3D边界框、实例和语义分割以及COCO姿态标签。使用PeopleSansPeople,我们用Detectron2的Keypoint R-CNN变体[1]进行了合成数据训练基准测试。我们发现,先用合成数据预训练网络、再在目标真实数据上微调(小样本迁移至有限的COCO person train子集[2]),可获得$60.37\pm 0.48$的keypoint AP(COCO test-dev2017),优于仅用相同真实数据训练的模型(keypoint AP为$55.80$)以及用ImageNet预训练的模型(keypoint AP为$57.50$)。这一免费提供的数据生成器有望推动以人为中心的计算机视觉这一关键领域中,从仿真到真实迁移学习这一新兴方向的广泛研究。 摘要:In recent years, person detection and human pose estimation have made great strides, helped by large-scale labeled datasets. However, these datasets had no guarantees or analysis of human activities, poses, or context diversity. Additionally, privacy, legal, safety, and ethical concerns may limit the ability to collect more human data. An emerging alternative to real-world data that alleviates some of these issues is synthetic data. However, creation of synthetic data generators is incredibly challenging and prevents researchers from exploring their usefulness. Therefore, we release a human-centric synthetic data generator PeopleSansPeople which contains simulation-ready 3D human assets, a parameterized lighting and camera system, and generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels. Using PeopleSansPeople, we performed benchmark synthetic data training using a Detectron2 Keypoint R-CNN variant [1]. We found that pre-training a network using synthetic data and fine-tuning on target real-world data (few-shot transfer to limited subsets of COCO-person train [2]) resulted in a keypoint AP of $60.37 \pm 0.48$ (COCO test-dev2017) outperforming models trained with the same real data alone (keypoint AP of $55.80$) and pre-trained with ImageNet (keypoint AP of $57.50$). This freely-available data generator should enable a wide range of research into the emerging field of simulation to real transfer learning in the critical area of human-centric computer vision.

超分辨率|去噪|去模糊|去雾(2篇)

【1】 Super-resolution reconstruction of cytoskeleton image based on A-net deep learning network 标题:基于A网深度学习网络的细胞骨架图像超分辨率重建 链接:https://arxiv.org/abs/2112.09574

作者:Qian Chen,Haoxin Bai,Bingchen Che,Tianyun Zhao,Ce Zhang,Kaige Wang,Jintao Bai,Wei Zhao 机构:. School of Automation, Northwestern Polytechnical University, Xi’an , China, . State Key Laboratory of Photon-Technology in Western China Energy, International, Collaborative Center on Photoelectric Technology and Nano Functional Materials, Institute of 备注:The manuscript has 17 pages, 10 figures and 58 references 摘要:迄今为止,纳米尺度的活细胞成像仍然具有挑战性。尽管超分辨率显微镜方法能够在光学分辨率极限以下显示亚细胞结构,但空间分辨率仍然远远不足以在体内重建生物分子的结构(即约24 nm厚的微管纤维)。在本研究中,我们提出了一个A-net网络,并表明将A-net深度学习网络与基于退化模型的DWDC算法相结合,可以显著提高共焦显微镜捕获的细胞骨架图像的分辨率。利用DWDC算法构建新的数据集,并利用A-net神经网络的特性(即相当少的层),我们成功地去除了噪声和絮状结构,这些结构最初会干扰原始图像中的细胞结构,使用相对较小的数据集将空间分辨率提高了10倍。因此,我们得出结论,所提出的将A-net神经网络与DWDC方法相结合的算法是从低分辨率图像中提取生物分子、细胞和器官结构细节的合适且通用的方法。 摘要:To date, live-cell imaging at the nanometer scale remains challenging. Even though super-resolution microscopy methods have enabled visualization of subcellular structures below the optical resolution limit, the spatial resolution is still far from enough for the structural reconstruction of biomolecules in vivo (i.e. ~24 nm thickness of microtubule fiber). In this study, we proposed an A-net network and showed that the resolution of cytoskeleton images captured by a confocal microscope can be significantly improved by combining the A-net deep learning network with the DWDC algorithm based on degradation model. Utilizing the DWDC algorithm to construct new datasets and taking advantage of A-net neural network's features (i.e., considerably fewer layers), we successfully removed the noise and flocculent structures, which originally interfere with the cellular structure in the raw image, and improved the spatial resolution by 10 times using relatively small dataset. We, therefore, conclude that the proposed algorithm that combines A-net neural network with the DWDC method is a suitable and universal approach for exacting structural details of biomolecules, cells and organs from low-resolution images.

【2】 A Novel Image Denoising Algorithm Using Concepts of Quantum Many-Body Theory 标题:一种利用量子多体理论概念的图像去噪新算法 链接:https://arxiv.org/abs/2112.09254

作者:Sayantan Dutta,Adrian Basarab,Bertrand Georgeot,Denis Kouamé 机构: Basarab is with the Université de Lyon, Université Claude Bernard Lyon 1 备注:20 pages, 16 figures; complements and expands arXiv:2108.13778 摘要:稀疏表示在去噪等成像应用中是一种非常有效的方法。近年来,随着计算能力的增长,利用从一幅或多幅图像中提取的图像块之间的冗余来增强稀疏性的数据驱动策略日益突出。本文受量子多体理论启发,提出了一种利用这种依赖于图像的基的新型图像去噪算法。基于图像块分析,局部图像邻域中的相似性度量通过一个类似于量子力学中相互作用项的量来形式化,它能有效保留真实图像的局部结构。这种自适应基的多功能性使其无需任何调整即可应用于与图像无关或依赖于图像的噪声场景。我们与现有方法进行了严格比较,证明了所提算法在不同图像特性、噪声统计和强度下的去噪能力。我们说明了超参数的性质及其对去噪性能的影响,并给出了在缺乏真值(ground truth)的实验设置中自动选择接近最优超参数值的规则。最后,我们展示了该方法处理实际图像去噪问题的能力,例如医学超声图像去斑(despeckling)应用。 摘要:Sparse representation of real-life images is a very effective approach in imaging applications, such as denoising. In recent years, with the growth of computing power, data-driven strategies exploiting the redundancy within patches extracted from one or several images to increase sparsity have become more prominent. This paper presents a novel image denoising algorithm exploiting such an image-dependent basis inspired by the quantum many-body theory. Based on patch analysis, the similarity measures in a local image neighborhood are formalized through a term akin to interaction in quantum mechanics that can efficiently preserve the local structures of real images. The versatile nature of this adaptive basis extends the scope of its application to image-independent or image-dependent noise scenarios without any adjustment. We carry out a rigorous comparison with contemporary methods to demonstrate the denoising capability of the proposed algorithm regardless of the image characteristics, noise statistics and intensity. We illustrate the properties of the hyperparameters and their respective effects on the denoising performance, together with automated rules of selecting their values close to the optimal one in experimental setups with ground truth not available. Finally, we show the ability of our approach to deal with practical images denoising problems such as medical ultrasound image despeckling applications.

点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】 Point2Cyl: Reverse Engineering 3D Objects from Point Clouds to Extrusion Cylinders 标题:Point2Cyl:将三维对象从点云逆向工程到拉伸圆柱体 链接:https://arxiv.org/abs/2112.09329

作者:Mikaela Angelina Uy,Yen-yu Chang,Minhyuk Sung,Purvi Goel,Joseph Lambourne,Tolga Birdal,Leonidas Guibas 机构:Stanford University, KAIST, Autodesk Research 摘要:我们提出了Point2Cyl,一种将原始3D点云转换为一组拉伸圆柱体的有监督网络。从原始几何到CAD模型的逆向工程是一项基本任务,它使3D数据能够在形状编辑软件中被操作,从而扩展其在许多下游应用中的用途。特别地,由一系列拉伸圆柱体(二维草图加上拉伸轴和拉伸范围)及其布尔组合构成的CAD模型形式,不仅在CAD社区/软件中被广泛使用,而且与仅有有限图元类型(如平面、球体和圆柱体)相比,具有很强的形状表达能力。在这项工作中,我们介绍了一种神经网络,它通过首先学习底层的几何代理量,以几何为基础的方式求解拉伸圆柱体分解问题。具体来说,我们的方法先预测逐点分割、底面/侧面(base/barrel)标签和法线,然后以可微的闭式公式估计底层拉伸参数。实验表明,我们的方法在两个最新的CAD数据集Fusion Gallery和DeepCAD上表现最佳;我们还进一步展示了该方法在逆向工程和编辑方面的应用。 摘要:We propose Point2Cyl, a supervised network transforming a raw 3D point cloud to a set of extrusion cylinders. Reverse engineering from a raw geometry to a CAD model is an essential task to enable manipulation of the 3D data in shape editing software and thus expand their usages in many downstream applications. Particularly, the form of CAD models having a sequence of extrusion cylinders -- a 2D sketch plus an extrusion axis and range -- and their boolean combinations is not only widely used in the CAD community/software but also has great expressivity of shapes, compared to having limited types of primitives (e.g., planes, spheres, and cylinders). In this work, we introduce a neural network that solves the extrusion cylinder decomposition problem in a geometry-grounded way by first learning underlying geometric proxies. Precisely, our approach first predicts per-point segmentation, base/barrel labels and normals, then estimates for the underlying extrusion parameters in differentiable and closed-form formulations. Our experiments show that our approach demonstrates the best performance on two recent CAD datasets, Fusion Gallery and DeepCAD, and we further showcase our approach on reverse engineering and editing.

多模态(1篇)

【1】 Logically at the Factify 2022: Multimodal Fact Verification 标题:Logically在Factify 2022挑战赛:多模态事实验证 链接:https://arxiv.org/abs/2112.09253

作者:Jie Gao,Hella-Franziska Hoffmann,Stylianos Oikonomou,David Kiskovski,Anil Bandhakavi 机构:Brookfoot Mills, Brookfoot Industrial Estate, Brighouse, HD,RW, United Kingdom 备注:Accepted in AAAI'22: First Workshop on Multimodal Fact-Checking and Hate Speech Detection, February 22 - March 1, 2022, Vancouver, BC, Canada 摘要:本文描述了我们参加AAAI 2022多模态事实验证(Factify)挑战赛的系统。尽管基于文本的验证技术和跨视觉与语言的大型预训练多模态模型近来取得了进展,但将多模态技术应用于自动化事实核查过程的工作仍非常有限,尤其是考虑到社交媒体上关于图像和视频的主张与假新闻日益普遍。在我们的工作中,该挑战被视为多模态蕴涵任务,并构造为多类分类问题。我们提出并探索了两种基线方法,包括一个集成模型(组合两个单模态模型)和一个多模态注意力网络(对主张与证据文档中图文对之间的交互进行建模)。我们进行了多组实验,考察并基准测试了不同的SoTA预训练Transformer和视觉模型。我们的最佳模型在排行榜中排名第一,在验证集和测试集上的加权平均F值均为0.77。我们还对Factify数据集进行了探索性分析,揭示了支撑我们假设的显著模式和问题(例如词语重叠、视觉蕴涵相关性、来源偏差)。最后,我们强调了该任务和多模态数据集在未来研究中面临的挑战。 摘要:This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022. Despite the recent advance in text based verification techniques and large pre-trained multimodal models cross vision and language, very limited work has been done in applying multimodal techniques to automate fact checking process, particularly considering the increasing prevalence of claims and fake news about images and videos on social media. In our work, the challenge is treated as multimodal entailment task and framed as multi-class classification. Two baseline approaches are proposed and explored including an ensemble model (combining two uni-modal models) and a multi-modal attention network (modeling the interaction between image and text pair from claim and evidence document). We conduct several experiments investigating and benchmarking different SoTA pre-trained transformers and vision models in this work. Our best model is ranked first in leaderboard which obtains a weighted average F-measure of 0.77 on both validation and test set. Exploratory analysis of dataset is also carried out on the Factify data set and uncovers salient patterns and issues (e.g., word overlapping, visual entailment correlation, source bias) that motivates our hypothesis. Finally, we highlight challenges of the task and multimodal dataset for future research.

3D|3D重建等相关(1篇)

【1】 The Wanderings of Odysseus in 3D Scenes 标题:奥德修斯在3D场景中的漫游 链接:https://arxiv.org/abs/2112.09251

作者:Yan Zhang,Siyu Tang 机构:ETH Zürich 备注:11 pages main text + 2 page appendix 摘要:我们的目标是填充数字环境,使其中的数字人拥有不同的身体形状、持续移动,并具有合理的身体-场景接触。核心挑战是为各类3D人体生成逼真、可控且无限长的运动。为此,我们提出经由体表标记的生成式运动基元,简称GAMMA。在我们的方案中,我们将长期运动分解为运动基元的时间序列。我们利用体表标记和条件变分自编码器对每个运动基元建模,并通过递归地执行生成模型来生成长期运动。为了控制运动以到达目标,我们应用策略网络来探索模型的潜在空间,并在测试期间使用基于树的搜索来保持运动质量。实验表明,与最先进的数据驱动方法相比,我们的方法能够生成更逼真、更可控的运动。结合传统的路径查找算法,生成的人体可以在场景中长时间、长距离地逼真移动。代码将在以下网址发布以供研究之用:\url{https://yz-cnsdqz.github.io/eigenmotion/GAMMA/} 摘要:Our goal is to populate digital environments, in which the digital humans have diverse body shapes, move perpetually, and have plausible body-scene contact. The core challenge is to generate realistic, controllable, and infinitely long motions for diverse 3D bodies. To this end, we propose generative motion primitives via body surface markers, shortened as GAMMA. In our solution, we decompose the long-term motion into a time sequence of motion primitives. We exploit body surface markers and conditional variational autoencoder to model each motion primitive, and generate long-term motion by implementing the generative model recursively. To control the motion to reach a goal, we apply a policy network to explore the model latent space, and use a tree-based search to preserve the motion quality during testing. Experiments show that our method can produce more realistic and controllable motion than state-of-the-art data-driven method. With conventional path-finding algorithms, the generated human bodies can realistically move long distances for a long period of time in the scene. Code will be released for research purposes at: \url{https://yz-cnsdqz.github.io/eigenmotion/GAMMA/}

其他神经网络|深度学习|模型|建模(4篇)

【1】 Visual Microfossil Identification via Deep Metric Learning 标题:基于深度度量学习的视觉微化石识别 链接:https://arxiv.org/abs/2112.09490

作者:Tayfun Karaderi,Tilo Burghardt,Allison Y. Hsiang,Jacob Ramaer,Daniela N. Schmidt 机构:Department of Computer Science, University of Bristol, UK, Institutionen för geologiska vetenskaper, Stockholm University, Sweden, School of Earth Sciences, University of Bristol, UK 备注:12 pages, 5 figures 摘要:我们首次将深度度量学习应用于在显微图像上对浮游有孔虫壳进行分类的问题。这项物种识别任务是重建过去气候的重要信息来源和科学支柱。文献中的所有有孔虫CNN识别管道都产生了黑盒分类器,人类专家缺乏可视化选项,无法应用于开放集问题。在这里,我们根据这些管道对度量学习进行基准测试,产生了第一个表型浮游有孔虫形态空间的科学可视化,并证明度量学习可用于对训练期间看不见的物种进行聚类。我们表明,度量学习在这一领域的性能超过了所有已发表的基于CNN的最新基准。我们在由35种现代浮游有孔虫组成的无尽有孔虫公共图书馆的34640张专家注释图像上评估了我们的方法。我们对该数据的结果显示,在保留的测试数据上复制专家标签的准确率为92%(F1分数为0.84),在训练中从未遇到过物种聚类的情况下,准确率为66.5%(F1分数为0.70)。我们的结论是,度量学习在这一领域是非常有效的,并且是微化石鉴定专家在环自动化的一个重要工具。关键代码、网络权重和数据分割与本文一起发布,以实现完全再现性。 摘要:We apply deep metric learning for the first time to the problem of classifying planktic foraminifer shells on microscopic images. This species recognition task is an important information source and scientific pillar for reconstructing past climates. All foraminifer CNN recognition pipelines in the literature produce black-box classifiers that lack visualisation options for human experts and cannot be applied to open set problems. Here, we benchmark metric learning against these pipelines, produce the first scientific visualisation of the phenotypic planktic foraminifer morphology space, and demonstrate that metric learning can be used to cluster species unseen during training. We show that metric learning outperforms all published CNN-based state-of-the-art benchmarks in this domain. We evaluate our approach on the 34,640 expert-annotated images of the Endless Forams public library of 35 modern planktic foraminifera species. Our results on this data show leading 92% accuracy (at 0.84 F1-score) in reproducing expert labels on withheld test data, and 66.5% accuracy (at 0.70 F1-score) when clustering species never encountered in training. We conclude that metric learning is highly effective for this domain and serves as an important tool towards expert-in-the-loop automation of microfossil identification. Key code, network weights, and data splits are published with this paper for full reproducibility.
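
深度度量学习通常通过三元组等损失,把同类(同一物种)图像在嵌入空间中拉近、异类拉远;摘要未给出论文所用的具体损失形式,下面仅以PyTorch内置的三元组损失示意这一范式(嵌入网络为占位):

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128))  # 占位嵌入网络
triplet = nn.TripletMarginLoss(margin=1.0)

# anchor 与 positive 为同一物种的两张图, negative 来自另一物种
anchor, positive, negative = (torch.randn(8, 1, 64, 64) for _ in range(3))
loss = triplet(embed(anchor), embed(positive), embed(negative))
```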

【2】 Interpreting Audiograms with Multi-stage Neural Networks 标题:用多级神经网络解释听力图 链接:https://arxiv.org/abs/2112.09357

作者:Shufan Li,Congxi Lu,Linkai Li,Jirong Duan,Xinping Fu,Haoshuai Zhou 机构:Orka Labs Inc., ENT&Audiology Center, Xinhua Hospital∗, &Punan Hospital, Ear Institute, University College London 备注:12 pages, 12 figures. The code for this project is available at this https URL 摘要:听力图是一种特殊类型的折线图,表示个人在不同频率下的听力水平。听力专家使用它们来诊断听力损失,并进一步为客户选择和调试合适的助听器。已有若干项目(如Autoaudio)旨在通过机器学习加速这一过程。但现有的所有模型充其量只能检测图像中的听力图,并将其分为大致类别;它们无法通过解释标记、坐标轴和线条,从检测到的听力图中提取听力水平信息。为了解决这一问题,我们提出了一个多阶段听力图解释网络(MAIN),可直接从听力图照片中读取听力水平数据。我们还构建了Open Audiogram,即一个带有标记和坐标轴注释的开放听力图图像数据集,并在其上训练和评估了我们提出的模型。实验表明,该模型可行且可靠。 摘要:Audiograms are a particular type of line charts representing individuals' hearing level at various frequencies. They are used by audiologists to diagnose hearing loss, and further select and tune appropriate hearing aids for customers. There have been several projects such as Autoaudio that aim to accelerate this process through means of machine learning. But all existing models at their best can only detect audiograms in images and classify them into general categories. They are unable to extract hearing level information from detected audiograms by interpreting the marks, axis, and lines. To address this issue, we propose a Multi-stage Audiogram Interpretation Network (MAIN) that directly reads hearing level data from photos of audiograms. We also established Open Audiogram, an open dataset of audiogram images with annotations of marks and axes on which we trained and evaluated our proposed model. Experiments show that our model is feasible and reliable.

【3】 Procedural Kernel Networks 标题:过程核网络 链接:https://arxiv.org/abs/2112.09318

作者:Bartlomiej Wronski 机构:Google Research 备注:11 pages, technical report 摘要:在过去的十年中,卷积神经网络(CNN)定义了许多底层图像处理和恢复任务(如去噪、去马赛克、放大或修复)的最新技术水平。然而,设备端的移动摄影仍由传统图像处理技术主导,大多只使用简单的机器学习技术,或将神经网络处理局限于生成低分辨率的掩模。CNN的高计算和内存需求、移动设备有限的处理能力和散热约束,再加上较大的输出图像分辨率(通常为8-12 MPix),阻碍了其更广泛的应用。在这项工作中,我们介绍了过程核网络(PKNs),这是一族用于生成图像滤波核参数或其他传统算法参数的机器学习模型。一个轻量级CNN以较低分辨率处理输入图像,与其他基于核的机器学习方法相比带来显著加速,并使新的应用成为可能。该架构是端到端学习的,特别适合广泛的底层图像处理任务,并能提升许多传统算法的性能。我们还描述了该框架如何统一此前将机器学习应用于常见图像恢复任务的一些工作。 摘要:In the last decade Convolutional Neural Networks (CNNs) have defined the state of the art for many low level image processing and restoration tasks such as denoising, demosaicking, upscaling, or inpainting. However, on-device mobile photography is still dominated by traditional image processing techniques, and uses mostly simple machine learning techniques or limits the neural network processing to producing low resolution masks. High computational and memory requirements of CNNs, limited processing power and thermal constraints of mobile devices, combined with large output image resolutions (typically 8--12 MPix) prevent their wider application. In this work, we introduce Procedural Kernel Networks (PKNs), a family of machine learning models which generate parameters of image filter kernels or other traditional algorithms. A lightweight CNN processes the input image at a lower resolution, which yields a significant speedup compared to other kernel-based machine learning methods and allows for new applications. The architecture is learned end-to-end and is especially well suited for a wide range of low-level image processing tasks, where it improves the performance of many traditional algorithms. We also describe how this framework unifies some previous work applying machine learning for common image restoration tasks.
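
"由轻量CNN预测滤波核参数、再作用于图像"这类核预测方法的关键一步,是对每个像素应用其专属的k×k核。下面用PyTorch的unfold给出该步骤的示意(核张量此处随机占位,真实系统中由低分辨率CNN预测后上采样得到;softmax归一化为本示意的选择,并非论文规定):

```python
import torch
import torch.nn.functional as F

def apply_per_pixel_kernels(image, kernels):
    """对每个像素应用其专属的 k×k 滤波核(示意)。

    image: (B, 1, H, W);kernels: (B, k*k, H, W)。
    """
    b, _, h, w = image.shape
    k = int(kernels.size(1) ** 0.5)
    patches = F.unfold(image, k, padding=k // 2)            # (B, k*k, H*W)
    weights = kernels.softmax(dim=1).view(b, k * k, h * w)  # 归一化核
    out = (patches * weights).sum(dim=1)                    # 逐像素加权求和
    return out.view(b, 1, h, w)

img = torch.rand(1, 1, 32, 32)
ker = torch.rand(1, 9, 32, 32)   # 占位:应由轻量CNN预测
smoothed = apply_per_pixel_kernels(img, ker)
```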

【4】 An Empirical Investigation of the Role of Pre-training in Lifelong Learning 标题:预训练在终身学习中作用的实证研究 链接:https://arxiv.org/abs/2112.09153

作者:Sanket Vaibhav Mehta,Darshan Patil,Sarath Chandar,Emma Strubell 机构:Carnegie Mellon University,Mila - Quebec AI Institute,University of Montreal, École Polytechnique de Montréal,Canada CIFAR AI Chair 备注:30 pages 摘要:机器学习中的终身学习范式是对更为主流的孤立学习范式的一种有吸引力的替代,这不仅因为它与生物学习相似,还因为它可以通过避免过度的模型再训练来减少能源浪费。这一范式的一个关键挑战是灾难性遗忘现象。随着预训练模型在机器学习中日益普及和成功,我们提出一个问题:预训练在终身学习中扮演什么角色,特别是在灾难性遗忘方面?我们在大型预训练模型的背景下研究现有方法,并评估它们在各种文本和图像分类任务中的性能,包括使用一个由15种不同NLP任务构成的新数据集进行的大规模研究。在所有设置中,我们观察到,与随机初始化的模型相比,通用预训练在顺序学习多个任务时隐式地减轻了灾难性遗忘的影响。随后我们进一步研究为什么预训练在这种情形下可以减轻遗忘。我们通过分析损失景观(loss landscape)来研究这一现象,发现预训练权重似乎通过导向更宽的极小值来缓解遗忘。基于这一见解,我们建议联合优化当前任务损失和损失盆地锐度,以便在顺序微调期间显式地鼓励更宽的盆地。我们表明,这种优化方法可以在多种设置下、跨任务序列的持续学习中取得与最先进方法相当的性能,且无需保留随任务数量增长的记忆库。 摘要:The lifelong learning paradigm in machine learning is an attractive alternative to the more prominent isolated learning scheme not only due to its resemblance to biological learning, but also its potential to reduce energy waste by obviating excessive model re-training. A key challenge to this paradigm is the phenomenon of catastrophic forgetting. With the increasing popularity and success of pre-trained models in machine learning, we pose the question: What role does pre-training play in lifelong learning, specifically with respect to catastrophic forgetting? We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks, including a large-scale study using a novel dataset of 15 diverse NLP tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness in order to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach leads to performance comparable to the state-of-the-art in task-sequential continual learning across multiple settings, without retaining a memory that scales in size with the number of tasks.
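
"联合优化任务损失与损失盆地锐度"的一种常见实现思路与锐度感知最小化(SAM)类似:先沿梯度方向对权重施加小扰动,再用扰动点的损失梯度更新原权重,从而偏好更宽的极小值。下面给出这一思想的最小示意(并非论文的确切公式;扰动半径rho为假设值):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ce, rho = nn.CrossEntropyLoss(), 0.05      # rho 为假设的扰动半径

def sharpness_aware_step(x, y):
    ce(model(x), y).backward()             # 第一步:求原始梯度
    eps = []
    with torch.no_grad():
        norm = torch.norm(torch.stack([p.grad.norm()
                                       for p in model.parameters()]))
        for p in model.parameters():       # 沿梯度方向扰动权重
            e = rho * p.grad / (norm + 1e-12)
            p.add_(e)
            eps.append(e)
    opt.zero_grad()
    ce(model(x), y).backward()             # 第二步:扰动点的损失反映锐度
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                      # 恢复原权重
    opt.step()                             # 用扰动点梯度更新
    opt.zero_grad()

sharpness_aware_step(torch.randn(16, 10), torch.randint(0, 2, (16,)))
```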

其他(19篇)

【1】 Light Field Neural Rendering 标题:光场神经绘制 链接:https://arxiv.org/abs/2112.09687

作者:Mohammed Suhail,Carlos Esteves,Leonid Sigal,Ameesh Makadia 机构:University of British Columbia, Vector Institute for AI, Canada CIFAR AI Chair, Google 备注:Project page with code and videos at this https URL 摘要:用于新视图合成的经典光场渲染可以准确再现与视图相关的效果,如反射、折射和半透明,但需要对场景进行密集的视图采样。基于几何重建的方法只需要稀疏视图,但不能精确建模非朗伯效应。我们引入了一个结合这两个方向优点并缓解其局限性的模型。通过在光场的四维表示上进行操作,我们的模型能够准确表示依赖于视图的效果;通过在训练和推理期间施加几何约束,场景几何可从稀疏视图集中隐式学习。具体来说,我们介绍了一种两阶段的基于Transformer的模型:先沿极线聚合特征,再沿参考视图聚合特征,从而生成目标光线的颜色。我们的模型在多个前向(forward-facing)和360°数据集上优于最新技术,在视图依赖性变化剧烈的场景上优势更为明显。 摘要:Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene. Methods based on geometric reconstruction need only sparse views, but cannot accurately model non-Lambertian effects. We introduce a model that combines the strengths and mitigates the limitations of these two directions. By operating on a four-dimensional representation of the light field, our model learns to represent view-dependent effects accurately. By enforcing geometric constraints during training and inference, the scene geometry is implicitly learned from a sparse set of views. Concretely, we introduce a two-stage transformer-based model that first aggregates features along epipolar lines, then aggregates features along reference views to produce the color of a target ray. Our model outperforms the state-of-the-art on multiple forward-facing and 360° datasets, with larger margins on scenes with severe view-dependent variations.

【2】 AI-Assisted Verification of Biometric Data Collection 标题:生物特征数据采集的人工智能辅助验证 链接:https://arxiv.org/abs/2112.09660

作者:Ryan Lindsey 机构:University of Kentucky 摘要:从视频流中识别动作是一项难以自动化的任务,在较旧的硬件上尤其如此。该项目有两个目标:一是从安卓手机前置摄像头的画面中识别动作,二是支持尽可能多的手机和安卓版本。这限制了我们只能使用足够小、在有无GPU的手机上都能运行的模型,并且只能使用摄像头画面来识别动作。在本文中,我们使用在自定义数据集上训练的模型,比较了YOLO架构在不同设备(有无专用GPU)上的性能。我们还讨论了在有限硬件上从视频中识别人脸和动作的局限性。 摘要:Recognizing actions from a video feed is a challenging task to automate, especially so on older hardware. There are two aims for this project: one is to recognize an action from the front-facing camera on an Android phone, the other is to support as many phones and Android versions as possible. This limits us to using models that are small enough to run on mobile phones with and without GPUs, and only using the camera feed to recognize the action. In this paper we compare performance of the YOLO architecture across devices (with and without dedicated GPUs) using models trained on a custom dataset. We also discuss limitations in recognizing faces and actions from video on limited hardware.

【3】 Improving neural implicit surfaces geometry with patch warping 标题:用面片翘曲改进神经隐式曲面几何 链接:https://arxiv.org/abs/2112.09648

作者:François Darmon,Bénédicte Bascle,Jean-Clément Devaux,Pascal Monasse,Mathieu Aubry 机构:Thales LAS France, LIGM (UMR), École des Ponts, Univ. Gustave Eiffel, CNRS, Marne-la-Vallée, France 备注:Project webpage: this http url this http URL 摘要:神经隐式曲面已成为多视图三维重建的重要技术,但其精度仍然有限。在本文中,我们认为这源于神经网络难以学习和渲染高频纹理。因此,我们建议在标准的神经渲染优化中加入一个跨不同视图的直接光度一致性项。直观地说,我们优化隐式几何,使各视图之间能够以一致的方式相互扭曲(warp)。我们证明了两个要素是这种方法成功的关键:(i)利用沿每条光线的3D点的预测占据值和法线来扭曲整个面片,并用鲁棒的结构相似性(SSIM)度量其相似度;(ii)在处理可见性和遮挡时降低错误扭曲的权重,同时鼓励尽可能完整的重建。我们在标准的DTU和EPFL基准上评估了我们称为NeuralWarp的方法,结果表明它在两个数据集上都比最先进的无监督隐式曲面重建方法高出20%以上。 摘要:Neural implicit surfaces have become an important technique for multi-view 3D reconstruction but their accuracy remains limited. In this paper, we argue that this comes from the difficulty to learn and render high frequency textures with neural networks. We thus propose to add to the standard neural rendering optimization a direct photo-consistency term across the different views. Intuitively, we optimize the implicit geometry so that it warps views on each other in a consistent way. We demonstrate that two elements are key to the success of such an approach: (i) warping entire patches, using the predicted occupancy and normals of the 3D points along each ray, and measuring their similarity with a robust structural similarity (SSIM); (ii) handling visibility and occlusion in such a way that incorrect warps are not given too much importance while encouraging a reconstruction as complete as possible. We evaluate our approach, dubbed NeuralWarp, on the standard DTU and EPFL benchmarks and show it outperforms state of the art unsupervised implicit surfaces reconstructions by over 20% on both datasets.
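
摘要中用SSIM度量参考面片与其在其他视图中的扭曲结果之间的光度一致性。下面给出局部窗口SSIM的一个简化实现(PyTorch;窗口大小与常数为常用假设值,局部均值用平均池化近似),1 - SSIM即可作为一致性损失项:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window=7, c1=0.01 ** 2, c2=0.03 ** 2):
    """简化的局部SSIM;x, y 为 (B, 1, H, W),取值范围 [0, 1]。"""
    pool = lambda z: F.avg_pool2d(z, window, 1, window // 2)
    mu_x, mu_y = pool(x), pool(y)
    var_x = pool(x * x) - mu_x ** 2
    var_y = pool(y * y) - mu_y ** 2
    cov = pool(x * y) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

ref, warped = torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)
consistency_loss = 1 - ssim(ref, warped)
```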

【4】 Global explainability in aligned image modalities 标题:对齐图像模态的全局可解释性 链接:https://arxiv.org/abs/2112.09591

作者:Justin Engelmann,Amos Storkey,Miguel O. Bernabeu 机构:UKRI CDT Biomedical AI, University of Edinburgh, School of Informatics, Usher Institute 摘要:深度学习(DL)模型在许多计算机视觉问题上非常有效,并越来越多地用于关键应用,但它们本质上是黑盒。已有许多方法可以生成单张图像的解释,使从业者能够理解和验证模型对给定图像的预测。除此之外,还需要验证DL模型在总体上是否以合理的方式工作,即与领域知识一致,且不依赖不良的数据伪影。为此,需要对模型进行全局解释。在这项工作中,我们关注自然对齐的图像模态,即每个像素位置代表成像对象上相近的相对位置,这在医学成像中很常见。我们提出对单张图像解释做逐像素聚合,作为获得标签级和整体全局解释的简单方法。这些解释随后可用于模型验证、知识发现,并作为传达从检查单张图像解释中得出的定性结论的有效方式。我们进一步提出渐进擦除加渐进恢复(PEPPR)方法,以定量验证这些全局解释是否忠实于模型的预测方式。然后,我们将这些方法应用于超广角视网膜图像这一自然对齐的模态。我们发现,全局解释与领域知识一致,并忠实地反映了模型的工作方式。 摘要:Deep learning (DL) models are very effective on many computer vision problems and increasingly used in critical applications. They are also inherently black box. A number of methods exist to generate image-wise explanations that allow practitioners to understand and verify model predictions for a given image. Beyond that, it would be desirable to validate that a DL model generally works in a sensible way, i.e. consistent with domain knowledge and not relying on undesirable data artefacts. For this purpose, the model needs to be explained globally. In this work, we focus on image modalities that are naturally aligned such that each pixel position represents a similar relative position on the imaged object, as is common in medical imaging. We propose the pixel-wise aggregation of image-wise explanations as a simple method to obtain label-wise and overall global explanations. These can then be used for model validation, knowledge discovery, and as an efficient way to communicate qualitative conclusions drawn from inspecting image-wise explanations. We further propose Progressive Erasing Plus Progressive Restoration (PEPPR) as a method to quantitatively validate that these global explanations are faithful to how the model makes its predictions. We then apply these methods to ultra-widefield retinal images, a naturally aligned modality. We find that the global explanations are consistent with domain knowledge and faithfully reflect the model's workings.
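
由于图像天然对齐,"逐像素聚合单张图像解释"就是在同一像素位置对所有(或同一标签的)显著性图取平均,得到全局与标签级解释。示意如下(NumPy;显著性图以随机数占位):

```python
import numpy as np

# 假设已为 N 张对齐图像各生成一张 (H, W) 显著性图
saliency_maps = np.random.rand(100, 128, 128)
labels = np.random.randint(0, 2, size=100)

global_expl = saliency_maps.mean(axis=0)               # 全局解释
label_expl = {c: saliency_maps[labels == c].mean(axis=0)
              for c in np.unique(labels)}              # 标签级解释
```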

【5】 Nearest neighbor search with compact codes: A decoder perspective (Link: https://arxiv.org/abs/2112.09568)

Authors: Kenza Amara, Matthijs Douze, Alexandre Sablayrolles, Hervé Jégou
Affiliations: Facebook AI
Abstract: Modern approaches for fast retrieval of similar vectors on billion-scale datasets rely on compressed-domain approaches such as binary sketches or product quantization. These methods minimize a certain loss, typically the mean squared error or another objective function tailored to the retrieval problem. In this paper, we re-interpret popular methods such as binary hashing or product quantizers as auto-encoders, and point out that they implicitly make suboptimal assumptions about the form of the decoder. We design backward-compatible decoders that improve the reconstruction of the vectors from the same codes, which translates into better performance in nearest neighbor search. Our method significantly improves over binary hashing methods and product quantization on popular benchmarks.
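
For context, classical product quantization already fits this auto-encoder reading: `encode` compresses each sub-vector to a codeword index and `decode` reconstructs from the codebooks. A minimal sketch of that baseline (scikit-learn k-means; this is the conventional decoder the paper improves upon, not its proposed one):

```python
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    def __init__(self, n_subvectors=4, n_codewords=256):
        self.m, self.k = n_subvectors, n_codewords

    def fit(self, X):
        # One k-means codebook per sub-vector (dimension assumed divisible).
        self.d = X.shape[1] // self.m
        self.books = [KMeans(n_clusters=self.k, n_init=4).fit(
            X[:, i * self.d:(i + 1) * self.d]) for i in range(self.m)]
        return self

    def encode(self, X):
        # Compact code: one codeword index per sub-vector.
        return np.stack([b.predict(X[:, i * self.d:(i + 1) * self.d])
                         for i, b in enumerate(self.books)], axis=1)

    def decode(self, codes):
        # "Decoder": concatenate the selected centroids.
        return np.hstack([self.books[i].cluster_centers_[codes[:, i]]
                          for i in range(self.m)])
```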

【6】 LTB curves with Lipschitz turn are par-regular (Link: https://arxiv.org/abs/2112.09567)

Authors: Etienne Le Quentrec, Loïc Mazo, Étienne Baudrier, Mohamed Tajine
Abstract: Preserving the topology during a digitization process is a requirement of first importance. To this end, it is classical in Digital Geometry to assume the shape borders to be par-regular. Par-regularity was proved to be equivalent to having positive reach or to belonging to the class C^{1,1} of curves with Lipschitz derivative. Recently, we proposed to use a larger class that encompasses polygons with obtuse angles: the locally turn-bounded curves. The aim of this technical report is to define the class of par-regular curves inside the class of locally turn-bounded curves using only the notion of turn, that is, of integral curvature. To be more precise, in a previous article we already proved that par-regular curves are locally turn-bounded. Incidentally, this proof led us to show that the turn of par-regular curves is a Lipschitz function of their length. We call the class of curves verifying this latter property curves with Lipschitz turn. In this technical report, we prove the converse assertion: locally turn-bounded curves with Lipschitz turn are par-regular. The equivalence is stated in Theorem 3.1 and the converse assertion is proved in Lemma 3.2. In Section 1, we recall the definition of par-regularity and, equivalently, of sets with positive reach. In Section 2, we present the notions of locally turn-bounded curves and of curves with Lipschitz turn. Throughout this latter section, some intermediate steps (Lemmas 2.3 and 2.11) are proved just after the introduction of their related notions. The last section (Section 3) is dedicated to the proof of the equivalence of the notions.
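
In symbols, one plausible formalization of the "Lipschitz turn" property (our notation, inferred from the abstract, not the report's):

```latex
% A curve \gamma, parameterized by arclength, has Lipschitz turn if its turn
% (integral curvature) grows at most linearly with length, for some K > 0:
\kappa\bigl(\gamma|_{[s,t]}\bigr) \;\le\; K\,|t - s| \qquad \text{for all } s, t.
% The report's main result (Theorem 3.1) then reads:
\gamma \text{ is par-regular} \iff
\gamma \text{ is locally turn-bounded and has Lipschitz turn.}
```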

【7】 Complex Functional Maps: a Conformal Link Between Tangent Bundles (Link: https://arxiv.org/abs/2112.09546)

Authors: Nicolas Donati, Etienne Corman, Simone Melzi, Maks Ovsjanikov
Affiliations: LIX, École Polytechnique, IP Paris; Université de Lorraine, CNRS, Inria, LORIA, France; Sapienza University of Rome
Abstract: In this paper, we introduce complex functional maps, which extend the functional map framework to conformal maps between tangent vector fields on surfaces. A key property of these maps is their orientation awareness. More specifically, we demonstrate that unlike regular functional maps that link functional spaces of two manifolds, our complex functional maps establish a link between oriented tangent bundles, thus permitting robust and efficient transfer of tangent vector fields. By first endowing and then exploiting the tangent bundle of each shape with a complex structure, the resulting operations become naturally orientation-aware, thus favoring orientation- and angle-preserving correspondence across shapes, without relying on descriptors or extra regularization. Finally, and perhaps more importantly, we demonstrate how these objects enable several practical applications within the functional map framework. We show that functional maps and their complex counterparts can be estimated jointly to promote orientation preservation, regularizing pipelines that previously suffered from orientation-reversing symmetry errors.

【8】 Symmetry-aware Neural Architecture for Embodied Visual Navigation (Link: https://arxiv.org/abs/2112.09515)

Authors: Shuang Liu, Takayuki Okatani
Affiliations: Tohoku University; RIKEN Center for AIP
Abstract: Visual exploration is a task that seeks to visit all the navigable areas of an environment as quickly as possible. Existing methods employ deep reinforcement learning (RL) as the standard tool for the task. However, they tend to be vulnerable to statistical shifts between the training and test data, resulting in poor generalization over novel environments that are out-of-distribution (OOD) from the training data. In this paper, we attempt to improve the generalization ability by utilizing the inductive biases available for the task. Employing active neural SLAM (ANS), which learns exploration policies with the advantage actor-critic (A2C) method, as the base framework, we first point out that the mappings represented by the actor and the critic should satisfy specific symmetries. We then propose a network design for the actor and the critic to inherently attain these symmetries. Specifically, we use $G$-convolution instead of the standard convolution and insert the semi-global polar pooling (SGPP) layer, which we newly design in this study, in the last section of the critic network. Experimental results show that our method increases area coverage by $8.1 m^2$ when trained on the Gibson dataset and tested on the MP3D dataset, establishing a new state of the art.
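
The abstract does not detail the $G$-convolution; as a toy illustration of the underlying idea, a lifting convolution for the planar rotation group C4 can be written as follows (Python/SciPy; the SGPP layer is not reproduced here):

```python
import numpy as np
from scipy.signal import correlate2d

def c4_lifting_conv(image, kernel):
    # Correlate the image with the kernel rotated by 0/90/180/270 degrees,
    # giving one response map per group element (output shape (4, H, W)).
    # Rotating the input permutes these maps, which is the equivariance
    # property that G-convolutions generalize beyond C4.
    return np.stack([correlate2d(image, np.rot90(kernel, k), mode="same")
                     for k in range(4)])
```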

【9】 Adaptively Customizing Activation Functions for Various Layers (Link: https://arxiv.org/abs/2112.09442)

Authors: Haigen Hu, Aizhu Liu, Qiu Guan, Xiaoxin Li, Shengyong Chen, Qianwei Zhou
Affiliations: College of Computer Science and Technology, Zhejiang University of Technology; Key Laboratory of Visual Media Intelligent Processing Technology of Zhejiang Province; School of Computer Science and Engineering, Tianjin University of Technology
Abstract: To enhance the nonlinearity of neural networks and increase their mapping abilities between the inputs and response variables, activation functions play a crucial role in modeling more complex relationships and patterns in the data. In this work, a novel methodology is proposed to adaptively customize activation functions by adding only very few parameters to traditional activation functions such as Sigmoid, Tanh, and ReLU. To verify the effectiveness of the proposed methodology, some theoretical and experimental analysis on accelerating the convergence and improving the performance is presented, and a series of experiments are conducted based on various network models (such as AlexNet, VGGNet, GoogLeNet, ResNet and DenseNet) and various datasets (such as CIFAR10, CIFAR100, miniImageNet, PASCAL VOC and COCO). To further verify the validity and suitability in various optimization strategies and usage scenarios, comparison experiments are also implemented among different optimization strategies (such as SGD, Momentum, AdaGrad, AdaDelta and ADAM) and different recognition tasks such as classification and detection. The results show that the proposed methodology is very simple but delivers significant gains in convergence speed, precision and generalization, and it can surpass other popular methods like ReLU and adaptive functions like Swish in almost all experiments in terms of overall performance. The code is publicly available at https://github.com/HuHaigen/Adaptively-Customizing-Activation-Functions. The package includes the three proposed adaptive activation functions for reproducibility purposes.
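
The abstract does not spell out the parameterization, but the general recipe, a standard activation augmented with a couple of trainable scalars, might look like this hypothetical sketch:

```python
import numpy as np

def adaptive_tanh(x, a=1.0, b=1.0):
    # `a` (amplitude) and `b` (slope) are illustrative extra parameters,
    # learned jointly with the network weights; a = b = 1 recovers plain tanh.
    return a * np.tanh(b * x)
```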

【10】 A Review on Visual Privacy Preservation Techniques for Active and Assisted Living (Link: https://arxiv.org/abs/2112.09422)

Authors: Siddharth Ravi, Pau Climent-Pérez, Francisco Florez-Revuelta
Affiliations: Department of Computing Technology, University of Alicante, Alicante, Valencian Community, Spain
Abstract: This paper reviews the state of the art in visual privacy protection techniques, with particular attention paid to techniques applicable to the field of active and assisted living (AAL). A novel taxonomy with which state-of-the-art visual privacy protection methods can be classified is introduced. Perceptual obfuscation methods, one category in the taxonomy, are highlighted: these visual privacy preservation techniques are particularly relevant in scenarios that come under video-based AAL monitoring. Obfuscation against machine learning models is also explored. A high-level classification scheme of the different levels of privacy by design is connected to the proposed taxonomy of visual privacy preservation techniques. Finally, we note open questions that exist in the field and introduce the reader to some exciting avenues for future research in the area of visual privacy.

【11】 Disentangled representations: towards interpretation of sex determination from hip bone (Link: https://arxiv.org/abs/2112.09414)

Authors: Kaifeng Zou, Sylvain Faisan, Fabrice Heitz, Marie Epain, Pierre Croisille, Laurent Fanton, Sébastien Valette
Abstract: By highlighting the regions of the input image that contribute the most to the decision, saliency maps have become a popular method for making neural networks interpretable. In medical imaging, they are particularly well-suited to explain neural networks in the context of abnormality localization. However, in our experiments, they are less suited to classification problems where the features that allow distinguishing between the different classes are spatially correlated, scattered, and definitely non-trivial. In this paper we thus propose a new paradigm for better interpretability. To this end, we provide the user with relevant and easily interpretable information so that they can form their own opinion. We use disentangled variational auto-encoders whose latent representation is divided into two components: the non-interpretable part and the disentangled part. The latter accounts for the categorical variables explicitly representing the different classes of interest. In addition to providing the class of a given input sample, such a model offers the possibility to transform the sample from a given class into a sample of another class, by modifying the value of the categorical variables in the latent representation. This paves the way to easier interpretation of class differences. We illustrate the relevance of this approach in the context of automatic sex determination from hip bones in forensic medicine. The features encoded by the model that distinguish the different classes were found to be consistent with expert knowledge.
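
The class-transfer mechanism can be sketched as follows (Python/NumPy; `encoder` and `decoder` are hypothetical callables matching the latent split described above):

```python
import numpy as np

def transfer_class(x, encoder, decoder, target_class, n_classes):
    # Keep the non-interpretable latent part, swap the categorical block
    # for the target class's one-hot code, and decode: the sample is
    # transformed into a sample of the other class.
    z_free, _ = encoder(x)
    z_cat = np.eye(n_classes)[target_class]
    return decoder(np.concatenate([z_free, z_cat]))
```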

【12】 ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources (Link: https://arxiv.org/abs/2112.09331)

Authors: Quan Cui, Boyan Zhou, Yu Guo, Weidong Yin, Hao Wu, Osamu Yoshie
Affiliations: ByteDance; Waseda University; Fudan University
Abstract: Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproduction and further exploration. To this end, we explore a stack of simple but effective heuristics, and provide comprehensive training guidance, which allows us to conduct dual-encoder multi-modal representation alignment with limited resources. We provide a reproducible strong baseline of competitive results, namely ZeroVL, with only 14M samples from publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web data for pre-training, and achieve comparable or superior results to state-of-the-art methods, further proving the effectiveness of our method on large-scale data. We hope that this work will provide useful data points and experience for future research in multi-modal pre-training. Our code and pre-trained models will be released to facilitate the research community.
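
For reference, the symmetric contrastive (InfoNCE) objective that this family of dual-encoder methods minimizes can be sketched generically (this is the standard CLIP-style loss, not ZeroVL's specific heuristics):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Row i of each matrix is the embedding of one matched image-text pair.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    loss_i2t = -np.diag(log_softmax(logits)).mean()    # image -> text
    loss_t2i = -np.diag(log_softmax(logits.T)).mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2
```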

【13】 All-photon Polarimetric Time-of-Flight Imaging (Link: https://arxiv.org/abs/2112.09278)

Authors: Seung-Hwan Baek, Felix Heide
Affiliations: Princeton University
Abstract: Time-of-flight (ToF) sensors provide an imaging modality fueling diverse applications, including LiDAR in autonomous driving, robotics, and augmented reality. Conventional ToF imaging methods estimate the depth by sending pulses of light into a scene and measuring the ToF of the first-arriving photons directly reflected from a scene surface without any temporal delay. As such, all photons following this first response are typically considered unwanted noise. In this paper, we depart from the principle of using first-arriving photons and propose an all-photon ToF imaging method by incorporating the temporal-polarimetric analysis of first- and late-arriving photons, which possess rich scene information about its geometry and material. To this end, we propose a novel temporal-polarimetric reflectance model, an efficient capture method, and a reconstruction method that exploits the temporal-polarimetric changes of light reflected by surface and sub-surface reflection. The proposed all-photon polarimetric ToF imaging method allows for acquiring depth, surface normals, and material parameters of a scene by utilizing all photons captured by the system, whereas conventional ToF imaging only obtains coarse depth from the first-arriving photons. We validate our method in simulation and experimentally with a prototype.

【14】 Image Inpainting Using AutoEncoder and Guided Selection of Predicted Pixels (Link: https://arxiv.org/abs/2112.09262)

Authors: Mohammad H. Givkashi, Mahshid Hadipour, Arezoo PariZanganeh, Zahra Nabizadeh, Nader Karimi, Shadrokh Samavi
Affiliations: Dept. of Elect. and Comp. Engineering, Isfahan University of Technology, Isfahan, Iran; Dept. of Elect. and Comp. Engineering, McMaster University, Canada
Notes: 5 pages, 2 figures, 4 tables
Abstract: Image inpainting is an effective method to enhance distorted digital images. Different inpainting methods use the information of neighboring pixels to predict the values of missing pixels. Recently, deep neural networks have been used to learn structural and semantic details of images for inpainting purposes. In this paper, we propose a network for image inpainting. This network, similar to U-Net, extracts various features from images, leading to better results. We improve the final results by replacing the damaged pixels with the recovered pixels of the output images. Our experimental results show that this method produces high-quality results compared to traditional methods.
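
The final replacement step amounts to a masked composite (assuming a boolean `mask` that is True on damaged pixels):

```python
import numpy as np

def composite(damaged, predicted, mask):
    # Keep intact pixels from the input; take only the damaged (masked)
    # pixels from the network's output.
    return np.where(mask, predicted, damaged)
```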

【15】 How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation (Link: https://arxiv.org/abs/2112.09260)

Authors: Akash Umakantha, João D. Semedo, S. Alireza Golestaneh, Wan-Yi S. Lin
Affiliations: Carnegie Mellon University, Pittsburgh, PA; Bosch Center for Artificial Intelligence
Abstract: The Vision Transformer (ViT) architecture has recently achieved competitive performance across a variety of computer vision tasks. One of the motivations behind ViTs is weaker inductive biases, when compared to convolutional neural networks (CNNs). However, this also makes ViTs more difficult to train. They require very large training datasets, heavy regularization, and strong data augmentations. The data augmentation strategies used to train ViTs have largely been inherited from CNN training, despite the significant differences between the two architectures. In this work, we empirically evaluated how different data augmentation strategies performed on CNN (e.g., ResNet) versus ViT architectures for image classification. We introduced a style transfer data augmentation, termed StyleAug, which worked best for training ViTs, while RandAugment and AugMix typically worked best for training CNNs. We also found that, in addition to a classification loss, using a consistency loss between multiple augmentations of the same image was especially helpful when training ViTs.
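
One plausible instantiation of such a consistency loss (the paper's exact formulation is not given in the abstract) compares the predictive distributions of two augmented views:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_a, logits_b):
    # Mean squared distance between the class distributions the model
    # predicts for two augmentations of the same image.
    return np.mean((softmax(logits_a) - softmax(logits_b)) ** 2)
```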

【16】 Sparse Coding with Multi-Layer Decoders using Variance Regularization (Link: https://arxiv.org/abs/2112.09214)

Authors: Katrina Evtimova, Yann LeCun
Affiliations: Center for Data Science, New York University; Courant Institute, New York University; Meta AI Research
Abstract: Sparse coding with an $l_1$ penalty and a learned linear dictionary requires regularization of the dictionary to prevent a collapse in the $l_1$ norms of the codes. Typically, this regularization entails bounding the Euclidean norms of the dictionary's elements. In this work, we propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder. Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold over a set of sparse representations for a given set of inputs. Furthermore, we explore ways to effectively train sparse coding systems with multi-layer decoders, since these can model more complex relationships than linear dictionaries. In our experiments with MNIST and natural image patches, we show that decoders learned with our approach have interpretable features both in the linear and the multi-layer case. Moreover, we show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher-quality reconstructions with sparser representations when compared to autoencoders with linear dictionaries. Additionally, sparse representations obtained with our variance regularization approach are useful in the downstream tasks of denoising and classification in the low-data regime.
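
The variance constraint can be sketched directly from the description (the hinge form and threshold value are our assumptions):

```python
import numpy as np

def variance_regularizer(codes, threshold=1.0):
    # `codes`: batch of sparse codes, shape (batch, latent_dim). Penalize any
    # latent component whose variance over the batch falls below the fixed
    # threshold, preventing the l1 norms of the codes from collapsing
    # without having to regularize the decoder.
    var = codes.var(axis=0)
    return np.mean(np.maximum(0.0, threshold - var))
```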

【17】 Mitigating the Bias of Centered Objects in Common Datasets (Link: https://arxiv.org/abs/2112.09195)

Authors: Gergely Szabo, Andras Horvath
Affiliations: Peter Pazmany Catholic University, Faculty of Information Technology and Bionics, Budapest
Abstract: Convolutional networks are considered shift-invariant, but it has been demonstrated that their response may vary according to the exact location of objects. In this paper we demonstrate that the most commonly investigated datasets have a bias, in which objects are over-represented at the center of the image during training. This bias and the boundary conditions of these networks can have a significant effect on the performance of these architectures: accuracy drops significantly as an object approaches the boundary. We also demonstrate how this effect can be mitigated with data augmentation techniques.
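
One simple augmentation in this spirit is a random translation, e.g. (the paper's exact augmentations are not specified in the abstract):

```python
import numpy as np

def random_shift(image, max_shift=16):
    # Randomly translate the image (circularly, for simplicity) so that
    # objects stop being over-represented at the image center.
    dy, dx = np.random.randint(-max_shift, max_shift + 1, size=2)
    return np.roll(image, (dy, dx), axis=(0, 1))
```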

【18】 Monitoring crop phenology with street-level imagery using computer vision (Link: https://arxiv.org/abs/2112.09190)

Authors: Raphaël d'Andrimont, Momchil Yordanov, Laura Martinez-Sanchez, Marijn van der Velde
Affiliations: European Commission, Joint Research Centre (JRC), Ispra, Italy. ∗ These authors contributed equally to this work.
Notes: 18 pages
Abstract: Street-level imagery holds significant potential to scale up in-situ data collection. This is enabled by combining the use of cheap, high-quality cameras with recent advances in deep learning compute solutions to derive relevant thematic information. We present a framework to collect and extract crop type and phenological information from street-level imagery using computer vision. During the 2018 growing season, high-definition pictures were captured with side-looking action cameras in the Flevoland province of the Netherlands. Each month from March to October, a fixed 200-km route was surveyed, collecting one picture per second and resulting in a total of 400,000 geo-tagged pictures. At 220 specific parcel locations, detailed on-the-spot crop phenology observations were recorded for 17 crop types. Furthermore, the time span included specific pre-emergence parcel stages, such as differently cultivated bare soil for spring and summer crops, as well as post-harvest cultivation practices, e.g. green manuring and catch crops. Classification was done using TensorFlow with a well-known image recognition model based on transfer learning with convolutional neural networks (MobileNet). A hyper-tuning methodology was developed to obtain the best-performing model among 160 models. This best model was applied to an independent inference set, discriminating crop type with a macro F1 score of 88.1% and main phenological stage at 86.9% at the parcel level. Potential and caveats of the approach, along with practical considerations for implementation and improvement, are discussed. The proposed framework speeds up high-quality in-situ data collection and suggests avenues for massive data collection via automated classification using computer vision.
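
A minimal transfer-learning setup of the kind described (an ImageNet-pretrained Keras MobileNet backbone with a 17-way crop-type head) is sketched below; every hyperparameter here is a placeholder, not one of the tuned values from the 160-model search:

```python
import tensorflow as tf

# Frozen MobileNet feature extractor pretrained on ImageNet.
base = tf.keras.applications.MobileNet(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3), pooling="avg")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(17, activation="softmax"),  # 17 crop types
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```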

【19】 Colloquium: Advances in automation of quantum dot devices control (Link: https://arxiv.org/abs/2112.09362)

Authors: Justyna P. Zwolak, Jacob M. Taylor
Affiliations: National Institute of Standards and Technology, Gaithersburg, Maryland, USA; Joint Quantum Institute; Joint Center for Quantum Information and Computer Science, University of Maryland, College Park
Notes: 19 pages, 9 figures
Abstract: Arrays of quantum dots (QDs) are a promising candidate system for realizing scalable coupled-qubit systems and serve as a fundamental building block for quantum computers. In such semiconductor quantum systems, devices now have tens of individual electrostatic and dynamical voltages that must be carefully set to localize the system into the single-electron regime and to realize good qubit operational performance. The mapping of requisite dot locations and charges to gate voltages presents a challenging classical control problem. With an increasing number of QD qubits, the relevant parameter space grows sufficiently to make heuristic control unfeasible. In recent years, there has been a considerable effort to automate device control that combines script-based algorithms with machine learning (ML) techniques. In this Colloquium, we present a comprehensive overview of the recent progress in the automation of QD device control, with a particular emphasis on silicon- and GaAs-based QDs formed in two-dimensional electron gases. Combining physics-based modeling with modern numerical optimization and ML has proven quite effective in yielding efficient, scalable control. Further integration of theoretical, computational, and experimental efforts with computer science and ML holds tremendous potential in advancing semiconductor and other platforms for quantum computing.
