
Computer Vision Academic Digest [10.18]

Author: 公众号-arXiv每日学术速递 (WeChat official account "arXiv Daily Academic Digest")
Published 2021-10-21 16:15:26

Update! The H5 page now supports collapsible abstracts for a better reading experience! Click "Read the original" to visit arxivdaily.com, which covers CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, and offers search, favorites, and more!

cs.CV: 70 papers today.

Transformer (4 papers)

【1】 Tensor-to-Image: Image-to-Image Translation with Vision Transformers
Link: https://arxiv.org/abs/2110.08037

Authors: Yiğit Gündüç
Abstract: Transformers have gained huge attention since they were first introduced and have a wide range of applications. Transformers have started to take over all areas of deep learning, and the Vision Transformers paper also proved that they can be used for computer vision tasks. In this paper, we utilized a vision transformer-based custom-designed model, tensor-to-image, for image-to-image translation. With the help of self-attention, our model was able to generalize and apply to different problems without a single modification.

【2】 Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation
Link: https://arxiv.org/abs/2110.07858

Authors: Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang
Affiliation: Google Research
Abstract: We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable by humans. This indicates that ViTs heavily use features that survived such transformations but are generally not indicative of the semantic class to humans. Further investigations show that these features are useful but non-robust, as ViTs trained on them can achieve high in-distribution accuracy, but break down under distribution shifts. From this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use the images transformed with our patch-based operations as negatively augmented views and offer losses to regularize the training away from using non-robust features. This is a complementary view to existing research that mostly focuses on augmenting inputs with semantic-preserving transformations to enforce models' invariance. We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet-based robustness benchmarks. Furthermore, we find our patch-based negative augmentation is complementary to traditional (positive) data augmentation, and together they boost the performance further. All the code in this work will be open-sourced.
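
To make the negative-augmentation idea concrete, below is a minimal sketch assuming patch shuffling as the semantics-destroying transformation and a push toward the uniform prediction as the regularizing loss; the paper's exact patch operations and loss formulation may differ.

```python
# A hedged sketch of patch-based negative augmentation (my reading of the
# abstract, not the authors' released code). An image is cut into patches,
# the patches are randomly shuffled to destroy semantics, and a loss term
# pushes the model's prediction on the shuffled view toward the uniform
# distribution, discouraging reliance on patch-local, non-robust features.
import torch
import torch.nn.functional as F

def shuffle_patches(x, patch=16):
    """Randomly permute the (patch x patch) tiles of a batch of images."""
    b, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    # (B, C, gh, p, gw, p) -> (B, gh*gw, C, p, p)
    tiles = x.reshape(b, c, gh, patch, gw, patch).permute(0, 2, 4, 1, 3, 5)
    tiles = tiles.reshape(b, gh * gw, c, patch, patch)
    perm = torch.randperm(gh * gw, device=x.device)
    tiles = tiles[:, perm]
    # reassemble the shuffled tiles into an image
    tiles = tiles.reshape(b, gh, gw, c, patch, patch).permute(0, 3, 1, 4, 2, 5)
    return tiles.reshape(b, c, h, w)

def negative_augmentation_loss(model, x, num_classes):
    """KL(prediction on shuffled view || uniform): the model should be
    maximally unsure about a semantics-destroyed image."""
    logits = model(shuffle_patches(x))
    log_probs = F.log_softmax(logits, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    return F.kl_div(log_probs, uniform, reduction="batchmean")
```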

【3】 Certified Patch Robustness via Smoothed Vision Transformers
Link: https://arxiv.org/abs/2110.07719

Authors: Hadi Salman, Saachi Jain, Eric Wong, Aleksander Mądry
Affiliation: MIT
Abstract: Certified patch defenses can guarantee robustness of an image classifier to arbitrary changes within a bounded contiguous region. But, currently, this robustness comes at a cost of degraded standard accuracies and slower inference times. We demonstrate how using vision transformers enables significantly better certified patch robustness that is also more computationally efficient and does not incur a substantial drop in standard accuracy. These improvements stem from the inherent ability of the vision transformer to gracefully handle largely masked images. Our code is available at https://github.com/MadryLab/smoothed-vit.
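
The smoothing scheme this line of work builds on classifies many column-ablated copies of an image and takes a majority vote. Here is a hedged sketch of that inference loop; the band width and the voting rule are illustrative assumptions, not the authors' exact implementation.

```python
# Derandomized smoothing with column ablations, sketched: every width-`band`
# column band of the image is kept in turn, the rest is zeroed out, and the
# final label is the majority vote over all ablations. A ViT can classify
# such largely-masked inputs gracefully, which is the paper's key point.
import torch

@torch.no_grad()
def column_smoothed_predict(model, x, band=19):
    """x: (C, H, W) image tensor; model maps a batch of images to logits."""
    c, h, w = x.shape
    votes = None
    for start in range(w):
        ablated = torch.zeros_like(x)
        cols = [(start + i) % w for i in range(band)]
        ablated[:, :, cols] = x[:, :, cols]        # keep one column band
        logits = model(ablated.unsqueeze(0))       # (1, num_classes)
        one_hot = torch.zeros_like(logits)
        one_hot[0, logits.argmax(dim=-1)] = 1.0
        votes = one_hot if votes is None else votes + one_hot
    return votes.argmax(dim=-1)                    # majority-voted class
```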

【4】 Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining
Link: https://arxiv.org/abs/2110.07919

Authors: Mariia Dobko, Danylo-Ivan Kolinko, Ostap Viniavskyi, Yurii Yelisieiev
Affiliation: The Machine Learning Lab at Ukrainian Catholic University, Lviv, Ukraine
Abstract: We apply an ensemble of modified TransBTS, nnU-Net, and a combination of both for the segmentation task of the BraTS 2021 challenge. In fact, we change the original architecture of the TransBTS model by adding Squeeze-and-Excitation blocks, increasing the number of CNN layers, and replacing the positional encoding in the Transformer block with a learnable Multilayer Perceptron (MLP) embedding, which makes the Transformer adjustable to any input size during inference. With these modifications, we are able to largely improve TransBTS performance. Inspired by the nnU-Net framework, we decided to combine it with our modified TransBTS by changing the architecture inside nnU-Net to our custom model. On the validation set of BraTS 2021, the ensemble of these approaches achieves 0.8496, 0.8698, 0.9256 Dice score and 15.72, 11.057, 3.374 HD95 for enhancing tumor, tumor core, and whole tumor, correspondingly. Our code is publicly available.
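
The size-agnostic positional embedding the authors describe can be illustrated with a small coordinate MLP: because the MLP takes token coordinates rather than an index into a fixed-size table, it works for any token count at inference time. Layer widths and the normalization in this sketch are illustrative assumptions, not the paper's exact design.

```python
# A hedged sketch of an MLP positional embedding for 3D token grids.
import torch
import torch.nn as nn

class MLPPositionalEmbedding(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, tokens, grid):
        """tokens: (B, N, dim); grid: (N, 3) voxel coordinates per token."""
        coords = grid / grid.amax(dim=0, keepdim=True)   # normalize to [0, 1]
        return tokens + self.mlp(coords)                 # broadcast over batch

# usage: an 8 x 8 x 8 token grid, i.e. N = 512 tokens
dim = 384
pe = MLPPositionalEmbedding(dim)
grid = torch.stack(torch.meshgrid(
    torch.arange(8.), torch.arange(8.), torch.arange(8.), indexing="ij"
), dim=-1).reshape(-1, 3)                                # (512, 3)
tokens = torch.randn(2, 512, dim)
out = pe(tokens, grid)                                   # (2, 512, dim)
```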

Detection (7 papers)

【1】 Reliable Shot Identification for Complex Event Detection via Visual-Semantic Embedding
Link: https://arxiv.org/abs/2110.08063

Authors: Minnan Luo, Xiaojun Chang, Chen Gong
Affiliations: School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China; School of Computing Technologies, RMIT University, Melbourne, VIC, Australia
Note: 11 pages, accepted by CVIU
Abstract: Multimedia event detection is the task of detecting a specific event of interest in a user-generated video on websites. The most fundamental challenge facing this task lies in the enormously varying quality of the video as well as the inherently high-level semantic abstraction of the event. In this paper, we decompose the video into several segments and intuitively model the task of complex event detection as a multiple instance learning problem by representing each video as a "bag" of segments in which each segment is referred to as an instance. Instead of treating the instances equally, we associate each instance with a reliability variable to indicate its importance and then select reliable instances for training. To measure the reliability of the varying instances precisely, we propose a visual-semantic guided loss by exploiting low-level features from visual information together with instance-event similarity based high-level semantic features. Motivated by curriculum learning, we introduce a negative elastic-net regularization term to start training the classifier with instances of high reliability and gradually take the instances with relatively low reliability into consideration. An alternative optimization algorithm is developed to solve the proposed challenging non-convex non-smooth problem. Experimental results on standard datasets, i.e., TRECVID MEDTest 2013 and TRECVID MEDTest 2014, demonstrate the effectiveness and superiority of the proposed method over the baseline algorithms.

【2】 Receptive Field Broadening and Boosting for Salient Object Detection
Link: https://arxiv.org/abs/2110.07859

Authors: Mingcan Ma, Changqun Xia, Chenxi Xie, Xiaowu Chen, Jia Li
Affiliations: State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University, Beijing, China; Pengcheng Laboratory, Shenzhen, China
Note: 9 pages, 5 figures
Abstract: Salient object detection requires a comprehensive and scalable receptive field to locate the visually significant objects in the image. Recently, the emergence of visual transformers and multi-branch modules has significantly enhanced the ability of neural networks to perceive objects at different scales. However, compared to the traditional backbone, the calculation process of transformers is time-consuming. Moreover, different branches of the multi-branch modules could cause the same error back propagation in each training iteration, which is not conducive to extracting discriminative features. To solve these problems, we propose a bilateral network based on transformer and CNN to efficiently broaden local details and global semantic information simultaneously. Besides, a Multi-Head Boosting (MHB) strategy is proposed to enhance the specificity of different network branches. By calculating the errors of different prediction heads, each branch can separately pay more attention to the pixels that other branches predict incorrectly. Moreover, unlike multi-path parallel training, MHB randomly selects one branch each time for gradient back propagation in a boosting way. Additionally, an Attention Feature Fusion Module (AF) is proposed to fuse two types of features according to their respective characteristics. Comprehensive experiments on five benchmark datasets demonstrate that the proposed method can achieve a significant performance improvement compared with the state-of-the-art methods.
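
A minimal sketch of how the Multi-Head Boosting strategy might look for a binary saliency map, as read from the abstract: per iteration one branch is randomly picked for backpropagation, and its per-pixel loss is up-weighted where the other heads are wrong. The weighting rule and the boost factor here are assumptions for illustration, not the paper's formulation.

```python
# A hedged sketch of Multi-Head Boosting for binary saliency prediction.
import random
import torch
import torch.nn.functional as F

def mhb_loss(head_logits, target, boost=2.0):
    """head_logits: list of (B, 1, H, W) saliency logits;
    target: (B, 1, H, W) ground-truth map in {0, 1}."""
    k = random.randrange(len(head_logits))            # branch trained this step
    with torch.no_grad():
        wrong = torch.zeros_like(target)
        for i, logit in enumerate(head_logits):
            if i == k:
                continue
            pred = (torch.sigmoid(logit) > 0.5).float()
            wrong = torch.maximum(wrong, (pred != target).float())
        weight = 1.0 + (boost - 1.0) * wrong          # up-weight others' mistakes
    return F.binary_cross_entropy_with_logits(
        head_logits[k], target, weight=weight
    )
```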

【3】 EMDS-7: Environmental Microorganism Image Dataset Seventh Version for Multiple Object Detection Evaluation
Link: https://arxiv.org/abs/2110.07723

Authors: Hechen Yang, Chen Li, Xin Zhao, Jiawei Zhang, Pingli Ma, Peng Zhao, Ao Chen, Tao Jiang, Hongzan Sun, and Marcin Grzegorzek
Affiliations: Microscopic Image and Medical Image Analysis Group, MBIE College, Northeastern University, Shenyang, PR China; Environmental Engineering Department, Northeastern University, Shenyang, China
Abstract: The Environmental Microorganism Image Dataset Seventh Version (EMDS-7) is a microscopic image data set, including the original Environmental Microorganism images (EMs) and the corresponding object labeling files in ".XML" format. The EMDS-7 data set consists of 41 types of EMs, with a total of 2365 images and 13216 labeled objects. The EMDS-7 database mainly focuses on object detection. In order to prove the effectiveness of EMDS-7, we select the most commonly used deep learning methods (Faster-RCNN, YOLOv3, YOLOv4, SSD and RetinaNet) and evaluation indices for testing and evaluation.
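
Assuming the ".XML" labels follow the common Pascal-VOC layout (an assumption worth checking against one file from the dataset before relying on it), the annotations can be read with the Python standard library alone:

```python
# Minimal parser for VOC-style object detection XML annotations.
import xml.etree.ElementTree as ET

def parse_voc_xml(path):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) boxes."""
    root = ET.parse(path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((
            name,
            int(float(bb.findtext("xmin"))),
            int(float(bb.findtext("ymin"))),
            int(float(bb.findtext("xmax"))),
            int(float(bb.findtext("ymax"))),
        ))
    return boxes
```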

【4】 Adversarial Scene Reconstruction and Object Detection System for Assisting Autonomous Vehicle
Link: https://arxiv.org/abs/2110.07716

Authors: Md Foysal Haque, Hay-Youn Lim, Dae-Seong Kang
Affiliation: Department of Electronic Engineering, Dong-A University, Republic of Korea
Abstract: In the current computer vision era, classifying scenes through video surveillance systems is a crucial task. AI video surveillance technologies have advanced remarkably as artificial intelligence and deep learning have entered the field, and adopting strong combinations of deep learning visual classification methods has achieved high accuracy in classifying visual scenes. However, visual classifiers face difficulties examining scenes in dark visible areas, especially at nighttime, and in identifying the contexts of the scenes. This paper proposes a deep learning model that reconstructs dark visual scenes into clear, daylight-like scenes, and the method recognizes visual actions for the autonomous vehicle. The proposed model achieved 87.3 percent accuracy for scene reconstruction and 89.2 percent for scene understanding and detection tasks.

【5】 Talking Detection In Collaborative Learning Environments
Link: https://arxiv.org/abs/2110.07646

Authors: Wenjing Shi, Marios S. Pattichis, Sylvia Celedón-Pattichis, Carlos López-Leiva
Affiliations: Image and Video Processing and Communications Lab, Dept. of Electrical and Computer Engineering; Dept. of Language, Literacy, and Sociocultural Studies, University of New Mexico, United States
Abstract: We study the problem of detecting talking activities in collaborative learning videos. Our approach uses head detection and projections of the log-magnitude of optical flow vectors to reduce the problem to a simple classification of small projection images without the need for training complex, 3-D activity classification systems. The small projection images are then easily classified using a simple majority vote of standard classifiers. For talking detection, our proposed approach is shown to significantly outperform single activity systems. We have an overall accuracy of 59% compared to 42% for Temporal Segment Network (TSN) and 45% for Convolutional 3D (C3D). In addition, our method is able to detect multiple talking instances from multiple speakers, while also detecting the speakers themselves.
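
The core signal is easy to reproduce: dense optical flow, its log-magnitude, and projections onto the image axes. How the 1-D projections are assembled into a small "projection image" is my assumption here (an outer product); the paper may combine them differently, e.g. by stacking projections over time.

```python
# A hedged sketch of the log-magnitude flow projection signal.
import cv2
import numpy as np

def flow_projection_image(prev_gray, next_gray, out_size=64):
    """prev_gray, next_gray: 8-bit single-channel frames."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    log_mag = np.log1p(np.linalg.norm(flow, axis=2))   # (H, W) log-magnitude
    row_proj = log_mag.sum(axis=1)                     # projection onto vertical axis
    col_proj = log_mag.sum(axis=0)                     # projection onto horizontal axis
    # outer product turns the two 1-D projections into a small 2-D image
    proj = np.outer(row_proj, col_proj).astype(np.float32)
    return cv2.resize(proj, (out_size, out_size))
```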

【6】 A deep learning model for classification of diabetic retinopathy in eye fundus images based on retinal lesion detection
Link: https://arxiv.org/abs/2110.07745

Authors: Melissa delaPava, Hernán Ríos, Francisco J. Rodríguez, Oscar J. Perdomo, Fabio A. González
Affiliations: Systems and Computer Engineering Department, Universidad Nacional de Colombia, Bogotá, Colombia; Fundación Oftalmológica Nacional; School of Medicine and Health Sciences, Universidad del Rosario
Note: 7 pages and 1 figure
Abstract: Diabetic retinopathy (DR) is the result of a complication of diabetes affecting the retina. It can cause blindness if left undiagnosed and untreated. An ophthalmologist performs the diagnosis by screening each patient and analyzing the retinal lesions via ocular imaging. In practice, such analysis is time-consuming and cumbersome to perform. This paper presents a model for automatic DR classification on eye fundus images. The approach identifies the main ocular lesions related to DR and subsequently diagnoses the illness. The proposed method follows the same workflow as the clinicians, providing information that can be interpreted clinically to support the prediction. A subset of the kaggle EyePACS and the Messidor-2 datasets, labeled with ocular lesions, is made publicly available. The kaggle EyePACS subset is used as a training set and the Messidor-2 as a test set for lesion and DR classification models. For DR diagnosis, our model has an area-under-the-curve, sensitivity, and specificity of 0.948, 0.886, and 0.875, respectively, which competes with state-of-the-art approaches.

【7】 Non-contact Atrial Fibrillation Detection from Face Videos by Learning Systolic Peaks
Link: https://arxiv.org/abs/2110.07610

Authors: Zhaodong Sun, Juhani Junttila, Mikko Tulppo, Tapio Seppänen, Xiaobai Li
Affiliation: University of Oulu
Abstract: Objective: We propose a non-contact approach for atrial fibrillation (AF) detection from face videos. Methods: Face videos, electrocardiography (ECG), and contact photoplethysmography (PPG) from 100 healthy subjects and 100 AF patients are recorded. All the videos in the healthy group are labeled as healthy. Videos in the patient group are labeled as AF, sinus rhythm (SR), or atrial flutter (AFL) by cardiologists. We use a 3D convolutional neural network for remote PPG measurement and propose a novel loss function (Wasserstein distance) to use the timing of systolic peaks from contact PPG as the label for our model training. Then a set of heart rate variability (HRV) features are calculated from the inter-beat intervals, and a support vector machine (SVM) classifier is trained with the HRV features. Results: Our proposed method can accurately extract systolic peaks from face videos for AF detection. The proposed method is trained with subject-independent 10-fold cross-validation with 30s video clips and tested on two tasks. 1) Classification of healthy versus AF: the accuracy, sensitivity, and specificity are 96.16%, 95.71%, and 96.23%. 2) Classification of SR versus AF: the accuracy, sensitivity, and specificity are 95.31%, 98.66%, and 91.11%. Conclusion: We achieve good performance of non-contact AF detection by learning systolic peaks. Significance: Non-contact AF detection can be used for self-screening of AF symptoms in susceptible populations at home, or self-monitoring of AF recurrence after treatment for chronic patients.
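
The classical back end of this pipeline, HRV features from inter-beat intervals followed by an SVM, is straightforward to sketch. The two features below (SDNN, RMSSD) are common HRV measures standing in for the paper's unspecified feature set, and the data are synthetic placeholders.

```python
# A hedged sketch of the HRV-features-plus-SVM classification stage.
import numpy as np
from sklearn.svm import SVC

def hrv_features(peak_times_s):
    """peak_times_s: 1-D array of systolic peak times in seconds."""
    ibi = np.diff(peak_times_s) * 1000.0          # inter-beat intervals in ms
    sdnn = ibi.std(ddof=1)                        # overall variability (SDNN)
    rmssd = np.sqrt(np.mean(np.diff(ibi) ** 2))   # beat-to-beat variability (RMSSD)
    return np.array([sdnn, rmssd])

# synthetic placeholder data: 20 clips of ~40 beats at roughly 75 bpm
rng = np.random.default_rng(0)
peak_times = [np.cumsum(0.8 + 0.1 * rng.standard_normal(40)) for _ in range(20)]
labels = rng.integers(0, 2, size=20)              # dummy labels: 0 = SR, 1 = AF
X = np.stack([hrv_features(t) for t in peak_times])
clf = SVC(kernel="rbf").fit(X, labels)
```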

Classification & Recognition (9 papers)

【1】 Crop Rotation Modeling for Deep Learning-Based Parcel Classification from Satellite Time Series
Link: https://arxiv.org/abs/2110.08187

Authors: Félix Quinton, Loic Landrieu
Affiliation: LASTIG, Univ. Gustave Eiffel, ENSG, IGN, Saint-Mandé, France
Note: Under review
Abstract: While annual crop rotations play a crucial role for agricultural optimization, they have been largely ignored for automated crop type mapping. In this paper, we take advantage of the increasing quantity of annotated satellite data to propose the first deep learning approach modeling simultaneously the inter- and intra-annual agricultural dynamics of parcel classification. Along with simple training adjustments, our model provides an improvement of over 6.6 mIoU points over the current state-of-the-art of crop classification. Furthermore, we release the first large-scale multi-year agricultural dataset with over 300,000 annotated parcels.

【2】 Automated Quality Control of Vacuum Insulated Glazing by Convolutional Neural Network Image Classification
Link: https://arxiv.org/abs/2110.08079

Authors: Henrik Riedel, Sleheddine Mokdad, Isabell Schulz, Cenk Kocer, Philipp Rosendahl, Jens Schneider, Michael A. Kraus, Michael Drass
Note: 10 pages, 11 figures, 1 table
Abstract: Vacuum Insulated Glazing (VIG) is a highly thermally insulating window technology, which boasts an extremely thin profile and lower weight as compared to gas-filled insulated glazing units of equivalent performance. The VIG is a double-pane configuration with a submillimeter vacuum gap between the panes and is therefore under constant atmospheric pressure over its service life. Small pillars are positioned between the panes to maintain the gap, which can damage the glass, reducing the lifetime of the VIG unit. To efficiently assess any surface damage on the glass, an automated damage detection system is highly desirable. For the purpose of classifying the damage, we have developed, trained, and tested a deep learning computer vision system using convolutional neural networks. The classification model flawlessly classified the test dataset with an area under the curve (AUC) for the receiver operating characteristic (ROC) of 100%. We automatically cropped the images down to their relevant information by using Faster-RCNN to locate the position of the pillars. We employ the state-of-the-art methods Grad-CAM and Score-CAM of explainable Artificial Intelligence (XAI) to provide an understanding of the internal mechanisms and are able to show that our classifier outperforms ResNet50V2 for identification of crack locations and geometry. The proposed methods can therefore be used to detect systematic defects even without large amounts of training data. Further analyses of our model's predictive capabilities demonstrate its superiority over state-of-the-art models (ResNet50V2, ResNet101V2 and ResNet152V2) in terms of convergence speed, accuracy, precision at 100% recall, and AUC for ROC.
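
Grad-CAM, one of the two XAI methods the authors employ, can be reproduced generically in a few lines with PyTorch hooks. This sketch is a textbook implementation on a stock ResNet, not the authors' code: gradients of the target class score with respect to a conv layer's activations are pooled into channel weights, and the weighted activations give a coarse localization heatmap.

```python
# A generic Grad-CAM sketch using forward/backward hooks.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
acts, grads = {}, {}
layer = model.layer4  # last conv stage

layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()       # score of the top predicted class
score.backward()

w = grads["v"].mean(dim=(2, 3), keepdim=True)            # channel weights
cam = F.relu((w * acts["v"]).sum(dim=1, keepdim=True))   # weighted activations
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized heatmap
```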

【3】 Relation Preserving Triplet Mining for Stabilizing the Triplet Loss in Vehicle Re-identification
Link: https://arxiv.org/abs/2110.07933

Authors: Adhiraj Ghosh, Kuruparan Shanmugalingam, Wen-Yan Lin
Affiliations: Singapore Management University; Zurich University of Applied Sciences; University of New South Wales
Abstract: Object appearances often change dramatically with pose variations. This creates a challenge for embedding schemes that seek to map instances with the same object ID to locations that are as close as possible. This issue becomes significantly heightened in complex computer vision tasks such as re-identification (re-id). In this paper, we suggest these dramatic appearance changes are indications that an object ID is composed of multiple natural groups, and it is counter-productive to forcefully map instances from different groups to a common location. This leads us to introduce Relation Preserving Triplet Mining (RPTM), a feature matching guided triplet mining scheme that ensures triplets will respect the natural sub-groupings within an object ID. We use this triplet mining mechanism to establish a pose-aware, well-conditioned triplet cost function. This allows a single network to be trained with fixed parameters across three challenging benchmarks, while still providing state-of-the-art re-identification results.

【4】 Improving Unsupervised Domain Adaptive Re-Identification via Source-Guided Selection of Pseudo-Labeling Hyperparameters
Link: https://arxiv.org/abs/2110.07897

Authors: Fabian Dubourvieux, Angélique Loesch, Romaric Audigier, Samia Ainouz, Stéphane Canu
Affiliations: Université Paris-Saclay, CEA, List, Palaiseau, France; Normandie Univ, INSA Rouen, LITIS, Av. de l'Université le Madrillet, Saint Etienne du Rouvray, France
Note: Submitted to IEEE Access for review
Abstract: Unsupervised Domain Adaptation (UDA) for re-identification (re-ID) is a challenging task: to avoid a costly annotation of additional data, it aims at transferring knowledge from a domain with annotated data to a domain of interest with only unlabeled data. Pseudo-labeling approaches have proven to be effective for UDA re-ID. However, the effectiveness of these approaches heavily depends on the choice of some hyperparameters (HP) that affect the generation of pseudo-labels by clustering. The lack of annotation in the domain of interest makes this choice non-trivial. Current approaches simply reuse the same empirical value for all adaptation tasks, regardless of the target data representation that changes through pseudo-labeling training phases. As this simplistic choice may limit their performance, we aim at addressing this issue. We propose new theoretical grounds on HP selection for clustering UDA re-ID as well as a method of automatic and cyclic HP tuning for pseudo-labeling UDA clustering: HyPASS. HyPASS consists in incorporating two modules in pseudo-labeling methods: (i) HP selection based on a labeled source validation set and (ii) conditional domain alignment of feature discriminativeness to improve HP selection based on source samples. Experiments on commonly used person re-ID and vehicle re-ID datasets show that our proposed HyPASS consistently improves on the best state-of-the-art methods in re-ID compared to the commonly used empirical HP setting.

【5】 PolyNet: Polynomial Neural Network for 3D Shape Recognition with PolyShape Representation
Link: https://arxiv.org/abs/2110.07882

Authors: Mohsen Yavartanoo, Shih-Hsuan Hung, Reyhaneh Neshatavar, Yue Zhang, Kyoung Mu Lee
Affiliations: SNU ECE & ASRI; Oregon State University
Abstract: 3D shape representation and its processing have substantial effects on 3D shape recognition. The polygon mesh as a 3D shape representation has many advantages in computer graphics and geometry processing. However, there are still some challenges for the existing deep neural network (DNN)-based methods on polygon mesh representation, such as handling the variations in the degree and permutations of the vertices and their pairwise distances. To overcome these challenges, we propose a DNN-based method (PolyNet) and a specific polygon mesh representation (PolyShape) with a multi-resolution structure. PolyNet contains two operations: (1) a polynomial convolution (PolyConv) operation with learnable coefficients, which learns continuous distributions as the convolutional filters to share the weights across different vertices, and (2) a polygonal pooling (PolyPool) procedure utilizing the multi-resolution structure of PolyShape to aggregate the features in a much lower dimension. Our experiments demonstrate the strength and the advantages of PolyNet on both 3D shape classification and retrieval tasks compared to existing polygon mesh-based methods, and its superiority in classifying graph representations of images. The code is publicly available from https://myavartanoo.github.io/polynet/.

【6】 "Knights": First Place Submission for VIPriors21 Action Recognition Challenge at ICCV 2021 链接:https://arxiv.org/abs/2110.07758

Authors: Ishan Dave, Naman Biyani, Brandon Clark, Rohit Gupta, Yogesh Rawat, Mubarak Shah
Affiliations: Center for Research in Computer Vision (CRCV), University of Central Florida, Orlando, Florida, USA; Indian Institute of Technology, Kanpur, India
Note: Challenge results are available at this https URL
Abstract: This technical report presents our approach "Knights" to solve the action recognition task on a small subset of Kinetics-400, i.e. Kinetics400ViPriors, without using any extra data. Our approach has 3 main components: state-of-the-art Temporal Contrastive self-supervised pretraining, video transformer models, and optical flow modality. Along with the use of standard test-time augmentation, our proposed solution achieves 73% on the Kinetics400ViPriors test set, which is the best among all entries in the Visual Inductive Priors for Data-Efficient Computer Vision Action Recognition Challenge, ICCV 2021.

【7】 Beyond Classification: Directly Training Spiking Neural Networks for Semantic Segmentation
Link: https://arxiv.org/abs/2110.07742

Authors: Youngeun Kim, Joshua Chough, Priyadarshini Panda
Affiliation: Department of Electrical Engineering, Yale University, USA
Abstract: Spiking Neural Networks (SNNs) have recently emerged as the low-power alternative to Artificial Neural Networks (ANNs) because of their sparse, asynchronous, and binary event-driven processing. Due to their energy efficiency, SNNs have a high possibility of being deployed for real-world, resource-constrained systems such as autonomous vehicles and drones. However, owing to their non-differentiable and complex neuronal dynamics, most previous SNN optimization methods have been limited to image recognition. In this paper, we explore the SNN applications beyond classification and present semantic segmentation networks configured with spiking neurons. Specifically, we first investigate two representative SNN optimization techniques for recognition tasks (i.e., ANN-SNN conversion and surrogate gradient learning) on semantic segmentation datasets. We observe that, when converted from ANNs, SNNs suffer from high latency and low performance due to the spatial variance of features. Therefore, we directly train networks with surrogate gradient learning, resulting in lower latency and higher performance than ANN-SNN conversion. Moreover, we redesign two fundamental ANN segmentation architectures (i.e., Fully Convolutional Networks and DeepLab) for the SNN domain. We conduct experiments on two public semantic segmentation benchmarks including the PASCAL VOC2012 dataset and the DDD17 event-based dataset. In addition to showing the feasibility of SNNs for semantic segmentation, we show that SNNs can be more robust and energy-efficient compared to their ANN counterparts in this domain.
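
Surrogate gradient learning, the direct-training route the paper favors over ANN-SNN conversion, hinges on one trick: a hard spike in the forward pass, a smooth stand-in derivative in the backward pass. A minimal PyTorch sketch follows; the surrogate shape (a scaled sigmoid derivative) and constants are common choices in the literature, not the paper's specific settings.

```python
# A minimal surrogate-gradient leaky integrate-and-fire (LIF) neuron.
import torch

class SpikeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane):
        ctx.save_for_backward(membrane)
        return (membrane > 0).float()               # binary spike

    @staticmethod
    def backward(ctx, grad_out):
        membrane, = ctx.saved_tensors
        sig = torch.sigmoid(4.0 * membrane)         # smooth surrogate derivative
        return grad_out * 4.0 * sig * (1 - sig)

def lif_step(x, mem, decay=0.9, threshold=1.0):
    """One LIF time step; returns (spikes, updated membrane potential)."""
    mem = decay * mem + x                           # leaky integration
    spike = SpikeFn.apply(mem - threshold)
    mem = mem - spike * threshold                   # soft reset after a spike
    return spike, mem
```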

【8】 ASK: Adaptively Selecting Key Local Features for RGB-D Scene Recognition
Link: https://arxiv.org/abs/2110.07703

Authors: Zhitong Xiong, Yuan Yuan, Qi Wang
Abstract: Indoor scene images usually contain scattered objects and various scene layouts, which make RGB-D scene classification a challenging task. Existing methods still have limitations for classifying scene images with great spatial variability. Thus, how to extract local patch-level features effectively using only image labels is still an open problem for RGB-D scene recognition. In this paper, we propose an efficient framework for RGB-D scene recognition, which adaptively selects important local features to capture the great spatial variability of scene images. Specifically, we design a differentiable local feature selection (DLFS) module, which can extract the appropriate number of key local scene-related features. Discriminative local theme-level and object-level representations can be selected with the DLFS module from the spatially-correlated multi-modal RGB-D features. We take advantage of the correlation between RGB and depth modalities to provide more cues for selecting local features. To ensure that discriminative local features are selected, the variational mutual information maximization loss is proposed. Additionally, the DLFS module can be easily extended to select local features of different scales. By concatenating the local-orderless and global structured multi-modal features, the proposed framework can achieve state-of-the-art performance on public RGB-D scene recognition datasets.

【9】 Multi-Layer Pseudo-Supervision for Histopathology Tissue Semantic Segmentation using Patch-level Classification Labels
Link: https://arxiv.org/abs/2110.08048

Authors: Chu Han, Jiatai Lin, Jinhai Mai, Yi Wang, Qingling Zhang, Bingchao Zhao, Xin Chen, Xipeng Pan, Zhenwei Shi, Xiaowei Xu, Su Yao, Lixu Yan, Huan Lin, Zeyan Xu, Xiaomei Huang, Guoqiang Han, Changhong Liang, Zaiyi Liu
Affiliations: South China University of Technology, China. Xiaowei Xu is with Guangdong Cardiovascular Institute, Guangdong Provincial Key Laboratory of South China Structural Heart Disease
Note: 15 pages, 10 figures, journal
Abstract: Tissue-level semantic segmentation is a vital step in computational pathology. Fully-supervised models have already achieved outstanding performance with dense pixel-level annotations. However, drawing such labels on the giga-pixel whole slide images is extremely expensive and time-consuming. In this paper, we use only patch-level classification labels to achieve tissue semantic segmentation on histopathology images, finally reducing the annotation efforts. We propose a two-step model including a classification phase and a segmentation phase. In the classification phase, we propose a CAM-based model to generate pseudo masks from patch-level labels. In the segmentation phase, we achieve tissue semantic segmentation by our proposed Multi-Layer Pseudo-Supervision. Several technical novelties have been proposed to reduce the information gap between pixel-level and patch-level annotations. As a part of this paper, we introduce a new weakly-supervised semantic segmentation (WSSS) dataset for lung adenocarcinoma (LUAD-HistoSeg). We conducted several experiments to evaluate our proposed model on two datasets. Our proposed model outperforms two state-of-the-art WSSS approaches. Note that we can achieve comparable quantitative and qualitative results with the fully-supervised model, with only around a 2% gap for MIoU and FwIoU. Compared with manual labeling, our model can greatly reduce the annotation time from hours to minutes. The source code is available at: https://github.com/ChuHan89/WSSS-Tissue.
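
The CAM step of the classification phase can be sketched as follows. The backbone, threshold, and normalization here are illustrative assumptions, and the paper's CAM-based model adds refinements beyond plain CAM.

```python
# A hedged sketch of generating pseudo masks from class activation maps.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

num_tissue_types = 4
backbone = resnet18(weights=None)
features = torch.nn.Sequential(*list(backbone.children())[:-2])  # conv features
classifier = torch.nn.Conv2d(512, num_tissue_types, kernel_size=1)

def cam_pseudo_mask(patch, threshold=0.6):
    """patch: (1, 3, H, W) image tensor -> (1, num_classes, H, W) binary masks."""
    fmap = classifier(features(patch))                       # (1, K, h, w) CAMs
    cam = F.interpolate(fmap, size=patch.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = torch.relu(cam)
    cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)  # per-class [0, 1]
    return (cam > threshold).float()

masks = cam_pseudo_mask(torch.randn(1, 3, 224, 224))
```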

Segmentation & Semantics (5 papers)

【1】 Performance, Successes and Limitations of Deep Learning Semantic Segmentation of Multiple Defects in Transmission Electron Micrographs
Link: https://arxiv.org/abs/2110.08244

Authors: Ryan Jacobs, Mingren Shen, Yuhan Liu, Wei Hao, Xiaoshan Li, Ruoyu He, Jacob RC Greaves, Donglin Wang, Zeming Xie, Zitong Huang, Chao Wang, Kevin G. Field, Dane Morgan
Affiliations: Department of Materials Science and Engineering, University of Wisconsin-Madison; Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin
Abstract: In this work, we perform semantic segmentation of multiple defect types in electron microscopy images of irradiated FeCrAl alloys using a deep learning Mask Regional Convolutional Neural Network (Mask R-CNN) model. We conduct an in-depth analysis of key model performance statistics, with a focus on quantities such as predicted distributions of defect shapes, defect sizes, and defect areal densities relevant to informing modeling and understanding of irradiated Fe-based materials properties. To better understand the performance and present limitations of the model, we provide examples of useful evaluation tests which include a suite of random splits, and dataset size-dependent and domain-targeted cross validation tests. Overall, we find that the current model is a fast, effective tool for automatically characterizing and quantifying multiple defect types in microscopy images, with a level of accuracy on par with human domain expert labelers. More specifically, the model can achieve average defect identification F1 scores as high as 0.8 and, based on random cross validation, has low overall average (+/- standard deviation) defect size and density percentage errors of 7.3 (+/- 3.8)% and 12.7 (+/- 5.3)%, respectively. Further, our model predicts the expected material hardening to within 10-20 MPa (about 10% of total hardening), which is about the same error level as experiments. Our targeted evaluation tests also suggest the best path toward improving future models is not expanding existing databases with more labeled images but instead data additions that target weak points of the model domain, such as images from different microscopes, imaging conditions, irradiation environments, and alloy types. Finally, we discuss the first phase of an effort to provide an easy-to-use, open-source object detection tool to the broader community for identifying defects in new images.

【2】 Guided Point Contrastive Learning for Semi-supervised Point Cloud Semantic Segmentation
Link: https://arxiv.org/abs/2110.08188

Authors: Li Jiang, Shaoshuai Shi, Zhuotao Tian, Xin Lai, Shu Liu, Chi-Wing Fu, Jiaya Jia
Affiliations: The Chinese University of Hong Kong; SmartMore
Note: ICCV 2021
Abstract: Rapid progress in 3D semantic segmentation is inseparable from the advances of deep network models, which highly rely on large-scale annotated data for training. To address the high cost and challenges of 3D point-level labeling, we present a method for semi-supervised point cloud semantic segmentation to adopt unlabeled point clouds in training to boost the model performance. Inspired by the recent contrastive loss in self-supervised tasks, we propose the guided point contrastive loss to enhance the feature representation and model generalization ability in the semi-supervised setting. Semantic predictions on unlabeled point clouds serve as pseudo-label guidance in our loss to avoid negative pairs in the same category. Also, we design the confidence guidance to ensure high-quality feature learning. Besides, a category-balanced sampling strategy is proposed to collect positive and negative samples to mitigate the class imbalance problem. Extensive experiments on three datasets (ScanNet V2, S3DIS, and SemanticKITTI) show the effectiveness of our semi-supervised method to improve the prediction quality with unlabeled data.
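
A minimal reading of the guided point contrastive loss: an InfoNCE-style objective in which candidate negatives sharing the anchor's pseudo-label are masked out, so points of the same category are never pushed apart. The temperature and sampling in this sketch are illustrative assumptions; the paper additionally uses confidence guidance and category-balanced sampling.

```python
# A hedged sketch of a pseudo-label-guided point contrastive loss.
import torch
import torch.nn.functional as F

def guided_point_contrastive_loss(anchor, positive, negatives,
                                  anchor_lbl, negative_lbls, tau=0.1):
    """anchor, positive: (N, D) point features from two augmented views;
    negatives: (M, D); anchor_lbl: (N,), negative_lbls: (M,) pseudo-labels."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1, keepdim=True) / tau     # (N, 1)
    neg_sim = anchor @ negatives.T / tau                          # (N, M)
    # mask out negatives whose pseudo-label matches the anchor's category
    same_class = anchor_lbl[:, None] == negative_lbls[None, :]    # (N, M)
    neg_sim = neg_sim.masked_fill(same_class, float("-inf"))
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    target = torch.zeros(anchor.shape[0], dtype=torch.long,
                         device=anchor.device)                    # positive = col 0
    return F.cross_entropy(logits, target)
```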

【3】 Accurate Fine-grained Layout Analysis for the Historical Tibetan Document Based on the Instance Segmentation
Link: https://arxiv.org/abs/2110.08164

Authors: Penghai Zhao, Weilan Wang, Xiaojuan Wang, Zhengqi Cai, Guowei Zhang, Yuqi Lu
Affiliations: Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou, China; School of Mathematics and Computer Science
Note: The manuscript contains 16 pages, 14 figures, and 40 references in total, and is currently under review by 'Big Data Research' (ISSN: 2214-5796)
Abstract: Accurate layout analysis without subsequent text-line segmentation remains an ongoing challenge, especially when facing the Kangyur, a kind of historical Tibetan document featuring considerable touching components and a mottled background. Aiming at identifying different regions in document images, layout analysis is indispensable for subsequent procedures such as character recognition. However, only a little research has been carried out on line-level layout analysis, and it fails to deal with the Kangyur. To obtain optimal results, a fine-grained sub-line level layout analysis approach is presented. Firstly, we introduce an accelerated method to build the dataset, which is dynamic and reliable. Secondly, enhancements were made to SOLOv2 according to the characteristics of the Kangyur. Then, we fed the enhanced SOLOv2 with the prepared annotation files during the training phase. Once the network is trained, instances of text lines, sentences, and titles can be segmented and identified during the inference stage. The experimental results show that the proposed method delivers a decent 72.7% AP on our dataset. In general, this preliminary research provides insights into fine-grained sub-line level layout analysis and testifies to the SOLOv2-based approach. We also believe that the proposed methods can be adopted on documents in other languages with various layouts.

【4】 Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images
Link: https://arxiv.org/abs/2110.07782

Authors: Shasvat Desai, Debasmita Ghose
Affiliations: Orbital Insight; Yale University
Note: Accepted to Winter Conference on Applications of Computer Vision 2022 (WACV 2022)
Abstract: Remote sensing data is crucial for applications ranging from monitoring forest fires and deforestation to tracking urbanization. Most of these tasks require dense pixel-level annotations for the model to parse visual information from limited labeled data available for these satellite images. Due to the dearth of high-quality labeled training data in this domain, there is a need to focus on semi-supervised techniques. These techniques generate pseudo-labels from a small set of labeled examples which are used to augment the labeled training set. This makes it necessary to have a highly representative and diverse labeled training set. Therefore, we propose to use an active learning-based sampling strategy to select a highly representative set of labeled training data. We demonstrate our proposed method's effectiveness on two existing semantic segmentation datasets containing satellite images: UC Merced Land Use Classification Dataset and DeepGlobe Land Cover Classification Dataset. We report a 27% improvement in mIoU with as little as 2% labeled data using active learning sampling strategies over randomly sampling the small set of labeled training data.

【5】 Gray Matter Segmentation in Ultra High Resolution 7 Tesla ex vivo T2w MRI of Human Brain Hemispheres
Link: https://arxiv.org/abs/2110.07711

Authors: Pulkit Khandelwal, Shokufeh Sadaghiani, Sadhana Ravikumar, Sydney Lim, Sanaz Arezoumandan, Claire Peterson, Eunice Chung, Madigan Bedard, Noah Capp, Ranjit Ittyerah, Elyse Migdal, Grace Choi, Emily Kopp, Bridget Loja, Eusha Hasan, Jiacheng Li, Karthik Prabhakaran, Gabor Mizsei, Marianna Gabrielyan, Theresa Schuck, John Robinson, Daniel Ohm, Edward Lee, John Q. Trojanowski, Corey McMillan, Murray Grossman, David Irwin, M. Dylan Tisdall, Sandhitsu R. Das, Laura E. M. Wisse, David A. Wolk, Paul A. Yushkevich
Affiliations: University of Pennsylvania, Philadelphia, USA; Lund University, Lund, Sweden
Note: Submitted to IEEE International Symposium on Biomedical Imaging (ISBI) 2022
Abstract: Ex vivo MRI of the brain provides remarkable advantages over in vivo MRI for visualizing and characterizing detailed neuroanatomy. However, automated cortical segmentation methods in ex vivo MRI are not well developed, primarily due to limited availability of labeled datasets, and heterogeneity in scanner hardware and acquisition protocols. In this work, we present a high resolution 7 Tesla dataset of 32 ex vivo human brain specimens. We benchmark the cortical mantle segmentation performance of nine neural network architectures, trained and evaluated using manually-segmented 3D patches sampled from specific cortical regions, and show excellent generalizing capabilities across whole brain hemispheres in different specimens, and also on unseen images acquired at different magnetic field strength and imaging sequences. Finally, we provide cortical thickness measurements across key regions in 3D ex vivo human brain images. Our code and processed datasets are publicly available at https://github.com/Pulkit-Khandelwal/picsl-ex-vivo-segmentation.

Semi-/Weakly-/Unsupervised | Active Learning | Uncertainty (3 papers)

【1】 Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction
Link: https://arxiv.org/abs/2110.08232

Authors: Sara Elkerdawy, Mostafa Elhoushi, Hong Zhang, Nilanjan Ray
Affiliations: University of Alberta; Toronto Heterogeneous Compilers Lab, Huawei
Abstract: Dynamic model pruning is a recent direction that allows for the inference of a different sub-network for each input sample during deployment. However, current dynamic methods rely on learning continuous channel gating through regularization by inducing sparsity loss. This formulation introduces complexity in balancing different losses (e.g. task loss, regularization loss). In addition, regularization-based methods lack transparent tradeoff hyperparameter selection to realize a computational budget. Our contribution is twofold: 1) decoupled task and pruning training; 2) simple hyperparameter selection that enables FLOPs reduction estimation before training. We propose to predict a mask to process k filters in a layer based on the activation of its previous layer. We pose the problem as a self-supervised binary classification problem. Each mask predictor module is trained to predict if the log-likelihood of each filter in the current layer belongs to the top-k activated filters. The value k is dynamically estimated for each input based on a novel criterion using the mass of heatmaps. We show experiments on several neural architectures, such as VGG, ResNet, and MobileNet, on CIFAR and ImageNet datasets. On CIFAR, we reach similar accuracy to SOTA methods with 15% and 24% higher FLOPs reduction. Similarly on ImageNet, we achieve a lower drop in accuracy with up to 13% improvement in FLOPs reduction.
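
A sketch of the activation-conditioned gating described in the abstract: a small predictor scores the current layer's filters from the previous layer's pooled activations and keeps only the top-k. The predictor architecture and a fixed k are simplifying assumptions here (the paper estimates k per input from the mass of heatmaps).

```python
# A hedged sketch of per-input top-k channel gating for dynamic pruning.
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fc = nn.Linear(in_ch, out_ch)   # one logit per output filter

    def forward(self, prev_feat, k):
        pooled = prev_feat.mean(dim=(2, 3))              # (B, in_ch) global pool
        logits = self.fc(pooled)                         # (B, out_ch)
        topk = logits.topk(k, dim=1).indices
        mask = torch.zeros_like(logits).scatter_(1, topk, 1.0)
        return mask                                      # (B, out_ch) binary gate

# usage: gate a conv layer's output channels per input sample
conv = nn.Conv2d(64, 128, 3, padding=1)
gate = MaskPredictor(64, 128)
x = torch.randn(2, 64, 32, 32)
y = conv(x) * gate(x, k=48)[:, :, None, None]            # zero the skipped filters
```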

【2】 Attention meets Geometry: Geometry Guided Spatial-Temporal Attention for Consistent Self-Supervised Monocular Depth Estimation
Link: https://arxiv.org/abs/2110.08192

Authors: Patrick Ruhkamp, Daoyi Gao, Hanzhi Chen, Nassir Navab, Benjamin Busam (equal contribution; author ordering determined randomly)
Affiliation: Technical University of Munich
Note: Accepted at 3DV 2021
Abstract: Inferring geometrically consistent dense 3D scenes across a tuple of temporally consecutive images remains challenging for self-supervised monocular depth prediction pipelines. This paper explores how the increasingly popular transformer architecture, together with novel regularized loss formulations, can improve depth consistency while preserving accuracy. We propose a spatial attention module that correlates coarse depth predictions to aggregate local geometric information. A novel temporal attention mechanism further processes the local geometric information in a global context across consecutive images. Additionally, we introduce geometric constraints between frames regularized by photometric cycle consistency. By combining our proposed regularization and the novel spatial-temporal-attention module we fully leverage both the geometric and appearance-based consistency across monocular frames. This yields geometrically meaningful attention and improves temporal depth stability and accuracy compared to previous methods.

【3】 Active Learning of Neural Collision Handler for Complex 3D Mesh Deformations
Link: https://arxiv.org/abs/2110.07727

Authors: Qingyang Tan, Zherong Pan, Breannan Smith, Takaaki Shiratori, Dinesh Manocha
Affiliations: Department of Computer Science, University of Maryland at College Park; Lightspeed & Quantum Studio, Tencent America; Facebook Reality Labs Research
Abstract: We present a robust learning algorithm to detect and handle collisions in 3D deforming meshes. Our collision detector is represented as a bilevel deep autoencoder with an attention mechanism that identifies colliding mesh sub-parts. We use a numerical optimization algorithm to resolve penetrations guided by the network. Our learned collision handler can resolve collisions for unseen, high-dimensional meshes with thousands of vertices. To obtain stable network performance in such large and unseen spaces, we progressively insert new collision data based on the errors in network inferences. We automatically label these data using an analytical collision detector and progressively fine-tune our detection networks. We evaluate our method for collision handling of complex, 3D meshes coming from several datasets with different shapes and topologies, including datasets corresponding to dressed and undressed human poses, cloth simulations, and human hand poses acquired using multiview capture systems. Our approach outperforms supervised learning methods and achieves 93.8-98.1% accuracy compared to the ground truth by analytic methods. Compared to prior learning methods, our approach results in a 5.16%-25.50% lower false negative rate in terms of collision checking and a 9.65%-58.91% higher success rate in collision handling.

Temporal | Action Recognition | Pose | Video | Motion Estimation (4 papers)

【1】 Pose-guided Generative Adversarial Net for Novel View Action Synthesis
Link: https://arxiv.org/abs/2110.07993

Authors: Xianhang Li, Junhao Zhang, Kunchang Li, Shruti Vyas, Yogesh S Rawat
Affiliations: CRCV, University of Central Florida; National University of Singapore; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Note: Accepted by WACV 2022
Abstract: We focus on the problem of novel-view human action synthesis. Given an action video, the goal is to generate the same action from an unseen viewpoint. Naturally, novel view video synthesis is more challenging than image synthesis. It requires the synthesis of a sequence of realistic frames with temporal coherency. Besides, transferring the different actions to a novel target view requires awareness of action category and viewpoint change simultaneously. To address these challenges, we propose a novel framework named Pose-guided Action Separable Generative Adversarial Net (PAS-GAN), which utilizes pose to alleviate the difficulty of this task. First, we propose a recurrent pose-transformation module which transforms actions from the source view to the target view and generates a novel view pose sequence in 2D coordinate space. Second, a well-transformed pose sequence enables us to separate the action and background in the target view. We employ a novel local-global spatial transformation module to effectively generate sequential video features in the target view using these action and background features. Finally, the generated video features are used to synthesize human action with the help of a 3D decoder. Moreover, to focus on dynamic action in the video, we propose a novel multi-scale action-separable loss which further improves the video quality. We conduct extensive experiments on two large-scale multi-view human action datasets, NTU-RGBD and PKU-MMD, demonstrating the effectiveness of PAS-GAN, which outperforms existing approaches.

【2】 EFENet: Reference-based Video Super-Resolution with Enhanced Flow Estimation 标题:EFENet:具有增强光流估计的基于参考的视频超分辨率 链接:https://arxiv.org/abs/2110.07797

作者:Yaping Zhao,Mengqi Ji,Ruqi Huang,Bin Wang,Shengjin Wang 机构:Tsinghua University,Hikvision 备注:12 pages, 6 figures 摘要:在本文中,我们考虑基于参考的视频超分辨率(RefVSR)的问题,即,如何利用高分辨率(HR)参考帧来超分辨低分辨率(LR)视频序列。现有的RefVSR方法本质上是在存在分辨率差距和长时间跨度的情况下,尝试将参考帧和输入序列对齐。然而,它们要么忽略输入序列中的时间结构,要么遭受累积对齐错误。为了解决这些问题,我们建议EFENet同时利用HR参考中包含的视觉线索和LR序列中包含的时间信息。EFENet首先全局估计参考帧和每个LR帧之间的交叉尺度流。然后,EFENet中新颖的流细化模块利用所有已估计的流来细化相对于最远帧的流,这利用了序列中的全局时间信息,因此有效地减少了对齐错误。我们提供了全面的评估,以验证我们方法的优势,并证明所提出的框架优于最先进的方法。代码可在https://github.com/IndigoPurple/EFENet 获取。 摘要:In this paper, we consider the problem of reference-based video super-resolution(RefVSR), i.e., how to utilize a high-resolution (HR) reference frame to super-resolve a low-resolution (LR) video sequence. The existing approaches to RefVSR essentially attempt to align the reference and the input sequence, in the presence of resolution gap and long temporal range. However, they either ignore temporal structure within the input sequence, or suffer accumulative alignment errors. To address these issues, we propose EFENet to exploit simultaneously the visual cues contained in the HR reference and the temporal information contained in the LR sequence. EFENet first globally estimates cross-scale flow between the reference and each LR frame. Then our novel flow refinement module of EFENet refines the flow regarding the furthest frame using all the estimated flows, which leverages the global temporal information within the sequence and therefore effectively reduces the alignment errors. We provide comprehensive evaluations to validate the strengths of our approach, and to demonstrate that the proposed framework outperforms the state-of-the-art methods. Code is available at https://github.com/IndigoPurple/EFENet.

【3】 Shaping embodied agent behavior with activity-context priors from egocentric video 标题:从以自我为中心的视频中利用活动情境先验塑造具体化代理行为 链接:https://arxiv.org/abs/2110.07692

作者:Tushar Nagarajan,Kristen Grauman 机构:UT Austin and Facebook AI Research 摘要:复杂的物理任务需要一系列对象交互,每个对象交互都有其自身的前提条件——这对于机器人代理来说,仅仅通过自己的经验来有效地学习是很困难的。我们介绍了一种方法来发现活动的背景先验知识,从野生的自我中心的视频捕获与人类佩戴的相机。对于给定对象,activity context Previor表示活动成功所需的一组其他兼容对象(例如,与番茄一起使用的刀和砧板有助于切割)。我们将基于Previor的视频编码为一个辅助奖励函数,鼓励代理在尝试交互之前将兼容对象聚集在一起。通过这种方式,我们的模型将日常人类经验转化为具体的代理技能。我们使用EPIC kitchen的视频演示了我们的想法,视频中的人们正在进行无脚本的厨房活动,这有助于虚拟家庭机器人代理在AI2 iTHOR中执行各种复杂任务,显著加快代理学习。项目页面:http://vision.cs.utexas.edu/projects/ego-rewards/ 摘要:Complex physical tasks entail a sequence of object interactions, each with its own preconditions -- which can be difficult for robotic agents to learn efficiently solely through their own experience. We introduce an approach to discover activity-context priors from in-the-wild egocentric video captured with human worn cameras. For a given object, an activity-context prior represents the set of other compatible objects that are required for activities to succeed (e.g., a knife and cutting board brought together with a tomato are conducive to cutting). We encode our video-based prior as an auxiliary reward function that encourages an agent to bring compatible objects together before attempting an interaction. In this way, our model translates everyday human experience into embodied agent skills. We demonstrate our idea using egocentric EPIC-Kitchens video of people performing unscripted kitchen activities to benefit virtual household robotic agents performing various complex tasks in AI2-iTHOR, significantly accelerating agent learning. Project page: http://vision.cs.utexas.edu/projects/ego-rewards/
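作为补充,下面给出一个最小的示意代码,说明如何把这类"活动-语境先验"编码为辅助奖励:与目标物体兼容的物体在场景中被聚齐得越多,奖励越高。其中的先验查询表、函数名与系数均为说明用途的假设,并非论文的实际实现。

def auxiliary_reward(target_obj, visible_objs, prior, lam=0.1):
    """示意:prior[target_obj] 是 {其他物体: 兼容概率} 的字典(假设的数据结构)。"""
    compat = prior.get(target_obj, {})
    # 与目标物体兼容的可见物体越全,辅助奖励越高
    score = sum(compat.get(o, 0.0) for o in visible_objs)
    norm = sum(compat.values()) or 1.0
    return lam * score / norm

prior = {"tomato": {"knife": 0.9, "cutting_board": 0.8, "plate": 0.3}}
print(auxiliary_reward("tomato", ["knife", "cutting_board"], prior))  # 输出 0.085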

【4】 Neural Dubber: Dubbing for Silent Videos According to Scripts 标题:Neural Dubber:根据脚本为无声视频配音 链接:https://arxiv.org/abs/2110.08243

作者:Chenxu Hu,Qiao Tian,Tingle Li,Yuping Wang,Yuxuan Wang,Hang Zhao 机构:Tsinghua University, ByteDance, Shanghai Qi Zhi Institute 备注:Accepted by NeurIPS 2021 摘要:配音是重新录制演员对话的后期制作过程,广泛用于电影制作和视频制作。它通常由专业的配音演员手动执行,他们以适当的韵律朗读台词,并与预先录制的视频同步。在这项工作中,我们提出了神经配音器,第一个神经网络模型来解决一个新的自动视频配音(AVD)任务:从文本中合成与给定无声视频同步的人类语音。神经配音器是一种多模态文本到语音(TTS)模型,它利用视频中的嘴唇运动来控制生成语音的韵律。此外,还开发了基于图像的说话人嵌入(ISE)模块,用于多说话人设置,使神经配音器能够根据说话人的脸型生成具有合理音色的语音。在化学讲座单说话人数据集和LRS2多说话人数据集上的实验表明,神经配音器可以在语音质量方面与最先进的TTS模型相媲美地生成语音音频。最重要的是,定性和定量评估表明,神经配音器可以通过视频控制合成语音的韵律,并生成与视频时间同步的高保真语音。 摘要:Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audios on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

医学相关(3篇)

【1】 Single volume lung biomechanics from chest computed tomography using a mode preserving generative adversarial network 标题:基于模式保持生成性对抗网络的胸部CT单体积肺生物力学研究 链接:https://arxiv.org/abs/2110.07878

作者:Muhammad F. A. Chaudhary,Sarah E. Gerard,Di Wang,Gary E. Christensen,Christopher B. Cooper,Joyce D. Schroeder,Eric A. Hoffman,Joseph M. Reinhardt 机构:⋆ University of Iowa, Iowa City, IA, USA, † University of California, Los Angeles, CA, USA, ‡ University of Utah, Salt Lake City, UT, USA 备注:5 pages, 5 figures 摘要:肺的局部组织扩张通常是通过配准在多个肺容积下获得的计算机断层扫描(CT)来获得的。然而,获得多次扫描会增加辐射剂量、时间和成本,在许多情况下可能不可能,因此限制了基于配准的生物力学的适用性。我们提出了一种生成性对抗学习方法,用于直接从单个CT扫描估计局部组织扩张。对来自SPIROMICS队列的2500名受试者进行了拟议框架的训练和评估。经过训练后,该框架可作为预测局部组织扩张的无需配准的方法。我们评估了不同疾病严重程度的模型性能,并将其性能与两种图像到图像转换框架(UNet和Pix2Pix)进行了比较。我们的模型在1 mm³的高空间分辨率下实现了18.95分贝的总PSNR、0.840的SSIM和0.61的Spearman相关性。 摘要:Local tissue expansion of the lungs is typically derived by registering computed tomography (CT) scans acquired at multiple lung volumes. However, acquiring multiple scans incurs increased radiation dose, time, and cost, and may not be possible in many cases, thus restricting the applicability of registration-based biomechanics. We propose a generative adversarial learning approach for estimating local tissue expansion directly from a single CT scan. The proposed framework was trained and evaluated on 2500 subjects from the SPIROMICS cohort. Once trained, the framework can be used as a registration-free method for predicting local tissue expansion. We evaluated model performance across varying degrees of disease severity and compared its performance with two image-to-image translation frameworks - UNet and Pix2Pix. Our model achieved an overall PSNR of 18.95 decibels, SSIM of 0.840, and Spearman's correlation of 0.61 at a high spatial resolution of 1 mm³.
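作为参考,摘要中报告的PSNR指标可以按如下方式计算(假设图像已归一化到[0,1];这只是指标定义的示意,与论文代码无关):

import numpy as np

def psnr(pred, target, max_val=1.0):
    # PSNR = 20*log10(MAX) - 10*log10(MSE),单位为分贝
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 20 * np.log10(max_val) - 10 * np.log10(mse)

a = np.random.rand(64, 64).astype(np.float32)
b = np.clip(a + np.random.normal(0, 0.1, a.shape), 0, 1)
print(psnr(b, a))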

【2】 Application of Homomorphic Encryption in Medical Imaging 标题:同态加密在医学成像中的应用 链接:https://arxiv.org/abs/2110.07768

作者:Francis Dutil,Alexandre See,Lisa Di Jorio,Florent Chandelier 机构:Imagia 摘要:在本技术报告中,我们探讨了同态加密(HE)在深度学习(DL)模型训练和预测中的使用,以提供严格的“Privacy by Design”(隐私始于设计)服务,并实施数据治理的零信任模型。首先,我们展示了如何利用HE对医学图像进行预测,同时防止数据未经授权的二次使用,并详细介绍了我们在OCT图像疾病分类任务上的结果。然后,我们证明HE可以通过联邦学习保障DL模型的训练,并报告了一些使用3D胸部CT扫描进行结节检测任务的实验。 摘要:In this technical report, we explore the use of homomorphic encryption (HE) in the context of training and predicting with deep learning (DL) models to deliver strict Privacy by Design services, and to enforce a zero-trust model of data governance. First, we show how HE can be used to make predictions over medical images while preventing unauthorized secondary use of data, and detail our results on a disease classification task with OCT images. Then, we demonstrate that HE can be used to secure the training of DL models through federated learning, and report some experiments using 3D chest CT-Scans for a nodule detection task.
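为了直观说明同态加密"可在密文上直接计算"这一性质,下面给出一个与论文方案无关的玩具示例:教科书式RSA具有乘法同态性,即E(a)*E(b) mod n = E(a*b)。参数为演示用的小数字,绝不可用于真实加密:

# 玩具示意:教科书式RSA的乘法同态性,仅用于说明概念,非论文所用的HE方案
n, e, d = 3233, 17, 2753          # n = 61*53,e*d ≡ 1 (mod 3120)
enc = lambda m: pow(m, e, n)
dec = lambda c: pow(c, d, n)

a, b = 7, 6
c = (enc(a) * enc(b)) % n         # 在密文上直接相乘
print(dec(c))                      # 42,等于明文乘积:无需解密即可完成计算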

【3】 3D Structure from 2D Microscopy images using Deep Learning 标题:基于深度学习的二维显微图像三维结构 链接:https://arxiv.org/abs/2110.07608

作者:Benjamin J. Blundell,Christian Sieben,Suliana Manley,Ed Rosten,QueeLim Ch'ng,Susan Cox 机构:Centre for Developmental Biology, Institute of Psychiatry, Psychology & Neuroscience, Nanoscale Infection Biology Lab (NIBI), Helmholtz Centre for Infection Research, Germany, ´Ecole Polytechnique F´ed´erale de Lausanne, Switzerland 备注:32 Pages, 12 figures. Awaiting publication in 'Frontiers in Bioinformatics - Computational Bioimaging' - this https URL 摘要:了解蛋白质复合物的结构对于确定其功能至关重要。然而,从显微镜图像中检索精确的三维结构是一项极具挑战性的任务,特别是因为许多成像模式都是二维的。人工智能的最新进展已经应用于这个问题,主要是使用基于体素的方法来分析电子显微镜图像集。在此,我们提出了一种深度学习解决方案,用于从大量二维单分子定位显微镜图像重建蛋白质复合物,该解决方案完全不受约束。我们的卷积神经网络与一个可微渲染器相结合,预测姿势并导出单一结构。经过训练后,对网络进行分解,该方法的输出为适合数据集的结构模型。我们展示了我们的系统在两种蛋白质复合物上的性能:CEP152(包括中心粒近端环形的一部分)和中心粒。 摘要:Understanding the structure of a protein complex is crucial in determining its function. However, retrieving accurate 3D structures from microscopy images is highly challenging, particularly as many imaging modalities are two-dimensional. Recent advances in Artificial Intelligence have been applied to this problem, primarily using voxel based approaches to analyse sets of electron microscopy images. Here we present a deep learning solution for reconstructing the protein complexes from a number of 2D single molecule localization microscopy images, with the solution being completely unconstrained. Our convolutional neural network coupled with a differentiable renderer predicts pose and derives a single structure. After training, the network is discarded, with the output of this method being a structural model which fits the data-set. We demonstrate the performance of our system on two protein complexes: CEP152 (which comprises part of the proximal toroid of the centriole) and centrioles.

GAN|对抗|攻击|生成相关(9篇)

【1】 Guiding Visual Question Generation 标题:引导可视化问题生成 链接:https://arxiv.org/abs/2110.08226

作者:Nihir Vedd,Zixu Wang,Marek Rei,Yishu Miao,Lucia Specia 机构:Department of Computing, Imperial College London 备注:11 pages including references and Appendix. 3 figures and 3 tables 摘要:在传统的视觉问题生成(VQG)中,大多数图像都有多个概念(例如对象和类别),可以为其生成问题,但模型经过训练以模仿训练数据中给出的任意概念选择。这使得训练变得困难,也给评估带来了问题——对于大多数图像,存在多个有效问题,但只有一个或几个问题被人类参考捕获。我们提出了引导性视觉问题生成——VQG的一种变体,它根据对问题类型及其应该探究的对象的期望,根据分类信息对问题生成器进行调节。我们提出了两种变体:(i)一种显式引导模型,使参与者(人工或自动)能够选择生成问题的对象和类别;和(ii)一个隐式引导模型,该模型基于离散的潜在变量,学习哪些对象和类别作为条件。建议的模型在答案类别增强的VQA数据集上进行了评估,我们的定量结果显示,与目前的技术水平相比,有了实质性的改进(增加了9个BLEU-4)。人类评估验证了指导有助于生成语法连贯且与给定图像和对象相关的问题。 摘要:In traditional Visual Question Generation (VQG), most images have multiple concepts (e.g. objects and categories) for which a question could be generated, but models are trained to mimic an arbitrary choice of concept as given in their training data. This makes training difficult and also poses issues for evaluation -- multiple valid questions exist for most images but only one or a few are captured by the human references. We present Guiding Visual Question Generation - a variant of VQG which conditions the question generator on categorical information based on expectations on the type of question and the objects it should explore. We propose two variants: (i) an explicitly guided model that enables an actor (human or automated) to select which objects and categories to generate a question for; and (ii) an implicitly guided model that learns which objects and categories to condition on, based on discrete latent variables. The proposed models are evaluated on an answer-category augmented VQA dataset and our quantitative results show a substantial improvement over the current state of the art (over 9 BLEU-4 increase). Human evaluation validates that guidance helps the generation of questions that are grammatically coherent and relevant to the given image and objects.

【2】 Multi-Tailed, Multi-Headed, Spatial Dynamic Memory refined Text-to-Image Synthesis 标题:多尾、多头、空间动态记忆精化文图合成 链接:https://arxiv.org/abs/2110.08143

作者:Amrit Diggavi Seshadri,Balaraman Ravindran 机构:Robert Bosch Center for Data Science and Artificial Intelligence, Indian Institute of Technology Madras 摘要:从文本描述合成高质量、逼真的图像是一项具有挑战性的任务,当前的方法以多阶段的方式从文本合成图像,通常是首先生成粗略的初始图像,然后在后续阶段细化图像细节。然而,遵循这一范式的现有方法受到三个重要限制。首先,它们合成初始图像,而不尝试在单词级别分离图像属性。因此,初始图像的对象属性(为后续细化提供基础)本质上是纠缠和模糊的。其次,通过对所有区域使用通用的文本表示,当前的方法阻止我们在图像的不同部分以根本不同的方式解释文本。因此,不同的图像区域只允许在每个细化阶段从文本中吸收相同类型的信息。最后,当前的方法在每个细化阶段只生成一次细化特征,并试图在一次拍摄中解决所有图像方面的问题。这种单镜头细化限制了每个细化阶段学习改进先前图像的精度。我们提出的方法引入了三个新的组件来解决这些缺点:(1)初始生成阶段,显式地为每个单词n-gram生成单独的图像特征集。(2) 一种空间动态存储模块,用于细化图像。(3) 一种迭代的多头机制,使其更容易在多个图像方面进行改进。实验结果表明,在CUB和COCO数据集上,我们使用多尾词级初始生成(MSMT-GAN)对多头空间动态记忆图像进行细化,与之前的最新技术相比表现良好。 摘要:Synthesizing high-quality, realistic images from text-descriptions is a challenging task, and current methods synthesize images from text in a multi-stage manner, typically by first generating a rough initial image and then refining image details at subsequent stages. However, existing methods that follow this paradigm suffer from three important limitations. Firstly, they synthesize initial images without attempting to separate image attributes at a word-level. As a result, object attributes of initial images (that provide a basis for subsequent refinement) are inherently entangled and ambiguous in nature. Secondly, by using common text-representations for all regions, current methods prevent us from interpreting text in fundamentally different ways at different parts of an image. Different image regions are therefore only allowed to assimilate the same type of information from text at each refinement stage. Finally, current methods generate refinement features only once at each refinement stage and attempt to address all image aspects in a single shot. This single-shot refinement limits the precision with which each refinement stage can learn to improve the prior image. Our proposed method introduces three novel components to address these shortcomings: (1) An initial generation stage that explicitly generates separate sets of image features for each word n-gram. (2) A spatial dynamic memory module for refinement of images. (3) An iterative multi-headed mechanism to make it easier to improve upon multiple image aspects. Experimental results demonstrate that our Multi-Headed Spatial Dynamic Memory image refinement with our Multi-Tailed Word-level Initial Generation (MSMT-GAN) performs favourably against the previous state of the art on the CUB and COCO datasets.

【3】 Adversarial Attacks on ML Defense Models Competition 标题:ML防御模型大赛的对抗性攻击 链接:https://arxiv.org/abs/2110.08042

作者:Yinpeng Dong,Qi-An Fu,Xiao Yang,Wenzhao Xiang,Tianyu Pang,Hang Su,Jun Zhu,Jiayu Tang,Yuefeng Chen,XiaoFeng Mao,Yuan He,Hui Xue,Chao Li,Ye Liu,Qilong Zhang,Lianli Gao,Yunrui Yu,Xitong Gao,Zhe Zhao,Daquan Lin,Jiadong Lin,Chuanbiao Song,Zihao Wang,Zhennan Wu,Yang Guo,Jiequan Cui,Xiaogang Xu,Pengguang Chen 机构: Tsinghua University, Alibaba Group, RealAI, Shanghai Jiao Tong University, University of Electronic Science and Technology of China, University of Macau, Chinese Academy of Sciences, ShanghaiTech University, Huazhong University of Science and Technology 备注:Competition Report 摘要:由于深层神经网络(DNN)对对抗性示例的脆弱性,近年来提出了大量防御技术来缓解这一问题。然而,不完整或不正确的稳健性评估通常会阻碍建立更稳健模型的进程。为了加速对图像分类中当前防御模型的对抗鲁棒性进行可靠评估的研究,清华大学TSAIL课题组和阿里巴巴安全团队组织了本次竞赛,并同期举办了CVPR 2021对抗性机器学习研讨会(https://aisecure-workshop.github.io/amlcvpr2021/)。本次竞赛的目的是激发新的攻击算法,以便更有效、更可靠地评估对抗鲁棒性。鼓励参与者开发更强大的白盒攻击算法,以发现不同防御的最坏情况鲁棒性。本次比赛在对抗鲁棒性评估平台ARES上进行(https://github.com/thu-ml/ares),并在天池平台上举行(https://tianchi.aliyun.com/competition/entrance/531847/introduction),作为AI安全挑战者计划系列之一。比赛结束后,我们总结了结果,并在https://ml.cs.tsinghua.edu.cn/ares-bench/ 上建立了新的对抗鲁棒性基准,允许用户上传对抗攻击算法和防御模型以进行评估。 摘要:Due to the vulnerability of deep neural networks (DNNs) to adversarial examples, a large number of defense techniques have been proposed to alleviate this problem in recent years. However, the progress of building more robust models is usually hampered by the incomplete or incorrect robustness evaluation. To accelerate the research on reliable evaluation of adversarial robustness of the current defense models in image classification, the TSAIL group at Tsinghua University and the Alibaba Security group organized this competition along with a CVPR 2021 workshop on adversarial machine learning (https://aisecure-workshop.github.io/amlcvpr2021/). The purpose of this competition is to motivate novel attack algorithms to evaluate adversarial robustness more effectively and reliably. The participants were encouraged to develop stronger white-box attack algorithms to find the worst-case robustness of different defenses. This competition was conducted on an adversarial robustness evaluation platform -- ARES (https://github.com/thu-ml/ares), and is held on the TianChi platform (https://tianchi.aliyun.com/competition/entrance/531847/introduction) as one of the series of AI Security Challengers Program. After the competition, we summarized the results and established a new adversarial robustness benchmark at https://ml.cs.tsinghua.edu.cn/ares-bench/, which allows users to upload adversarial attack algorithms and defense models for evaluation.
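作为背景,下面给出白盒攻击中最基础的FGSM的一个极简PyTorch示意(比赛中使用的攻击算法远比这强;model假定为任意可微分类器,仅作示意):

import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8/255):
    # 单步快速梯度符号法:沿损失梯度的符号方向施加扰动
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()  # 保持像素在合法范围内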

【4】 MaGNET: Uniform Sampling from Deep Generative Network Manifolds Without Retraining 标题:MaGNET:无需再训练即可从深度生成网络流形中均匀采样 链接:https://arxiv.org/abs/2110.08009

作者:Ahmed Imtiaz Humayun,Randall Balestriero,Richard Baraniuk 机构:Rice University 备注:13 pages, 14 pages Appendix, 23 figures 摘要:深度生成网络(DGN)广泛应用于生成性对抗网络(GAN)、变分自动编码器(VAE)及其变体中,以近似数据流形,以及该流形上的数据分布。然而,训练样本通常是基于偏好、成本或便利性获得的,这会在经验数据分布中产生人为痕迹,例如,CelebA数据集中的大部分笑脸或FFHQ中的大部分黑发个体。当从经过训练的DGN中进行采样时,这些不一致性将重现,这对公平性、数据增强、异常检测、域自适应等具有深远的潜在影响。作为回应,我们开发了一种基于微分几何的采样器(称为MaGNET),在给定任何已训练DGN的情况下,该采样器可以产生均匀分布在所学流形上的样本。我们从理论和经验上证明,无论训练集分布如何,我们的技术都能在流形上产生均匀分布。我们在各种数据集和DGN上进行了一系列实验。其中一个考虑了在FFHQ数据集上训练的最先进的StyleGAN2,通过MaGNET进行均匀采样可将分布精度和召回率分别提高4.1%和3.0%,并将性别偏见降低41.2%,而无需标签或再训练。 摘要:Deep Generative Networks (DGNs) are extensively employed in Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and their variants to approximate the data manifold, and data distribution on that manifold. However, training samples are often obtained based on preferences, costs, or convenience producing artifacts in the empirical data distribution e.g., the large fraction of smiling faces in the CelebA dataset or the large fraction of dark-haired individuals in FFHQ. These inconsistencies will be reproduced when sampling from the trained DGN, which has far-reaching potential implications for fairness, data augmentation, anomaly detection, domain adaptation, and beyond. In response, we develop a differential geometry based sampler -- coined MaGNET -- that, given any trained DGN, produces samples that are uniformly distributed on the learned manifold. We prove theoretically and empirically that our technique produces a uniform distribution on the manifold regardless of the training set distribution. We perform a range of experiments on various datasets and DGNs. One of them considers the state-of-the-art StyleGAN2 trained on FFHQ dataset, where uniform sampling via MaGNET increases distribution precision and recall by 4.1% & 3.0% and decreases gender bias by 41.2%, without requiring labels or retraining.

【5】 On Generating Identifiable Virtual Faces 标题:关于生成可识别虚拟人脸的研究 链接:https://arxiv.org/abs/2110.07986

作者:Zhuowen Yuan,Sheng Li,Xinpeng Zhang,Zhenxin Qian,Alex Kot 机构:Fudan University, Nanyang Technology University 摘要:使用生成模型的人脸匿名越来越普遍,因为它们通过生成虚拟人脸图像来净化私人信息,确保隐私和图像效用。在删除或保护原始身份后,此类虚拟人脸图像通常无法识别。在本文中,我们形式化和处理生成可识别的虚拟人脸图像的问题。为了保护隐私,我们的虚拟人脸图像在视觉上与原始图像不同。此外,它们还绑定了新的虚拟身份,可以直接用于人脸识别。我们提出了一种可识别的虚拟人脸生成器(IVFG)来生成虚拟人脸图像。IVFG根据用户特定密钥将原始人脸图像的潜在向量投影为虚拟向量,并基于该密钥生成虚拟人脸图像。为了使虚拟人脸图像具有可识别性,我们提出了一种多任务学习目标和三重训练策略来学习IVFG。各种实验证明了IVFG生成可识别虚拟人脸图像的有效性。 摘要:Face anonymization with generative models have become increasingly prevalent since they sanitize private information by generating virtual face images, ensuring both privacy and image utility. Such virtual face images are usually not identifiable after the removal or protection of the original identity. In this paper, we formalize and tackle the problem of generating identifiable virtual face images. Our virtual face images are visually different from the original ones for privacy protection. In addition, they are bound with new virtual identities, which can be directly used for face recognition. We propose an Identifiable Virtual Face Generator (IVFG) to generate the virtual face images. The IVFG projects the latent vectors of the original face images into virtual ones according to a user specific key, based on which the virtual face images are generated. To make the virtual face images identifiable, we propose a multi-task learning objective as well as a triplet styled training strategy to learn the IVFG. Various experiments demonstrate the effectiveness of the IVFG for generate identifiable virtual face images.

【6】 Data Generation using Texture Co-occurrence and Spatial Self-Similarity for Debiasing 标题:基于纹理共现和空间自相似的去偏数据生成 链接:https://arxiv.org/abs/2110.07920

作者:Myeongkyun Kang,Dongkyu Won,Miguel Luna,Kyung Soo Hong,June Hong Ahn,Sang Hyun Park 机构: Robotics Engineering, Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, Korea, Division of Pulmonology and Allergy, Department of Internal Medicine, Regional Center for Respiratory Diseases 摘要:在有偏数据集上训练的分类模型通常在分布外样本上表现不佳,因为有偏表示嵌入到模型中。最近,有人提出了对抗式学习方法来消除有偏表征,但在不改变其他相关信息的情况下仅丢弃有偏特征是一个挑战。在本文中,我们提出了一种新的去偏倚方法,该方法使用反向标记图像的纹理表示显式生成附加图像,以扩大训练数据集,并在训练分类器时减轻偏倚的影响。每个新生成的图像包含来自源图像的相似空间信息,同时从相反标签的目标图像传输纹理。我们的模型集成了纹理共现损失(确定生成图像的纹理是否与目标纹理相似)和空间自相似损失(确定生成图像和源图像之间的空间细节是否得到良好保留)。生成的和原始的训练图像进一步用于训练能够避免学习未知偏差表示的分类器。我们使用三个具有已知偏差的人工设计数据集,证明了我们的方法缓解偏差信息的能力,并报告了与现有最先进方法相比具有竞争力的性能。 摘要:Classification models trained on biased datasets usually perform poorly on out-of-distribution samples since biased representations are embedded into the model. Recently, adversarial learning methods have been proposed to disentangle biased representations, but it is challenging to discard only the biased features without altering other relevant information. In this paper, we propose a novel de-biasing approach that explicitly generates additional images using texture representations of oppositely labeled images to enlarge the training dataset and mitigate the effect of biases when training a classifier. Every new generated image contains similar spatial information from a source image while transferring textures from a target image of opposite label. Our model integrates a texture co-occurrence loss that determines whether a generated image's texture is similar to that of the target, and a spatial self-similarity loss that determines whether the spatial details between the generated and source images are well preserved. Both generated and original training images are further used to train a classifier that is able to avoid learning unknown bias representations. We employ three distinct artificially designed datasets with known biases to demonstrate the ability of our method to mitigate bias information, and report competitive performance over existing state-of-the-art methods.
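论文中纹理共现损失的具体实现未在摘要中给出;作为近似示意,下面用风格迁移中常见的Gram矩阵纹理损失,说明"度量两组特征的纹理相似度"的一种常用做法:

import torch

def gram(feat):
    # feat: (B, C, H, W) 的特征图;Gram 矩阵刻画通道间的二阶统计(纹理)
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def texture_loss(feat_gen, feat_target):
    # 生成图像与目标图像特征的纹理差异(示意,非论文原始定义)
    return torch.mean((gram(feat_gen) - gram(feat_target)) ** 2)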

【7】 Adversarial Purification through Representation Disentanglement 标题:通过表征解缠实现对抗性净化 链接:https://arxiv.org/abs/2110.07801

作者:Tao Bai,Jun Zhao,Lanqing Guo,Bihan Wen 机构:Nanyang Technological University 摘要:深度学习模型容易受到对抗性示例的攻击,并犯下无法理解的错误,这对其在现实世界中的部署构成了威胁。结合对抗训练的思想,基于预处理的防御因其任务独立性和良好的可推广性而广受欢迎,使用方便。目前的防御方法,特别是净化方法,倾向于通过学习并恢复自然图像来消除"噪声"。然而,与随机噪声不同,由于对抗模式与图像的强相关性,在模型训练过程中更容易过度拟合。在这项工作中,我们提出了一种新的对抗性净化方案,将自然图像和对抗性扰动的解纠缠作为预处理防御。通过大量的实验,我们的防御被证明是可推广的,并对未见过的强对抗攻击提供了重要的保护。它将最先进的集成(ensemble)攻击的成功率平均从61.7%降低到14.9%,优于许多现有方法。值得注意的是,我们的防御可以完美地恢复受扰动的图像,并且不会损害主干模型的干净精度,这在实践中是非常理想的。 摘要:Deep learning models are vulnerable to adversarial examples and make incomprehensible mistakes, which puts a threat on their real-world deployment. Combined with the idea of adversarial training, preprocessing-based defenses are popular and convenient to use because of their task independence and good generalizability. Current defense methods, especially purification, tend to remove "noise" by learning and recovering the natural images. However, different from random noise, the adversarial patterns are much easier to be overfitted during model training due to their strong correlation to the images. In this work, we propose a novel adversarial purification scheme by presenting disentanglement of natural images and adversarial perturbations as a preprocessing defense. With extensive experiments, our defense is shown to be generalizable and make significant protection against unseen strong adversarial attacks. It reduces the success rates of state-of-the-art ensemble attacks from 61.7% to 14.9% on average, superior to a number of existing methods. Notably, our defense restores the perturbed images perfectly and does not hurt the clean accuracy of backbone models, which is highly desirable in practice.

【8】 Adversarial Attack across Datasets 标题:跨数据集的对抗性攻击 链接:https://arxiv.org/abs/2110.07718

作者:Yunxiao Qin,Yuanhao Xiong,Jinfeng Yi,Cho-Jui Hsieh 机构:Communication University of China, University of California, Los Angeles, JD Technology 摘要:已经观察到,在无查询黑盒设置中,深度神经网络(DNN)容易受到迁移攻击。然而,以往关于迁移攻击的研究都假设攻击者拥有的白盒代理模型和黑盒受害者模型是在同一个数据集上训练的,这意味着攻击者隐式地知道受害者模型的标签集和输入大小。但是,这种假设通常是不现实的,因为攻击者可能不知道受害者模型使用的数据集,而且,攻击者需要攻击任何随机遇到的图像,这些图像可能不是来自同一数据集。因此,在本文中,我们定义了一个新的广义可迁移攻击(GTA)问题,其中我们假设攻击者有一组在不同数据集(具有不同的标签集和图像大小)上训练的代理模型,并且没有一个与受害者模型使用的数据集相等。然后,我们提出了一种称为图像分类擦除器(ICE)的新方法来擦除任意数据集中遇到的任何图像的分类信息。在Cifar-10、Cifar-100和TieredImageNet上的大量实验证明了所提出的ICE对GTA问题的有效性。此外,我们还表明,现有的迁移攻击方法可以进行修改以解决GTA问题,但与ICE相比,其性能要差得多。 摘要:It has been observed that Deep Neural Networks (DNNs) are vulnerable to transfer attacks in the query-free black-box setting. However, all the previous studies on transfer attack assume that the white-box surrogate models possessed by the attacker and the black-box victim models are trained on the same dataset, which means the attacker implicitly knows the label set and the input size of the victim model. However, this assumption is usually unrealistic as the attacker may not know the dataset used by the victim model, and further, the attacker needs to attack any randomly encountered images that may not come from the same dataset. Therefore, in this paper we define a new Generalized Transferable Attack (GTA) problem where we assume the attacker has a set of surrogate models trained on different datasets (with different label sets and image sizes), and none of them is equal to the dataset used by the victim model. We then propose a novel method called Image Classification Eraser (ICE) to erase classification information for any encountered images from arbitrary dataset. Extensive experiments on Cifar-10, Cifar-100, and TieredImageNet demonstrate the effectiveness of the proposed ICE on the GTA problem. Furthermore, we show that existing transfer attack methods can be modified to tackle the GTA problem, but with significantly worse performance compared with ICE.

【9】 Deep Human-guided Conditional Variational Generative Modeling for Automated Urban Planning 标题:深度人引导条件变分生成模型在城市规划自动化中的应用 链接:https://arxiv.org/abs/2110.07717

作者:Dongjie Wang,Kunpeng Liu,Pauline Johnson,Leilei Sun,Bowen Du,Yanjie Fu 机构:Department of Computer Science, University of Central Florida, Orlando, Department of Computer Science, Beihang University, Beijing 备注:ICDM2021 摘要:城市规划设计土地使用结构,有利于建设宜居、可持续、安全的社区。受图像生成的启发,深度城市规划旨在利用深度学习生成土地使用配置。然而,城市规划是一个复杂的过程。现有的研究通常忽略了规划中个性化的人的引导需求,以及规划生成中的空间层次结构。此外,缺乏大规模土地利用配置样本构成了数据稀疏性的挑战。本文研究了一种新的深度人文引导的城市规划方法,以共同解决上述挑战。具体地说,我们将问题转化为一个基于深度条件变分自动编码器的框架。在这个框架中,我们利用深度编码器-解码器设计来生成土地使用配置。为了捕获土地利用的空间层次结构,我们强制解码器生成功能区的粗粒度层和POI分布的细粒度层。为了整合人类指导,我们允许人类将他们需要的描述为文本,并将这些文本用作模型条件输入。为了减少训练数据的稀疏性,提高模型的鲁棒性,我们引入了一种变分高斯嵌入机制。它不仅使我们能够更好地逼近训练数据的嵌入空间分布,并对更大的人群进行抽样以克服稀疏性,而且在城市规划生成中增加了更多的概率随机性,以提高嵌入的多样性,从而提高鲁棒性。最后,我们进行了大量的实验来验证我们方法的增强性能。 摘要:Urban planning designs land-use configurations and can benefit building livable, sustainable, safe communities. Inspired by image generation, deep urban planning aims to leverage deep learning to generate land-use configurations. However, urban planning is a complex process. Existing studies usually ignore the need of personalized human guidance in planning, and spatial hierarchical structure in planning generation. Moreover, the lack of large-scale land-use configuration samples poses a data sparsity challenge. This paper studies a novel deep human guided urban planning method to jointly solve the above challenges. Specifically, we formulate the problem into a deep conditional variational autoencoder based framework. In this framework, we exploit the deep encoder-decoder design to generate land-use configurations. To capture the spatial hierarchy structure of land uses, we enforce the decoder to generate both the coarse-grained layer of functional zones, and the fine-grained layer of POI distributions. To integrate human guidance, we allow humans to describe what they need as texts and use these texts as a model condition input. To mitigate training data sparsity and improve model robustness, we introduce a variational Gaussian embedding mechanism. It not just allows us to better approximate the embedding space distribution of training data and sample a larger population to overcome sparsity, but also adds more probabilistic randomness into the urban planning generation to improve embedding diversity so as to improve robustness. Finally, we present extensive experiments to validate the enhanced performances of our method.

自动驾驶|车辆|车道检测等(1篇)

【1】 DG-Labeler and DGL-MOTS Dataset: Boost the Autonomous Driving Perception 标题:DG-Labeler和DGL-MOTS数据集:增强自主驾驶感知 链接:https://arxiv.org/abs/2110.07790

作者:Yiming Cui,Zhiwen Cao,Yixin Xie,Xingyu Jiang,Feng Tao,Yingjie Chen,Lin Li,Dongfang Liu 机构:University of Florida, Gainesville, FL , Purdue University, West Lafayette, IN , The University of Texas at El Paso, El Paso, TX , The University of Texas at San Antonio, San Antonio, TX , Yingjie Victor Chen, Rochester Institute of Technology, Rochester, NY 摘要:多目标跟踪与分割(MOTS)是自主驾驶应用中的一项关键任务。现有的MOTS研究面临两个关键挑战:1)公布的数据集未能充分反映网络训练所需的现实复杂性,难以应对各种驾驶环境;2)用于提高MOTS学习样本质量的标注工作流工具在文献中研究不足。在这项工作中,我们引入DG-Labeler和DGL-MOTS数据集,以方便MOTS任务的训练数据标注,从而提高网络训练的准确性和效率。DG-Labeler使用新的深度粒度模块来描述实例的空间关系,并生成细粒度的实例掩码。经DG-Labeler标注,我们的DGL-MOTS数据集在数据多样性、标注质量和时间表示方面超过了先前的工作(即KITTI MOTS和BDD100K)。广泛的跨数据集评估结果表明,在我们的DGL-MOTS数据集上训练的几种最先进的方法的性能有了显著提高。我们相信我们的DGL-MOTS数据集和DG-Labeler具有提升未来交通视觉感知的宝贵潜力。 摘要:Multi-object tracking and segmentation (MOTS) is a critical task for autonomous driving applications. The existing MOTS studies face two critical challenges: 1) the published datasets inadequately capture the real-world complexity for network training to address various driving settings; 2) the working pipeline annotation tool is under-studied in the literature to improve the quality of MOTS learning examples. In this work, we introduce the DG-Labeler and DGL-MOTS dataset to facilitate the training data annotation for the MOTS task and accordingly improve network training accuracy and efficiency. DG-Labeler uses the novel Depth-Granularity Module to depict the instance spatial relations and produce fine-grained instance masks. Annotated by DG-Labeler, our DGL-MOTS dataset exceeds the prior effort (i.e., KITTI MOTS and BDD100K) in data diversity, annotation quality, and temporal representations. Results on extensive cross-dataset evaluations indicate significant performance improvements for several state-of-the-art methods trained on our DGL-MOTS dataset. We believe our DGL-MOTS Dataset and DG-Labeler hold the valuable potential to boost the visual perception of future transportation.

人脸|人群计数(1篇)

【1】 Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? 标题:交流绘画的共享视觉表征:不同的偏见如何影响人类的解释力和意图? 链接:https://arxiv.org/abs/2110.08203

作者:Daniela Mihai,Jonathon Hare 机构:Electronics and Computer Science, The University of Southampton, Southampton, UK 摘要:我们研究了表征损失如何影响玩交流游戏的人工智能体所产生的绘画。基于最近的进展,我们表明,强大的预训练编码器网络与适当的归纳偏置相结合,可以使智能体绘制可识别的草图,同时仍能进行良好的交流。此外,我们开始开发一种方法来帮助自动分析草图所传达的语义内容,并证明尽管智能体的训练是自监督的,当前诱导感知偏置的方法仍使"对象性"成为一个关键特征。 摘要:We present an investigation into how representational losses can affect the drawings produced by artificial agents playing a communication game. Building upon recent advances, we show that a combination of powerful pretrained encoder networks, with appropriate inductive biases, can lead to agents that draw recognisable sketches, whilst still communicating well. Further, we start to develop an approach to help automatically analyse the semantic content being conveyed by a sketch and demonstrate that current approaches to inducing perceptual biases lead to a notion of objectness being a key feature despite the agent training being self-supervised.

跟踪(1篇)

【1】 Pyramid Correlation based Deep Hough Voting for Visual Object Tracking 标题:基于金字塔相关的深霍夫投票视觉目标跟踪 链接:https://arxiv.org/abs/2110.07994

作者:Ying Wang,Tingfa Xu,Jianan Li,Shenwang Jiang,Junjie Chen 机构:Beijing Institute of Technology, Beijing, China ∗, Editors: Vineeth N Balasubramanian and Ivor Tsang 备注:Accepted by ACML 2021 Conference Track (Short Oral) 摘要:大多数现有的基于孪生网络(Siamese)的跟踪器将跟踪问题视为分类和回归的并行任务。然而,一些研究表明,在网络训练过程中,这种并列头(sibling head)结构可能导致次优解。通过实验我们发现,在没有回归的情况下,只要我们精心设计适合训练目标的网络,性能同样有希望。我们提出了一种新的基于投票的纯分类跟踪算法,称为基于金字塔相关的Deep-Hough投票(简称PCDHV),用于联合定位目标的左上角和右下角。具体地说,我们创新性地构建了一个金字塔相关模块,使嵌入的特征具有细粒度的局部结构和全局空间上下文;精心设计的Deep-Hough投票模块进一步接管,集成像素的长距离依赖性以感知角点;此外,通过提高特征图的空间分辨率,同时利用通道空间关系,简单但有效地缓解了普遍存在的离散化差距。该算法具有通用性、鲁棒性和简单性。我们通过一系列消融实验证明了该模块的有效性。无需任何花哨技巧,我们的跟踪器在三个具有挑战性的基准(TrackingNet、GOT-10k和LaSOT)上实现了比SOTA算法更好或相当的性能,同时以80 FPS的实时速度运行。代码和模型将发布。 摘要:Most of the existing Siamese-based trackers treat tracking problem as a parallel task of classification and regression. However, some studies show that the sibling head structure could lead to suboptimal solutions during the network training. Through experiments we find that, without regression, the performance could be equally promising as long as we delicately design the network to suit the training objective. We introduce a novel voting-based classification-only tracking algorithm named Pyramid Correlation based Deep Hough Voting (short for PCDHV), to jointly locate the top-left and bottom-right corners of the target. Specifically we innovatively construct a Pyramid Correlation module to equip the embedded feature with fine-grained local structures and global spatial contexts; The elaborately designed Deep Hough Voting module further take over, integrating long-range dependencies of pixels to perceive corners; In addition, the prevalent discretization gap is simply yet effectively alleviated by increasing the spatial resolution of the feature maps while exploiting channel-space relationships. The algorithm is general, robust and simple. We demonstrate the effectiveness of the module through a series of ablation experiments. Without bells and whistles, our tracker achieves better or comparable performance to the SOTA algorithms on three challenging benchmarks (TrackingNet, GOT-10k and LaSOT) while running at a real-time speed of 80 FPS. Codes and models will be released.

裁剪|量化|加速|压缩相关(2篇)

【1】 Joint Channel and Weight Pruning for Model Acceleration on Mobile Devices 标题:面向移动设备模型加速的联合通道和权重剪枝 链接:https://arxiv.org/abs/2110.08013

作者:Tianli Zhao,Xi Sheryl Zhang,Wentao Zhu,Jiaxing Wang,Ji Liu,Jian Cheng 机构:†Institute of Automation, Chinese Academy of Sciences, §School of Artificial Intelligence, University of Chinese Academy of Sciences 备注:23 pages, 6 figures 摘要:对于移动设备上的实际深度神经网络设计,必须考虑计算资源与各类应用中推理延迟所带来的约束。在与深度网络加速相关的方法中,剪枝是一种广泛采用的平衡计算资源消耗和精度的方法,在这种方法中,可以按通道或随机地删除不重要的连接,对模型精度的影响最小。通道剪枝可立即显著减少延迟,而随机权重剪枝更灵活,可以平衡延迟和准确性。在本文中,我们提出了一个联合通道剪枝和权重剪枝(JCW)的统一框架,并在延迟和准确性之间实现了比以前的模型压缩方法更好的Pareto边界。为了充分优化延迟和准确度之间的权衡,我们在JCW框架中开发了一种定制的多目标进化算法,该算法可以通过一次搜索获得满足各种部署需求的最佳候选体系结构。大量的实验表明,JCW与ImageNet分类数据集上的各种最先进的剪枝方法相比,在延迟和准确性之间取得了更好的平衡。我们的代码可在https://github.com/jcw-anonymous/JCW 获取。 摘要:For practical deep neural network design on mobile devices, it is essential to consider the constraints incurred by the computational resources and the inference latency in various applications. Among deep network acceleration related approaches, pruning is a widely adopted practice to balance the computational resource consumption and the accuracy, where unimportant connections can be removed either channel-wisely or randomly with a minimal impact on model accuracy. The channel pruning instantly results in a significant latency reduction, while the random weight pruning is more flexible to balance the latency and accuracy. In this paper, we present a unified framework with Joint Channel pruning and Weight pruning (JCW), and achieves a better Pareto-frontier between the latency and accuracy than previous model compression approaches. To fully optimize the trade-off between the latency and accuracy, we develop a tailored multi-objective evolutionary algorithm in the JCW framework, which enables one single search to obtain the optimal candidate architectures for various deployment requirements. Extensive experiments demonstrate that the JCW achieves a better trade-off between the latency and accuracy against various state-of-the-art pruning methods on the ImageNet classification dataset. Our codes are available at https://github.com/jcw-anonymous/JCW.

【2】 PTQ-SL: Exploring the Sub-layerwise Post-training Quantization 标题:PTQ-SL:探索子层级的训练后量化 链接:https://arxiv.org/abs/2110.07809

作者:Zhihang Yuan,Yiqi Chen,Chenhao Xue,Chenguang Zhang,Qiankun Wang,Guangyu Sun 机构:Center for Energy-Efficient Computing and Applications, Peking University, Beijing, China, Houmo AI, Beijing, China 摘要:网络量化是一种有效的卷积神经网络压缩技术。量化粒度决定了如何在权重中共享缩放因子,从而影响网络量化的性能。对于卷积层的量化,大多数现有方法以逐层或逐通道的方式共享缩放因子。逐通道量化和逐层量化在各种应用中得到了广泛的应用。然而,很少研究其他量化粒度。在本文中,我们将探索跨多个输入和输出通道共享缩放因子的子层粒度。提出了一种有效的子层粒度训练后量化方法(PTQ-SL)。然后,我们对各种粒度进行了系统的实验,发现量化神经网络的预测精度与粒度有很强的相关性。此外,我们发现调整通道的位置可以提高子层量化的性能。因此,我们提出了一种对通道重新排序的方法,用于子层量化。实验表明,采用适当通道重排序的子层量化比逐通道量化具有更好的性能。 摘要:Network quantization is a powerful technique to compress convolutional neural networks. The quantization granularity determines how to share the scaling factors in weights, which affects the performance of network quantization. Most existing approaches share the scaling factors layerwisely or channelwisely for quantization of convolutional layers. Channelwise quantization and layerwise quantization have been widely used in various applications. However, other quantization granularities are rarely explored. In this paper, we will explore the sub-layerwise granularity that shares the scaling factor across multiple input and output channels. We propose an efficient post-training quantization method in sub-layerwise granularity (PTQ-SL). Then we systematically experiment on various granularities and observe that the prediction accuracy of the quantized neural network has a strong correlation with the granularity. Moreover, we find that adjusting the position of the channels can improve the performance of sub-layerwise quantization. Therefore, we propose a method to reorder the channels for sub-layerwise quantization. The experiments demonstrate that the sub-layerwise quantization with appropriate channel reordering can outperform the channelwise quantization.
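下面是"子层级共享缩放因子"这一思路的极简示意:缩放因子既不是整层一个,也不是每个通道一个,而是在若干输出通道组成的组内共享(分组方式与函数名均为示意假设,并非论文实现):

import torch

def quantize_sublayerwise(w, group_size=4, n_bits=8):
    # w: 卷积层权重 (out_ch, in_ch, kH, kW);每 group_size 个输出通道共享一个缩放因子
    out_ch = w.shape[0]
    qmax = 2 ** (n_bits - 1) - 1
    w_q = torch.empty_like(w)
    for g in range(0, out_ch, group_size):
        blk = w[g:g + group_size]
        scale = blk.abs().max() / qmax   # 组内共享的缩放因子
        w_q[g:g + group_size] = torch.round(blk / scale).clamp(-qmax, qmax) * scale
    return w_q

w = torch.randn(16, 8, 3, 3)
print((quantize_sublayerwise(w) - w).abs().mean())  # 量化误差的粗略观察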

多模态(1篇)

【1】 Multi-modal Aggregation Network for Fast MR Imaging 标题:用于快速磁共振成像的多模式聚合网络 链接:https://arxiv.org/abs/2110.08080

作者:Chun-Mei Feng,Huazhu Fu,Tianfei Zhou,Yong Xu,Ling Shao,David Zhang 摘要:磁共振(MR)成像是一种常用的扫描技术,用于疾病检测、诊断和治疗监测。虽然它能够生成具有更好对比度的器官和组织的详细图像,但它的采集时间很长,这使得图像质量很容易受到运动伪影等因素的影响。近年来,为了加速磁共振成像,人们开发了许多方法来从部分观测的测量数据重建全采样图像。然而,这些工作大多集中在单个模态的重建或多模态的简单融合上,而忽略了在不同特征水平上发现相关知识。在这项工作中,我们提出了一种新的多模态聚合网络,称为MANet,它能够从完全采样的辅助模态中发现互补表示,并用它分层指导给定目标模态的重建。在我们的MANet中,来自完全采样辅助模态和欠采样目标模态的表示是通过特定网络独立学习的。然后,在每个卷积阶段引入引导注意模块,有选择地聚集多模态特征以更好地重建,从而产生全面、多尺度、多模态的特征融合。此外,我们的MANet遵循一种混合域学习框架,允许它同时恢复k-空间域中的频率信号以及从图像域恢复图像细节。大量实验表明,该方法优于目前最先进的MR图像重建方法。 摘要:Magnetic resonance (MR) imaging is a commonly used scanning technique for disease detection, diagnosis and treatment monitoring. Although it is able to produce detailed images of organs and tissues with better contrast, it suffers from a long acquisition time, which makes the image quality vulnerable to, say, motion artifacts. Recently, many approaches have been developed to reconstruct full-sampled images from partially observed measurements in order to accelerate MR imaging. However, most of these efforts focus on reconstruction over a single modality or simple fusion of multiple modalities, neglecting the discovery of correlation knowledge at different feature level. In this work, we propose a novel Multi-modal Aggregation Network, named MANet, which is capable of discovering complementary representations from a fully sampled auxiliary modality, with which to hierarchically guide the reconstruction of a given target modality. In our MANet, the representations from the fully sampled auxiliary and undersampled target modalities are learned independently through a specific network. Then, a guided attention module is introduced in each convolutional stage to selectively aggregate multi-modal features for better reconstruction, yielding comprehensive, multi-scale, multi-modal feature fusion. Moreover, our MANet follows a hybrid domain learning framework, which allows it to simultaneously recover the frequency signal in the k-space domain as well as restore the image details from the image domain. Extensive experiments demonstrate the superiority of the proposed approach over state-of-the-art MR image reconstruction methods.
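作为背景,下面用numpy演示加速MR成像的问题设定:对k-空间欠采样并做零填充重建,得到的带伪影图像即重建网络的输入(掩码模式为示意假设,与MANet本身无关):

import numpy as np

img = np.random.rand(128, 128)             # 用随机图代替真实MR切片,仅作演示
k = np.fft.fftshift(np.fft.fft2(img))      # 全采样k-空间

mask = np.zeros_like(k, dtype=bool)
mask[:, ::4] = True                         # 每4条相位编码线取1条(约4倍加速,示意)
mask[:, 60:68] = True                       # 保留低频中心区域

zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(k * mask)))
print(zero_filled.shape)                    # 带欠采样伪影的零填充图像,即网络输入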

3D|3D重建等相关(2篇)

【1】 3D Reconstruction of Curvilinear Structures with Stereo Matching Deep Convolutional Neural Networks 标题:基于立体匹配深度卷积神经网络的曲线结构三维重建 链接:https://arxiv.org/abs/2110.07766

作者:Okan Altingövde,Anastasiia Mishchuk,Gulnaz Ganeeva,Emad Oveisi,Cecile Hebert,Pascal Fua 机构:Computer Vision Laboratory, ´Ecole Polytechnique F´ed´erale de Lausanne (EPFL), Switzerland, Electron Spectrometry and Microscopy Laboratory, ´Ecole Polytechnique F´ed´erale de Lausanne (EPFL), Switzerland 摘要:曲线结构经常作为感兴趣的对象出现在显微镜成像中。晶体缺陷,即位错,是在透射电镜(TEM)下反复研究的曲线结构之一,其三维结构信息对于理解材料的性质具有重要意义。位错的三维信息通常通过断层扫描获得,这是一个繁琐的过程,因为它需要获得许多具有不同倾角和类似成像条件的图像。尽管替代的立体视觉方法将所需图像的数量减少到两个,但它们仍然需要人为干预和形状先验才能进行准确的3D估计。我们提出了一种利用深度卷积神经网络(CNN)检测和匹配立体对中曲线结构的全自动管道,无需对三维形状进行任何事先假设。在这项工作中,我们主要致力于从TEM立体图像对中重建位错。 摘要:Curvilinear structures frequently appear in microscopy imaging as the object of interest. Crystallographic defects, i.e. dislocations, are one of the curvilinear structures that have been repeatedly investigated under transmission electron microscopy (TEM) and their 3D structural information is of great importance for understanding the properties of materials. 3D information of dislocations is often obtained by tomography which is a cumbersome process since it is required to acquire many images with different tilt angles and similar imaging conditions. Although alternative stereoscopy methods lower the number of required images to two, they still require human intervention and shape priors for accurate 3D estimation. We propose a fully automated pipeline for both detection and matching of curvilinear structures in stereo pairs by utilizing deep convolutional neural networks (CNNs) without making any prior assumption on 3D shapes. In this work, we mainly focus on 3D reconstruction of dislocations from stereo pairs of TEM images.

【2】 Pre-training Molecular Graph Representation with 3D Geometry 标题:基于三维几何的预训练分子图表示 链接:https://arxiv.org/abs/2110.07728

作者:Shengchao Liu,Hanchen Wang,Weiyang Liu,Joan Lasenby,Hongyu Guo,Jian Tang 机构:Mila, Université de Montréal, University of Cambridge, MPI for Intelligent Systems, Tübingen, National Research Council Canada, HEC Montréal, CIFAR AI Chair 摘要:分子图表征学习是现代药物和材料发现中的一个基本问题。分子图通常由其二维拓扑结构建模,但最近发现三维几何信息在预测分子功能方面起着更重要的作用。然而,现实场景中缺乏三维信息,这严重阻碍了几何图形表示的学习。为了应对这一挑战,我们提出了图形多视图预训练(GraphMVP)框架,其中通过利用二维拓扑结构和三维几何视图之间的对应性和一致性来执行自监督学习(SSL)。GraphMVP有效地学习2D分子图编码器,该编码器通过更丰富和更具辨别力的3D几何结构得到增强。我们进一步提供理论见解来证明GraphMVP的有效性。最后,综合实验表明,GraphMVP的性能始终优于现有的graph SSL方法。 摘要:Molecular graph representation learning is a fundamental problem in modern drug and material discovery. Molecular graphs are typically modeled by their 2D topological structures, but it has been recently discovered that 3D geometric information plays a more vital role in predicting molecular functionalities. However, the lack of 3D information in real-world scenarios has significantly impeded the learning of geometric graph representation. To cope with this challenge, we propose the Graph Multi-View Pre-training (GraphMVP) framework where self-supervised learning (SSL) is performed by leveraging the correspondence and consistency between 2D topological structures and 3D geometric views. GraphMVP effectively learns a 2D molecular graph encoder that is enhanced by richer and more discriminative 3D geometry. We further provide theoretical insights to justify the effectiveness of GraphMVP. Finally, comprehensive experiments show that GraphMVP can consistently outperform existing graph SSL methods.
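GraphMVP的具体目标函数以论文为准;作为这类2D/3D跨视图自监督学习的常见形式,下面给出两组嵌入之间InfoNCE对比损失的极简示意:

import torch
import torch.nn.functional as F

def info_nce(z2d, z3d, tau=0.1):
    # z2d/z3d: 同一批分子在2D拓扑视图与3D几何视图下的嵌入,(B, D)
    z2d, z3d = F.normalize(z2d, dim=1), F.normalize(z3d, dim=1)
    logits = z2d @ z3d.t() / tau            # 相似度矩阵:对角线为正样本对
    labels = torch.arange(z2d.size(0))
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))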

其他神经网络|深度学习|模型|建模(7篇)

【1】 The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color 标题:章鱼的世界:报道偏见如何影响语言模型对颜色的感知 链接:https://arxiv.org/abs/2110.08182

作者:Cory Paik,Stéphane Aroca-Ouellette,Alessandro Roncone,Katharina Kann 机构:University of Colorado Boulder 备注:Accepted to EMNLP 2021, 9 Pages 摘要:最近的工作引起了人们对纯文本训练固有局限性的关注。在本文中,我们首先证明了报告偏差,即人们不陈述明显情况的倾向,是造成这一限制的原因之一,然后调查多模式训练在多大程度上可以缓解这一问题。为此,我们1)生成颜色数据集(CoDa),这是521个常见对象的人类感知颜色分布数据集;2)使用CoDa分析和比较文本中发现的颜色分布、语言模型捕获的分布以及人类对颜色的感知;3)调查纯文本模型和多模态模型在CoDa上的性能差异。我们的研究结果表明,语言模型恢复的颜色分布与文本中发现的不准确分布的相关性比与基本事实的相关性更强,支持报告偏见对纯文本训练产生负面影响和内在限制的说法。然后,我们证明了多模态模型可以利用其视觉训练来缓解这些影响,为未来的研究提供了一个有希望的途径。 摘要:Recent work has raised concerns about the inherent limitations of text-only pretraining. In this paper, we first demonstrate that reporting bias, the tendency of people to not state the obvious, is one of the causes of this limitation, and then investigate to what extent multimodal training can mitigate this issue. To accomplish this, we 1) generate the Color Dataset (CoDa), a dataset of human-perceived color distributions for 521 common objects; 2) use CoDa to analyze and compare the color distribution found in text, the distribution captured by language models, and a human's perception of color; and 3) investigate the performance differences between text-only and multimodal models on CoDa. Our results show that the distribution of colors that a language model recovers correlates more strongly with the inaccurate distribution found in text than with the ground-truth, supporting the claim that reporting bias negatively impacts and inherently limits text-only training. We then demonstrate that multimodal models can leverage their visual training to mitigate these effects, providing a promising avenue for future research.

【2】 Learning to Infer Kinematic Hierarchies for Novel Object Instances 标题:学习推断新对象实例的运动学层次 链接:https://arxiv.org/abs/2110.07911

作者:Hameed Abdul-Rashid,Miles Freeman,Ben Abbatematteo,George Konidaris,Daniel Ritchie 机构:Brown University, University of Illinois at Urbana-Champaign 摘要:操纵铰接对象需要感知其运动学层次:其各个部分、每个部分如何移动以及这些运动如何耦合。以前的工作已经探索了面向运动学的感知,但没有一项工作能在不依赖模式(schema)或模板的情况下,在从未见过的对象实例上推断出完整的运动学层次。我们提出了一种实现这一目标的新型感知系统。我们的系统推断出物体的运动部件以及关联它们的运动学耦合。为了推断部件,它使用点云实例分割神经网络;为了推断运动学层次,它使用图神经网络来预测关联所推断部件的边(即关节)的存在性、方向和类型。我们使用合成3D模型的模拟扫描来训练这些网络。我们通过3D对象的模拟扫描来评估我们的系统,并且我们展示了用我们的系统驱动真实世界机器人操作的概念验证。 摘要:Manipulating an articulated object requires perceiving its kinematic hierarchy: its parts, how each can move, and how those motions are coupled. Previous work has explored perception for kinematics, but none infers a complete kinematic hierarchy on never-before-seen object instances, without relying on a schema or template. We present a novel perception system that achieves this goal. Our system infers the moving parts of an object and the kinematic couplings that relate them. To infer parts, it uses a point cloud instance segmentation neural network and to infer kinematic hierarchies, it uses a graph neural network to predict the existence, direction, and type of edges (i.e. joints) that relate the inferred parts. We train these networks using simulated scans of synthetic 3D models. We evaluate our system on simulated scans of 3D objects, and we demonstrate a proof-of-concept use of our system to drive real-world robotic manipulation.

【3】 NeuroView: Explainable Deep Network Decision Making 标题:NeuroView:可解释的深度网络决策 链接:https://arxiv.org/abs/2110.07778

作者:CJ Barberan,Randall Balestriero,Richard G. Baraniuk 机构:Department of Electrical, and Computer Engineering, Rice University, Houston, TX 备注:12 pages, 7 figures 摘要:深度神经网络(DNs)在许多计算机视觉任务中提供了超人的性能,但仍不清楚DN的哪些单元对特定决策有贡献。NeuroView是一个新的DN体系结构家族,可通过设计进行解释。通过对单位输出值进行矢量量化并将其输入到全局线性分类器中,该家族的每个成员都从标准DN体系结构中派生出来。由此产生的体系结构在每个单元的状态和分类决策之间建立了直接的因果关系。我们在标准数据集和分类任务上验证NeuroView,以显示其单元/类映射如何帮助理解决策过程。 摘要:Deep neural networks (DNs) provide superhuman performance in numerous computer vision tasks, yet it remains unclear exactly which of a DN's units contribute to a particular decision. NeuroView is a new family of DN architectures that are interpretable/explainable by design. Each member of the family is derived from a standard DN architecture by vector quantizing the unit output values and feeding them into a global linear classifier. The resulting architecture establishes a direct, causal link between the state of each unit and the classification decision. We validate NeuroView on standard datasets and classification tasks to show that how its unit/class mapping aids in understanding the decision-making process.

【4】 4D flight trajectory prediction using a hybrid Deep Learning prediction method based on ADS-B technology: a case study of Hartsfield-Jackson Atlanta International Airport(ATL) 标题:基于ADS-B技术的混合深度学习预测方法在4D飞行轨迹预测中的应用--以亚特兰大哈茨菲尔德-杰克逊国际机场(ATL)为例 链接:https://arxiv.org/abs/2110.07774

作者:Hesam Sahfienya,Amelia C. Regan 备注:17 pages, 10 figures 摘要:任何飞行计划的核心都是飞行轨迹。特别是,4D轨迹是飞行属性预测最关键的组成部分。每个轨迹都包含与不确定性相关的空间和时间特征,这些不确定性使得预测过程变得复杂。今天,由于对航空运输的需求不断增加,机场和航空公司必须有一个优化的时间表,以利用机场的所有基础设施潜力。而这可以通过先进的轨迹预测方法来实现。针对哈茨菲尔德-杰克逊亚特兰大国际机场(ATL)预测模型的不确定性,本文提出了一种新的混合深度学习模型来提取机场的时空特征。自动相关监视广播(ADS-B)数据用作模型的输入。本研究分三步进行:(a)数据预处理;(b)通过卷积神经网络和门控递归单元(CNN-GRU)以及3D-CNN模型进行预测;(c)第三步也是最后一步,通过对比实验结果,将其他模型的性能与所提出的模型进行比较。采用蒙特卡罗Dropout(MC-Dropout)来考虑深层模型的不确定性。Monte-Carlo Dropout被添加到网络层中,通过在不同神经元之间切换关闭的鲁棒方式来增强模型的预测性能。结果表明,与其他模型(即3D CNN、CNN-GRU)相比,该模型具有较低的误差度量。带有MC-Dropout的模型将误差进一步平均降低21%。 摘要:The core of any flight schedule is the trajectories. In particular, 4D trajectories are the most crucial component for flight attribute prediction. Each trajectory contains spatial and temporal features that are associated with uncertainties that make the prediction process complex. Today because of the increasing demand for air transportation, it is compulsory for airports and airlines to have an optimized schedule to use all of the airport's infrastructure potential. This is possible using advanced trajectory prediction methods. This paper proposes a novel hybrid deep learning model to extract the spatial and temporal features considering the uncertainty of the prediction model for Hartsfield-Jackson Atlanta International Airport (ATL). Automatic Dependent Surveillance-Broadcast (ADS-B) data are used as input to the models. This research is conducted in three steps: (a) data preprocessing; (b) prediction by a hybrid Convolutional Neural Network and Gated Recurrent Unit (CNN-GRU) along with a 3D-CNN model; (c) The third and last step is the comparison of the model's performance with the proposed model by comparing the experimental results. The deep model uncertainty is considered using the Monte-Carlo dropout (MC-Dropout). Monte-Carlo dropouts are added to the network layers to enhance the model's prediction performance by a robust approach of switching off between different neurons. The results show that the proposed model has low error measurements compared to the other models (i.e., 3D CNN, CNN-GRU). The model with MC-dropout reduces the error further by an average of 21%.
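下面给出MC-Dropout核心做法的极简PyTorch示意:推断时保持dropout激活,多次前向取均值作为预测、取标准差作为不确定性估计(网络结构为示意假设,并非论文模型):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                       # 关键:推断时仍让dropout随机关闭神经元
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)  # 均值作预测,标准差作不确定性

mean, std = mc_dropout_predict(model, torch.randn(4, 10))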

【5】 Decomposing Convolutional Neural Networks into Reusable and Replaceable Modules 标题:将卷积神经网络分解为可重用和可替换的模块 链接:https://arxiv.org/abs/2110.07720

作者:Rangeet Pan,Hridesh Rajan 机构:Dept. of Computer Science, Iowa State University, Atanasoff Hall, Ames, IA, USA 摘要:从头开始训练是建立基于卷积神经网络(CNN)的模型的最常用方法。如果我们可以通过重用以前构建的CNN模型中的部分来构建新的CNN模型呢?如果我们可以通过用其他部件替换(可能有故障的)部件来改进CNN模型呢?在这两种情况下,我们是否可以确定负责模型中每个输出类(模块)的部件,而不是进行训练,并仅重用或替换所需的输出类来构建模型?之前的工作建议将密集型网络分解为模块(每个输出类对应一个模块),以便在各种场景中实现可重用性和可替换性。然而,这项工作仅限于密集层,并且基于连续层中节点之间的一对一关系。由于CNN模型中的共享架构,先前的工作无法直接调整。在本文中,我们建议将用于图像分类问题的CNN模型分解为每个输出类的模块。这些模块可以进一步重用或替换以构建新模型。我们使用CIFAR-10、CIFAR-100和ImageNet微小数据集以及三种不同的ResNet模型对我们的方法进行了评估,发现实现分解的成本很小(top-1和top-5的准确率分别为2.38%和0.81%)。此外,通过重用或替换模块构建模型的平均精度损失为2.3%和0.5%。此外,与从头开始训练模型相比,重用和更换这些模块可将二氧化碳排放量减少约37倍。 摘要:Training from scratch is the most common way to build a Convolutional Neural Network (CNN) based model. What if we can build new CNN models by reusing parts from previously build CNN models? What if we can improve a CNN model by replacing (possibly faulty) parts with other parts? In both cases, instead of training, can we identify the part responsible for each output class (module) in the model(s) and reuse or replace only the desired output classes to build a model? Prior work has proposed decomposing dense-based networks into modules (one for each output class) to enable reusability and replaceability in various scenarios. However, this work is limited to the dense layers and based on the one-to-one relationship between the nodes in consecutive layers. Due to the shared architecture in the CNN model, prior work cannot be adapted directly. In this paper, we propose to decompose a CNN model used for image classification problems into modules for each output class. These modules can further be reused or replaced to build a new model. We have evaluated our approach with CIFAR-10, CIFAR-100, and ImageNet tiny datasets with three variations of ResNet models and found that enabling decomposition comes with a small cost (2.38% and 0.81% for top-1 and top-5 accuracy, respectively). Also, building a model by reusing or replacing modules can be done with a 2.3% and 0.5% average loss of accuracy. Furthermore, reusing and replacing these modules reduces CO2e emission by ~37 times compared to training the model from scratch.

【6】 Non-deep Networks 标题:非深度网络 链接:https://arxiv.org/abs/2110.07641

作者:Ankit Goyal,Alexey Bochkovskiy,Jia Deng,Vladlen Koltun 机构:Princeton University, Intel Labs 摘要:深度是深度神经网络的特征。但深度越深意味着顺序计算越多,延迟越大。这就引出了一个问题——有可能建立高性能的“非深度”神经网络吗?我们证明了这一点。为此,我们使用并行子网络,而不是一层接一层地堆叠。这有助于在保持高性能的同时有效减少深度。通过利用并行子结构,我们首次表明,深度仅为12的网络可以在ImageNet上达到80%以上的顶级精度,在CIFAR10上达到96%,在CIFAR100上达到81%。我们还表明,具有低深度(12)主干的网络可以在MS-COCO上实现48%的AP。我们分析了设计中的缩放规则,并展示了如何在不改变网络深度的情况下提高性能。最后,我们提供了如何使用非深度网络构建低延迟识别系统的概念证明。代码可在https://github.com/imankgoyal/NonDeepNetworks. 摘要:Depth is the hallmark of deep neural networks. But more depth means more sequential computation and higher latency. This begs the question -- is it possible to build high-performing "non-deep" neural networks? We show that it is. To do so, we use parallel subnetworks instead of stacking one layer after another. This helps effectively reduce depth while maintaining high performance. By utilizing parallel substructures, we show, for the first time, that a network with a depth of just 12 can achieve top-1 accuracy over 80% on ImageNet, 96% on CIFAR10, and 81% on CIFAR100. We also show that a network with a low-depth (12) backbone can achieve an AP of 48% on MS-COCO. We analyze the scaling rules for our design and show how to increase performance without changing the network's depth. Finally, we provide a proof of concept for how non-deep networks could be used to build low-latency recognition systems. Code is available at https://github.com/imankgoyal/NonDeepNetworks.
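作为示意,下面给出"用并行分支代替深度堆叠"这一思路的一个极简PyTorch模块(分支设计为假设,并非论文的原始结构):

import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # 三个不同感受野的分支并行计算,而非串行堆叠加深网络
        self.branches = nn.ModuleList([
            nn.Conv2d(c_in, c_out, k, padding=k // 2) for k in (1, 3, 5)
        ])
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(sum(b(x) for b in self.branches))  # 求和融合各分支

y = ParallelBlock(3, 16)(torch.randn(1, 3, 32, 32))
print(y.shape)  # torch.Size([1, 16, 32, 32])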

【7】 Prediction of Lung CT Scores of Systemic Sclerosis by Cascaded Regression Neural Networks 标题:级联回归神经网络预测系统性硬化症肺部CT评分 链接:https://arxiv.org/abs/2110.08085

作者:Jingnan Jia,Marius Staring,Irene Hernández-Girón,Lucia J. M. Kroft,Anne A. Schouffoer,Berend C. Stoel 机构:Division of Image Processing, bDepartment of Radiology, cDepartment of Rheumatology, Leiden, University Medical Center (LUMC), P.O. Box , RC, Leiden, The Netherlands. 备注:SPIE 2022 accepted 摘要:通过CT扫描对系统性硬化症患者的肺部病变进行可视化评分在监测病情进展方面起着重要作用,但其劳动强度限制了实际应用。因此,我们提出了一个由两个级联的深度回归神经网络组成的自动评分框架。第一个(3D)网络旨在预测3D CT扫描上五个解剖定义的评分水平的颅尾位置。第二个(2D)网络接收生成的2D轴向切片并预测分数。我们使用227个3D CT扫描来训练和验证第一个网络,得到的1135个轴向切片用于第二个网络。两位专家对数据子集进行独立评分,以获得观察者内部和观察者之间的变量,并一致获得所有数据的基本事实。为了缓解第二个网络中训练标签的不平衡,我们引入了采样技术,并为了增加训练样本的多样性,生成了模拟磨砂玻璃和网状图案的合成数据。4倍交叉验证表明,我们提出的网络的平均MAE为5.90、4.66和4.49,总分(TOT)、毛玻璃(GG)和网状模式(RET)的加权kappa分别为0.66、0.58和0.65。我们的网络在TOT和GG预测方面的表现略差于最好的专家,但在RET预测方面具有竞争力,并且有可能成为CT胸部研究中SSc视觉评分的客观替代方案。 摘要:Visually scoring lung involvement in systemic sclerosis from CT scans plays an important role in monitoring progression, but its labor intensiveness hinders practical application. We proposed, therefore, an automatic scoring framework that consists of two cascaded deep regression neural networks. The first (3D) network aims to predict the craniocaudal position of five anatomically defined scoring levels on the 3D CT scans. The second (2D) network receives the resulting 2D axial slices and predicts the scores. We used 227 3D CT scans to train and validate the first network, and the resulting 1135 axial slices were used in the second network. Two experts scored independently a subset of data to obtain intra- and interobserver variabilities and the ground truth for all data was obtained in consensus. To alleviate the unbalance in training labels in the second network, we introduced a sampling technique and to increase the diversity of the training samples synthetic data was generated, mimicking ground glass and reticulation patterns. The 4-fold cross validation showed that our proposed network achieved an average MAE of 5.90, 4.66 and 4.49, weighted kappa of 0.66, 0.58 and 0.65 for total score (TOT), ground glass (GG) and reticular pattern (RET), respectively. Our network performed slightly worse than the best experts on TOT and GG prediction but it has competitive performance on RET prediction and has the potential to be an objective alternative for the visual scoring of SSc in CT thorax studies.

Others (11 papers)

【1】 Combining Diverse Feature Priors Link: https://arxiv.org/abs/2110.08220

Authors: Saachi Jain, Dimitris Tsipras, Aleksander Madry
Affiliations: MIT
Abstract: To improve model generalization, model designers often restrict the features that their models use, either implicitly or explicitly. In this work, we explore the design space of leveraging such feature priors by viewing them as distinct perspectives on the data. Specifically, we find that models trained with diverse sets of feature priors have less overlapping failure modes and can thus be combined more effectively. Moreover, we demonstrate that jointly training such models on additional (unlabeled) data allows them to correct each other's mistakes, which, in turn, leads to better generalization and resilience to spurious correlations. Code is available at https://github.com/MadryLab/copriors.
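The abstract does not spell out the joint-training procedure, so the sketch below uses one standard realization of the idea, confidence-thresholded pseudo-labeling between two models on unlabeled data, purely as an assumed illustration of how models with different priors can correct each other's mistakes.

```python
# Assumed illustration (not the authors' released code): two models trained
# with different feature priors pseudo-label unlabeled data for each other.
import torch
import torch.nn.functional as F

def cotrain_step(model_a, model_b, opt_a, opt_b, unlabeled_x, threshold=0.9):
    with torch.no_grad():  # each model produces candidate labels for its partner
        conf_a, pseudo_a = F.softmax(model_a(unlabeled_x), dim=1).max(dim=1)
        conf_b, pseudo_b = F.softmax(model_b(unlabeled_x), dim=1).max(dim=1)
    mask_a, mask_b = conf_a > threshold, conf_b > threshold

    if mask_a.any():  # b learns from a's confident predictions
        loss_b = F.cross_entropy(model_b(unlabeled_x[mask_a]), pseudo_a[mask_a])
        opt_b.zero_grad()
        loss_b.backward()
        opt_b.step()
    if mask_b.any():  # and vice versa
        loss_a = F.cross_entropy(model_a(unlabeled_x[mask_b]), pseudo_b[mask_b])
        opt_a.zero_grad()
        loss_a.backward()
        opt_a.step()
```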

【2】 Trade-offs of Local SGD at Scale: An Empirical Study Link: https://arxiv.org/abs/2110.08133

Authors: Jose Javier Gonzalez Ortiz, Jonathan Frankle, Mike Rabbat, Ari Morcos, Nicolas Ballas
Affiliations: MIT CSAIL, Facebook AI Research
Abstract: As datasets and models become increasingly large, distributed training has become a necessary component for allowing deep neural networks to train in reasonable amounts of time. However, distributed training can have substantial communication overhead that hinders its scalability. One strategy for reducing this overhead is to perform multiple unsynchronized SGD steps independently on each worker between synchronization steps, a technique known as local SGD. We conduct a comprehensive empirical study of local SGD and related methods on a large-scale image classification task. We find that performing local SGD comes at a price: lower communication costs (and thereby faster training) are accompanied by lower accuracy. This finding is in contrast to the smaller-scale experiments of prior work, suggesting that local SGD encounters challenges at scale. We further show that incorporating the slow-momentum framework of Wang et al. (2020) consistently improves accuracy without requiring additional communication, hinting at future directions for potentially escaping this trade-off.
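The core loop of local SGD is simple enough to sketch directly: each worker takes several unsynchronized SGD steps on its own shard, then all workers average their parameters in a single communication step. A minimal sketch with torch.distributed follows; process-group setup and data sharding are assumed to happen elsewhere.

```python
# Minimal sketch of one local-SGD round: H independent steps, one averaging.
import torch
import torch.distributed as dist

def local_sgd_round(model, optimizer, loss_fn, local_batches, world_size):
    for x, y in local_batches:          # H unsynchronized local steps, no communication
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    with torch.no_grad():               # one synchronization: parameter averaging
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data /= world_size
```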

【3】 FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes Link: https://arxiv.org/abs/2110.08059

Authors: David W. Romero, Robert-Jan Bruintjes, Jakub M. Tomczak, Erik J. Bekkers, Mark Hoogendoorn, Jan C. van Gemert
Affiliations: Vrije Universiteit Amsterdam; Delft University of Technology; University of Amsterdam, The Netherlands
Note: First two authors contributed equally to this work
Abstract: When designing Convolutional Neural Networks (CNNs), one must select the size of the convolutional kernels before training. Recent works show CNNs benefit from different kernel sizes at different layers, but exploring all possible combinations is infeasible in practice. A more efficient approach is to learn the kernel size during training. However, existing works that learn the kernel size have limited bandwidth: these approaches scale kernels by dilation, so the detail they can describe is limited. In this work, we propose FlexConv, a novel convolutional operation with which high-bandwidth convolutional kernels of learnable kernel size can be learned at a fixed parameter cost. FlexNets model long-term dependencies without the use of pooling, achieve state-of-the-art performance on several sequential datasets, outperform recent works with learned kernel sizes, and are competitive with much deeper ResNets on image benchmark datasets. Additionally, FlexNets can be deployed at higher resolutions than those seen during training. To avoid aliasing, we propose a novel kernel parameterization with which the frequency of the kernels can be analytically controlled. Our novel kernel parameterization shows higher descriptive power and faster convergence than existing parameterizations, leading to considerable improvements in classification accuracy.
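To make the mechanism concrete, here is a hedged sketch of a continuous-kernel convolution with a differentiable size: a small MLP maps relative coordinates to kernel values, and a Gaussian envelope with a learnable width acts as the kernel-size control. This illustrates the general idea only; it is not FlexConv's exact parameterization.

```python
# Sketch of a continuous kernel with a learnable, differentiable size
# (assumed parameterization for illustration; not the paper's exact one).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousKernelConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, max_size=33):
        super().__init__()
        self.register_buffer("coords", torch.linspace(-1.0, 1.0, max_size))
        self.kernel_net = nn.Sequential(                 # relative coordinate -> kernel values
            nn.Linear(1, 32), nn.GELU(), nn.Linear(32, in_ch * out_ch)
        )
        self.log_sigma = nn.Parameter(torch.tensor(0.0))  # learnable kernel width
        self.in_ch, self.out_ch = in_ch, out_ch

    def forward(self, x):                                 # x: (B, in_ch, L)
        k = self.kernel_net(self.coords.unsqueeze(1))     # (max_size, in*out)
        k = k.t().reshape(self.out_ch, self.in_ch, -1)
        # Gaussian envelope: gradients w.r.t. log_sigma adjust the effective size
        mask = torch.exp(-0.5 * (self.coords / self.log_sigma.exp()) ** 2)
        return F.conv1d(x, k * mask, padding=k.shape[-1] // 2)

layer = ContinuousKernelConv1d(3, 8)
print(layer(torch.randn(2, 3, 100)).shape)  # torch.Size([2, 8, 100])
```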

【4】 Advances and Challenges in Deep Lip Reading Link: https://arxiv.org/abs/2110.07879

Authors: Marzieh Oghbaie, Arian Sabaghi, Kooshan Hashemifard, Mohammad Akbari
Affiliations: Imam Khomeini International University, Qazvin, Iran; Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran; University of Alicante, Alicante, Spain; School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
Abstract: Driven by deep learning techniques and large-scale datasets, recent years have witnessed a paradigm shift in automatic lip reading. While the main thrust of Visual Speech Recognition (VSR) has been improving the accuracy of Audio Speech Recognition systems, other potential applications, such as biometric identification, and the promised gains of VSR systems have motivated extensive efforts to develop lip-reading technology. This paper provides a comprehensive survey of state-of-the-art deep learning-based VSR research, with a focus on data challenges, task-specific complications, and the corresponding solutions. Advancements in these directions will expedite the transformation of silent-speech interfaces from theory to practice. We also discuss the main modules of a VSR pipeline and the influential datasets. Finally, we introduce some typical VSR application concerns and impediments in real-world scenarios, as well as future research directions.

【5】 Gait-based Frailty Assessment using Image Representation of IMU Signals and Deep CNN Link: https://arxiv.org/abs/2110.07821

Authors: Muhammad Zeeshan Arshad, Dawoon Jung, Mina Park, Hyungeun Shin, Jinwook Kim, Kyung-Ryoul Mun
Note: Accepted at the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2021)
Abstract: Frailty is a common and critical condition in elderly adults, which may lead to further deterioration of health. However, difficulties and complexities exist in traditional frailty assessments based on activity-related questionnaires. These can be overcome by monitoring the effects of frailty on gait. In this paper, it is shown that, by encoding gait signals as images, deep learning-based models can be utilized for the classification of gait type. Two deep learning models are proposed: (a) SS-CNN, based on single-stride input images, and (b) MS-CNN, based on 3 consecutive strides. It is shown that MS-CNN performs best, with an accuracy of 85.1%, while SS-CNN achieved an accuracy of 77.3%. This is because MS-CNN can observe more features corresponding to stride-to-stride variations, one of the key symptoms of frailty. Gait signals were encoded as images using STFT, CWT, and GAF. While the MS-CNN model using GAF images achieved the best overall accuracy and precision, CWT had slightly better recall. This study demonstrates how image-encoded gait data can be used to exploit the full potential of deep learning CNN models for the assessment of frailty.
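Of the three encodings, the Gramian Angular Field (GAF) is compact enough to sketch: the signal is rescaled to [-1, 1], mapped to angles, and the pairwise cosine of summed angles forms the image. The window length and normalization details below are assumptions for illustration.

```python
# Gramian Angular (Summation) Field encoding of a 1-D signal into an image.
import numpy as np

def gramian_angular_field(signal):
    s = np.asarray(signal, dtype=float)
    s = 2 * (s - s.min()) / (s.max() - s.min() + 1e-12) - 1  # rescale to [-1, 1]
    phi = np.arccos(np.clip(s, -1.0, 1.0))                   # polar (angular) encoding
    return np.cos(phi[:, None] + phi[None, :])               # GASF image, shape (N, N)

gait_window = np.sin(np.linspace(0, 4 * np.pi, 128))         # stand-in for one stride of IMU data
image = gramian_angular_field(gait_window)
print(image.shape)  # (128, 128)
```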

【6】 Occupancy Estimation from Thermal Images Link: https://arxiv.org/abs/2110.07796

Authors: Zishan Qin, Dipankar Chaki, Abdallah Lakhdari, Amani Abusafia, Athman Bouguettaya
Affiliations: School of Computing, Australian National University, Canberra, Australia; School of Computer Science, University of Sydney, Australia
Note: 4 pages, 2 figures. This is an accepted demo paper, to be published in the proceedings of the 19th International Conference on Service-Oriented Computing (ICSOC 2021)
Abstract: We propose a non-intrusive, privacy-preserving occupancy estimation system for smart environments. The proposed scheme uses thermal images to detect the number of people in a given area. The occupancy estimation model is designed using the concepts of intensity-based and motion-based human segmentation. The notions of a difference catcher, connected-component labeling, a noise filter, and memory propagation are utilized to estimate the occupancy count. We use a real dataset to demonstrate the effectiveness of the proposed system.
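A minimal sketch of such a pipeline, with assumed thresholds, might look as follows: background subtraction as the difference catcher, connected-component labeling, and a size-based noise filter that discards small blobs before counting.

```python
# Sketch of a thermal occupancy counter (assumed thresholds; illustrative only).
import numpy as np
from scipy import ndimage

def estimate_occupancy(frame, background, diff_thresh=4.0, min_area=20):
    diff = np.abs(frame.astype(float) - background.astype(float))
    mask = diff > diff_thresh                    # "difference catcher": changed (warm) pixels
    labels, n = ndimage.label(mask)              # connected-component labeling
    areas = ndimage.sum(mask, labels, index=range(1, n + 1))
    return int(np.sum(areas >= min_area))        # noise filter: drop tiny blobs

background = np.full((60, 80), 22.0)             # synthetic empty-room thermal frame (deg C)
frame = background.copy()
frame[10:20, 10:18] = 33.0                       # two warm, person-sized blobs
frame[35:47, 50:60] = 34.0
print(estimate_occupancy(frame, background))     # 2
```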

【7】 Multifocal Stereoscopic Projection Mapping Link: https://arxiv.org/abs/2110.07726

Authors: Sorashi Kimura, Daisuke Iwai, Parinya Punpongsanon, Kosuke Sato
Abstract: Stereoscopic projection mapping (PM) allows a user to see a three-dimensional (3D) computer-generated (CG) object floating over physical surfaces of arbitrary shapes around us using projected imagery. However, current stereoscopic PM technology only satisfies binocular cues and is not capable of providing correct focus cues, which causes a vergence-accommodation conflict (VAC). Therefore, we propose a multifocal approach to mitigate VAC in stereoscopic PM. Our primary technical contribution is to attach electrically focus-tunable lenses (ETLs) to active shutter glasses to control both vergence and accommodation. Specifically, we apply fast and periodic focal sweeps to the ETLs, which causes the "virtual image" (as an optical term) of a scene observed through the ETLs to move back and forth during each sweep period. A 3D CG object is projected from a synchronized high-speed projector only when the virtual image of the projected imagery is located at the desired distance. This provides an observer with the correct focus cues. In this study, we solve three technical issues that are unique to stereoscopic PM: (1) the 3D CG object must be displayed on non-planar and even moving surfaces; (2) the physical surfaces need to be shown without the focus modulation; (3) the shutter glasses additionally need to be synchronized with the ETLs and the projector. We also develop a novel compensation technique to deal with the "lens breathing" artifact, which varies the retinal size of the virtual image with focal-length modulation. Further, using a proof-of-concept prototype, we demonstrate that our technique can present the virtual image of a target 3D CG object at the correct depth. Finally, we validate the advantage provided by our technique by comparing it with conventional stereoscopic PM in a user study on a depth-matching task.

【8】 Appearance Editing with Free-viewpoint Neural Rendering Link: https://arxiv.org/abs/2110.07674

Authors: Pulkit Gera, Aakash KT, Dhawal Sirikonda, Parikshit Sakurikar, P. J. Narayanan
Affiliations: CVIT, KCIS, IIIT Hyderabad; DreamVu Inc.
Abstract: We present a neural rendering framework for simultaneous view synthesis and appearance editing of a scene from multi-view images captured under known environment illumination. Existing approaches either achieve view synthesis alone, or view synthesis along with relighting, without direct control over the scene's appearance. Our approach explicitly disentangles the appearance and learns a lighting representation that is independent of it. Specifically, we independently estimate the BRDF and use it to learn a lighting-only representation of the scene. Such disentanglement allows our approach to generalize to arbitrary changes in appearance while performing view synthesis. We show results of editing the appearance of a real scene, demonstrating that our approach produces plausible appearance edits. The performance of our view synthesis approach is demonstrated to be on par with state-of-the-art approaches on both real and synthetic data.
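Why disentanglement enables editing can be shown with a toy Lambertian example (not the paper's neural model): shading factors into appearance (albedo) times an appearance-independent lighting term, so swapping the albedo re-renders the scene under the same, independently estimated lighting.

```python
# Toy illustration of appearance/lighting disentanglement with a Lambertian
# BRDF: shading = albedo * max(0, n.l) * light. All values are made up.
import numpy as np

def shade_lambertian(albedo, normals, light_dir, light_rgb):
    l = np.asarray(light_dir, dtype=float)
    l /= np.linalg.norm(l)
    ndotl = np.clip(normals @ l, 0.0, None)          # (H, W) cosine term
    return albedo * ndotl[..., None] * light_rgb     # (H, W, 3) shaded image

H, W = 4, 4
normals = np.tile([0.0, 0.0, 1.0], (H, W, 1))        # flat surface facing the camera
lighting = ([0.3, 0.5, 0.8], [1.0, 0.9, 0.8])        # fixed, appearance-independent

red = shade_lambertian(np.full((H, W, 3), [0.8, 0.1, 0.1]), normals, *lighting)
blue = shade_lambertian(np.full((H, W, 3), [0.1, 0.1, 0.8]), normals, *lighting)  # edited appearance
print(red[0, 0], blue[0, 0])   # same illumination, different appearance
```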

【9】 Augmenting Imitation Experience via Equivariant Representations Link: https://arxiv.org/abs/2110.07668

Authors: Dhruv Sharma, Alihusein Kuwajerwala, Florian Shkurti
Affiliations: University of Toronto, Robotics Institute
Note: 7 pages (including references), 15 figures
Abstract: The robustness of visual navigation policies trained through imitation often hinges on the augmentation of the training image-action pairs. Traditionally, this has been done by collecting data from multiple cameras, by using standard data augmentations from computer vision, such as adding random noise to each image, or by synthesizing training images. In this paper we show that there is another practical alternative for data augmentation in visual navigation, based on extrapolating viewpoint embeddings and actions near the ones observed in the training data. Our method makes use of the geometry of the visual navigation problem in 2D and 3D and relies on policies that are functions of equivariant embeddings, as opposed to images. Given an image-action pair from a training navigation dataset, our neural network model predicts the latent representations of images at nearby viewpoints, using the equivariance property, and augments the dataset. We then train a policy on the augmented dataset. Our simulation results indicate that policies trained in this way exhibit reduced cross-track error and require fewer interventions compared to policies trained using standard augmentation methods. We also show similar results for autonomous visual navigation by a real ground robot along a path of over 500 m.
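A toy sketch of the equivariance idea, under heavily simplified assumptions: if the embedding transforms predictably under a viewpoint change (here, a planar rotation acting on stacked 2-D feature pairs), new embedding-action pairs at nearby viewpoints can be synthesized without rendering new images. The action adjustment below is a made-up illustration, not the paper's policy.

```python
# Toy equivariant augmentation: rotate the embedding instead of the image.
import torch

def rotate_embedding(z, theta):
    """Apply a planar rotation to an embedding of stacked (x, y) feature pairs."""
    c, s = torch.cos(theta), torch.sin(theta)
    rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return (z.reshape(-1, 2) @ rot.T).reshape(z.shape)

def augment(z, action_angle, deltas=(-0.1, 0.1)):
    """From one (embedding, action) pair, synthesize pairs at nearby viewpoints."""
    pairs = [(z, action_angle)]
    for d in deltas:
        # assumed heuristic: a small viewpoint rotation shifts the steering action
        pairs.append((rotate_embedding(z, torch.tensor(d)), action_angle - d))
    return pairs

z = torch.randn(16)                       # embedding = eight 2-D equivariant features
augmented = augment(z, action_angle=0.3)
print(len(augmented))                     # 3
```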

【10】 Interactive Analysis of CNN Robustness Link: https://arxiv.org/abs/2110.07667

Authors: Stefan Sietzen, Mathias Lechner, Judy Borowski, Ramin Hasani, Manuela Waldner
Note: Accepted at Pacific Graphics 2021
Abstract: While convolutional neural networks (CNNs) have found wide adoption as state-of-the-art models for image-related tasks, their predictions are often highly sensitive to small input perturbations to which human vision is robust. This paper presents Perturber, a web-based application that allows users to instantaneously explore how CNN activations and predictions evolve when a 3D input scene is interactively perturbed. Perturber offers a large variety of scene modifications, such as camera controls, lighting and shading effects, background modifications, object morphing, as well as adversarial attacks, to facilitate the discovery of potential vulnerabilities. Fine-tuned model versions can be directly compared for qualitative evaluation of their robustness. Case studies with machine learning experts have shown that Perturber helps users quickly generate hypotheses about model vulnerabilities and qualitatively compare model behavior. Using quantitative analyses, we could replicate users' insights with other CNN architectures and input images, yielding new insights about the vulnerability of adversarially trained models.

【11】 HumBugDB: A Large-scale Acoustic Mosquito Dataset Link: https://arxiv.org/abs/2110.07607

Authors: Ivan Kiskin, Marianne Sinka, Adam D. Cobb, Waqas Rafique, Lawrence Wang, Davide Zilli, Benjamin Gutteridge, Rinita Dam, Theodoros Marinos, Yunpeng Li, Dickson Msaky, Emmanuel Kaindoa, Gerard Killeen, Eva Herreros-Moya, Kathy J. Willis, Stephen J. Roberts
Affiliations: University of Oxford; SRI International; Mind Foundry Ltd; University of Surrey; IHI Tanzania; UCC, BEES
Note: Accepted at the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Datasets and Benchmarks Track. 10 pages main, 39 pages including appendix. This paper accompanies the dataset found at this https URL with corresponding code at this https URL
Abstract: This paper presents the first large-scale multi-species dataset of acoustic recordings of mosquitoes tracked continuously in free flight. We present 20 hours of audio recordings that we have expertly labelled and precisely tagged in time. Significantly, 18 hours of recordings contain annotations from 36 different species. Mosquitoes are well-known carriers of diseases such as malaria, dengue and yellow fever. Collecting this dataset is motivated by the need to assist applications that utilise mosquito acoustics to conduct surveys, helping to predict outbreaks and inform intervention policy. The task of detecting mosquitoes from the sound of their wingbeats is challenging due to the difficulty of collecting recordings in realistic scenarios. To address this, as part of the HumBug project, we conducted global experiments to record mosquitoes ranging from those bred in culture cages to mosquitoes captured in the wild. Consequently, the audio recordings vary in signal-to-noise ratio and contain a broad range of indoor and outdoor background environments from Tanzania, Thailand, Kenya, the USA and the UK. In this paper we describe in detail how we collected, labelled and curated the data. The data is provided from a PostgreSQL database, which contains important metadata such as the capture method, age, feeding status and gender of the mosquitoes. Additionally, we provide code to extract features and train Bayesian convolutional neural networks for two key tasks: the identification of mosquitoes from their corresponding background environments, and the classification of detected mosquitoes into species. Our extensive dataset is both challenging to machine learning researchers focusing on acoustic identification and critical to entomologists, geo-spatial modellers and other domain experts seeking to understand mosquito behaviour, model their distribution, and manage the threat they pose to humans.
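As a sketch of how a Bayesian CNN can deliver predictions with uncertainty for the two tasks, the toy below uses Monte Carlo dropout, one common approximation: dropout stays active at test time, and repeated stochastic forward passes yield a predictive mean and spread. The architecture and feature shapes are assumptions for illustration, not the HumBugDB reference model.

```python
# Toy Bayesian CNN via MC dropout (assumed architecture; illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MosquitoCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Dropout2d(0.2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):
        # dropout kept active at test time to sample from the approximate posterior
        return self.fc(F.dropout(self.conv(x), p=0.2, training=True))

@torch.no_grad()
def mc_predict(model, x, samples=30):
    probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(samples)])
    return probs.mean(0), probs.std(0)    # predictive mean and per-class uncertainty

spectrogram = torch.randn(4, 1, 64, 128)  # batch of (assumed) log-mel-style features
mean, std = mc_predict(MosquitoCNN(), spectrogram)
print(mean.shape, std.shape)              # torch.Size([4, 2]) twice
```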

