Computer Vision and Pattern Recognition Academic Digest [12.22]

WeChat official account: arXiv Daily Academic Digest

Published 2021-12-24 08:49:57

cs.CV: 62 papers today

Transformer (4 papers)

【1】 iSegFormer: Interactive Image Segmentation with Transformers
Link: https://arxiv.org/abs/2112.11325

Authors: Qin Liu
Affiliation: Computer Science
Abstract: We propose iSegFormer, a novel transformer-based approach for interactive image segmentation. iSegFormer is built upon existing segmentation transformers with user clicks as an additional input, allowing users to interactively and iteratively refine the segmentation mask.

【2】 SOIT: Segmenting Objects with Instance-Aware Transformers
Link: https://arxiv.org/abs/2112.11037

Authors: Xiaodong Yu, Dahu Shi, Xing Wei, Ye Ren, Tingqun Ye, Wenming Tan
Affiliations: Hikvision Research Institute, Hangzhou, China; School of Software Engineering, Xi’an Jiaotong University
Note: AAAI 2022
Abstract: This paper presents an end-to-end instance segmentation framework, termed SOIT, that Segments Objects with Instance-aware Transformers. Inspired by DETR (Carion et al., 2020), our method views instance segmentation as a direct set prediction problem and effectively removes the need for many hand-crafted components like RoI cropping, one-to-many label assignment, and non-maximum suppression (NMS). In SOIT, multiple queries are learned to directly reason a set of object embeddings of semantic category, bounding-box location, and pixel-wise mask in parallel under the global image context. The class and bounding-box can be easily embedded by a fixed-length vector. The pixel-wise mask, especially, is embedded by a group of parameters to construct a lightweight instance-aware transformer. Afterward, a full-resolution mask is produced by the instance-aware transformer without involving any RoI-based operation. Overall, SOIT introduces a simple single-stage instance segmentation framework that is both RoI- and NMS-free. Experimental results on the MS COCO dataset demonstrate that SOIT outperforms state-of-the-art instance segmentation approaches significantly. Moreover, the joint learning of multiple tasks in a unified query embedding can also substantially improve the detection performance. Code is available at https://github.com/yuxiaodongHRI/SOIT.
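The "direct set prediction" view that SOIT takes over from DETR depends on a one-to-one matching between predicted queries and ground-truth objects. The sketch below illustrates only that matching idea. The cost values are hypothetical, `match_queries` is an illustrative name, and a brute-force search stands in for the Hungarian algorithm that a practical implementation would more likely use. It is not the authors' code.

```python
from itertools import permutations

def match_queries(cost):
    """Return the lowest-cost one-to-one assignment of queries to targets.

    cost[q][g] is a hypothetical matching cost between query q and
    ground-truth object g. Assumes there are at least as many queries
    as ground-truth objects. Brute force is used for clarity and is
    only practical for a handful of objects.
    """
    num_queries = len(cost)
    num_targets = len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Try every ordered choice of distinct queries, one per target.
    for queries in permutations(range(num_queries), num_targets):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_assignment = total, queries
    # best_assignment[g] is the query matched to ground-truth object g.
    return list(best_assignment), best_total

# Three queries, two ground-truth objects (toy numbers).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
assignment, total = match_queries(cost)
```

Here object 0 is matched to query 1 and object 1 to query 0; the unmatched query would be supervised as "no object", which is why no NMS step is needed to remove duplicates.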

【3】 MPViT: Multi-Path Vision Transformer for Dense Prediction
Link: https://arxiv.org/abs/2112.11010

Authors: Youngwan Lee, Jonghee Kim, Jeff Willette, Sung Ju Hwang
Affiliations: Electronics and Telecommunications Research Institute (ETRI), South Korea; Korea Advanced Institute of Science and Technology (KAIST), South Korea; AITRICS, South Korea
Note: technical report
Abstract: Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representation for detecting or classifying objects or regions with varying sizes. While Convolutional Neural Networks (CNNs) have been the dominant architectures for such tasks, recently introduced Vision Transformers (ViTs) aim to replace them as a backbone. Similar to CNNs, ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for multi-scale representation with single-scale patches. In this work, with a different perspective from existing Transformers, we explore multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding. Tokens of different scales are then independently fed into the Transformer encoders via multiple paths and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level. Thanks to the diverse, multi-scale feature representations, our MPViTs scaling from tiny (5M) to base (73M) consistently achieve superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These extensive results demonstrate that MPViT can serve as a versatile backbone network for various vision tasks. Code will be made publicly available at https://git.io/MPViT.
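One way to see how patches of different scales can yield the same sequence length is the standard convolution output-size formula: with a shared stride and "same"-style padding of `kernel // 2`, the number of output tokens does not depend on the kernel size. The feature size, stride, and kernel sizes below are assumptions chosen for illustration, not values taken from the paper.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def token_counts(feature_size=56, stride=2, kernels=(3, 5, 7)):
    """Number of tokens produced by each overlapping patch embedding.

    Each path uses a different (hypothetical) kernel size but the same
    stride, with padding = kernel // 2 so the output grids line up.
    """
    counts = {}
    for k in kernels:
        side = conv_output_size(feature_size, k, stride, padding=k // 2)
        counts[k] = side * side
    return counts

counts = token_counts()
```

All three paths produce a 28 x 28 grid, so their tokens can be processed in parallel and aggregated position by position.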

【4】 Lite Vision Transformer with Enhanced Self-Attention
Link: https://arxiv.org/abs/2112.10809

Authors: Chenglin Yang, Yilin Wang, Jianming Zhang, He Zhang, Zijun Wei, Zhe Lin, Alan Yuille
Affiliations: Johns Hopkins University, Adobe Inc.
Abstract: Despite the impressive representation capacity of vision transformer models, current light-weight vision transformer models still suffer from inconsistent and incorrect dense predictions at local regions. We suspect that the power of their self-attention mechanism is limited in shallower and thinner networks. We propose Lite Vision Transformer (LVT), a novel light-weight transformer network with two enhanced self-attention mechanisms to improve the model performances for mobile deployment. For the low-level features, we introduce Convolutional Self-Attention (CSA). Unlike previous approaches of merging convolution and self-attention, CSA introduces local self-attention into the convolution within a kernel of size 3x3 to enrich low-level features in the first stage of LVT. For the high-level features, we propose Recursive Atrous Self-Attention (RASA), which utilizes the multi-scale context when calculating the similarity map and a recursive mechanism to increase the representation capability with marginal extra parameter cost. The superiority of LVT is demonstrated on ImageNet recognition, ADE20K semantic segmentation, and COCO panoptic segmentation. The code is made publicly available.

Detection (6 papers)

【1】 Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2021
Link: https://arxiv.org/abs/2112.11384

Authors: Pierre-Etienne Martin, Jordan Calandre, Boris Mansencal, Jenny Benois-Pineau, Renaud Péteri, Laurent Mascarilla, Julien Morlier
Affiliations: CCP Department, Max Planck Institute for Evolutionary Anthropology, D-, Leipzig, Germany; MIA, La Rochelle University, La Rochelle, France; Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, Talence, France; IMS, University of Bordeaux, Talence, France
Note: MediaEval 2021, Dec 2021, Online, Germany
Abstract: Sports video analysis is a prevalent research topic due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests up to analysis of athletes' performance. The Sports Video task is part of the MediaEval 2021 benchmark. This task tackles fine-grained action detection and classification from videos. The focus is on recordings of table tennis games. Running since 2019, the task has offered a classification challenge from untrimmed video recorded in natural conditions with known temporal boundaries for each stroke. This year, the dataset is extended and offers, in addition, a detection challenge from untrimmed videos without annotations. This work aims at creating tools for sports coaches and players in order to analyze sports performance. Movement analysis and player profiling may be built upon such technology to enrich the training experience of athletes and improve their performance.

【2】 Contrastive Object Detection Using Knowledge Graph Embeddings
Link: https://arxiv.org/abs/2112.11366

Authors: Christopher Lang, Alexander Braun, Abhinav Valada
Affiliations: Robert Bosch GmbH, University of Freiburg
Abstract: Object recognition for the most part has been approached as a one-hot problem that treats classes to be discrete and unrelated. Each image region has to be assigned to one member of a set of objects, including a background class, disregarding any similarities in the object types. In this work, we compare the error statistics of the class embeddings learned from a one-hot approach with semantically structured embeddings from natural language processing or knowledge graphs that are widely applied in open world object detection. Extensive experimental results on multiple knowledge-embeddings as well as distance metrics indicate that knowledge-based class representations result in more semantically grounded misclassifications while performing on par compared to one-hot methods on the challenging COCO and Cityscapes object detection benchmarks. We generalize our findings to multiple object detection architectures by proposing a knowledge-embedded design for keypoint-based and transformer-based object detection architectures.
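The contrast between one-hot classes and semantically structured class embeddings can be illustrated with a toy nearest-embedding classifier. The three classes, their embedding vectors, and the choice of cosine similarity are all assumptions made for this sketch; the paper evaluates several real knowledge embeddings and distance metrics that are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(feature, class_embeddings):
    """Assign the class whose embedding is most similar to the feature."""
    return max(class_embeddings,
               key=lambda name: cosine(feature, class_embeddings[name]))

# One-hot targets: every pair of classes is equally unrelated.
one_hot = {"car": [1, 0, 0], "truck": [0, 1, 0], "dog": [0, 0, 1]}
# Hypothetical semantic embeddings: car and truck lie close together.
semantic = {
    "car": [0.9, 0.4, 0.0],
    "truck": [0.8, 0.6, 0.0],
    "dog": [0.0, 0.1, 1.0],
}

car_truck = cosine(semantic["car"], semantic["truck"])
car_dog = cosine(semantic["car"], semantic["dog"])
predicted = classify([0.9, 0.3, 0.1], semantic)
```

In the semantic space, a feature that narrowly misses "car" is far more likely to land on "truck" than on "dog", which is the sense in which misclassifications become more semantically grounded.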

【3】 Review of Face Presentation Attack Detection Competitions
Link: https://arxiv.org/abs/2112.11290

Authors: Zitong Yu, Jukka Komulainen, Xiaobai Li, Guoying Zhao
Affiliation: University of Oulu
Note: Handbook of Biometric Anti-Spoofing (3rd Ed.)
Abstract: Face presentation attack detection (PAD) has received increasing attention ever since the vulnerabilities to spoofing have been widely recognized. The state of the art in unimodal and multi-modal face anti-spoofing has been assessed in eight international competitions organized in conjunction with major biometrics and computer vision conferences in 2011, 2013, 2017, 2019, 2020 and 2021, each introducing new challenges to the research community. In this chapter, we present the design and results of the five latest competitions from 2019 until 2021. The first two challenges aimed to evaluate the effectiveness of face PAD in multi-modal setup introducing near-infrared (NIR) and depth modalities in addition to colour camera data, while the latest three competitions focused on evaluating domain and attack type generalization abilities of face PAD algorithms operating on conventional colour images and videos. We also discuss the lessons learnt from the competitions and future challenges in the field in general.

【4】 Projected Sliced Wasserstein Autoencoder-based Hyperspectral Images Anomaly Detection
Link: https://arxiv.org/abs/2112.11243

Authors: Yurong Chen, Hui Zhang, Yaonan Wang, Q. M. Jonathan Wu, Yimin Yang
Affiliation: Department of Electrical and Computer Engineering, University of Windsor (Q. M. Jonathan Wu)
Abstract: Anomaly detection refers to identifying the observation that deviates from the normal pattern, which has been an active research area in various domains. Recently, the increasing data scale, complexity, and dimension have made traditional representation-based and statistical outlier detection methods challenging. In this paper, we leverage the generative model in hyperspectral images anomaly detection. The gist is to model the distribution of the normal data, while the out-of-distribution sample can be viewed as the outlier. At first, the variational inference-based anomaly detection methods are investigated. We theoretically and empirically find that they are unstable due to the strong notion of distance ($f$-divergence) serving as the regularization. Secondly, this paper introduces sliced Wasserstein distance, which is a weaker distribution measure compared with $f$-divergence. However, the number of random slices makes it difficult to estimate the true distance. In the end, we propose a projected sliced Wasserstein (PSW) autoencoder-based anomaly screening method. In particular, we leverage a computation-friendly eigen-decomposition method to find the principal component when slicing the high-dimensional data. Furthermore, our proposed distance can be calculated in closed form, even if the prior distribution is not Gaussian. Comprehensive experiments conducted on various real-world hyperspectral anomaly detection benchmarks demonstrate the superior performance of our proposed method.
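For readers unfamiliar with the sliced Wasserstein distance mentioned above, the sketch below shows the plain Monte Carlo estimator: project both samples onto random unit directions, compute the 1-D Wasserstein distance on each projection by sorting, and average. This is the random-slicing baseline whose slice count the abstract describes as a difficulty, not the paper's projected (PSW) variant, which chooses directions by eigen-decomposition. Function names, the slice count, and the toy data are assumptions.

```python
import math
import random

def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between two equal-size samples."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def sliced_wasserstein(X, Y, num_slices=50, seed=0):
    """Monte Carlo sliced Wasserstein distance between point sets X and Y.

    X and Y are lists of equal length holding vectors of equal dimension.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(num_slices):
        # Draw a random direction and normalize it to unit length.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both samples onto the direction (one "slice").
        proj_x = [sum(a * d for a, d in zip(x, direction)) for x in X]
        proj_y = [sum(a * d for a, d in zip(y, direction)) for y in Y]
        total += wasserstein_1d(proj_x, proj_y)
    return total / num_slices

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shifted = [[a + 3.0, b + 3.0] for a, b in X]
same = sliced_wasserstein(X, X)
moved = sliced_wasserstein(X, shifted)
```

The estimate is zero for identical samples and grows as the second sample moves away; with few slices it is noisy, which motivates choosing informative directions rather than random ones.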

【5】 EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection
Link: https://arxiv.org/abs/2112.11088

Authors: Zhe Liu, Tengteng Huang, Bingling Li, Xiwu Chen, Xi Wang, Xiang Bai
Abstract: Recently, fusing the LiDAR point cloud and camera image to improve the performance and robustness of 3D object detection has received more and more attention, as these two modalities naturally possess strong complementarity. In this paper, we propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion (CB-Fusion) module and a Multi-Modal Consistency (MC) loss. More concretely, the proposed CB-Fusion module boosts the plentiful semantic information of point features with the image features in a cascade bi-directional interaction fusion manner, leading to more comprehensive and discriminative feature representations. The MC loss explicitly guarantees the consistency between predicted scores from two modalities to obtain more comprehensive and reliable confidence scores. The experiment results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over the state-of-the-art methods. Besides, we emphasize a critical but easily overlooked problem, which is to explore the performance and robustness of a 3D detector in a sparser scene. Extensive experiments present that EPNet++ outperforms the existing SOTA methods with remarkable margins in highly sparse point cloud cases, which might be an available direction to reduce the expensive cost of LiDAR sensors. Code will be released in the future.

【6】 Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion
Link: https://arxiv.org/abs/2112.10936

Authors: Shruti Agarwal, Liwen Hu, Evonne Ng, Trevor Darrell, Hao Li, Anna Rohrbach
Affiliations: UC Berkeley, Pinscreen
Abstract: In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques. Such falsifications range from cheapfakes (e.g., lookalikes or audio dubbing) to deepfakes (e.g., sophisticated AI media synthesis methods), which are becoming perceptually indistinguishable from real videos. To tackle this challenge, we propose a multi-modal semantic forensic approach to discover clues that go beyond detecting discrepancies in visual quality, thereby handling both simpler cheapfakes and visually persuasive deepfakes. In this work, our goal is to verify that the purported person seen in the video is indeed themselves by detecting anomalous correspondences between their facial movements and the words they are saying. We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others. We use interpretable Action Units (AUs) to capture a person's face and head movement as opposed to deep CNN visual features, and we are the first to use word-conditioned facial motion analysis. Unlike existing person-specific approaches, our method is also effective against attacks that focus on lip manipulation. We further demonstrate our method's effectiveness on a range of fakes not seen in training, including those without video manipulation, that were not addressed in prior work.

Classification | Recognition (5 papers)

【1】 Unsupervised deep learning techniques for powdery mildew recognition based on multispectral imaging
Link: https://arxiv.org/abs/2112.11242

Authors: Alessandro Benfenati, Paola Causin, Roberto Oberti, Giovanni Stefanello
Affiliations: Dept. of Environmental Science and Policy, Universita degli Studi di Milano, Milano; Dept. of Mathematics, Universita degli Studi di Milano, Milano, Italy; Dept. of Agricultural and Environmental Sciences - Production, Landscape
Abstract: Objectives. Sustainable management of plant diseases is an open challenge which has relevant economic and environmental impact. Optimal strategies rely on human expertise for field scouting under favourable conditions to assess the current presence and extent of disease symptoms. This labor-intensive task is complicated by the large field area to be scouted, combined with the millimeter-scale size of the early symptoms to be detected. In view of this, image-based detection of early disease symptoms is an attractive approach to automate this process, enabling a potential high throughput monitoring at sustainable costs. Methods. Deep learning has been successfully applied in various domains to obtain an automatic selection of the relevant image features by learning filters via a training procedure. Deep learning has recently entered also the domain of plant disease detection: following this idea, in this work we present a deep learning approach to automatically recognize powdery mildew on cucumber leaves. We focus on unsupervised deep learning techniques applied to multispectral imaging data and we propose the use of autoencoder architectures to investigate two strategies for disease detection: i) clusterization of features in a compressed space; ii) anomaly detection. Results. The two proposed approaches have been assessed by quantitative indices. The clusterization approach is not fully capable by itself to provide accurate predictions but it does cater relevant information. Anomaly detection has instead a significant potential of resolution which could be further exploited as a prior for supervised architectures with a very limited number of labeled samples.

【2】 Attention-Based Sensor Fusion for Human Activity Recognition Using IMU Signals
Link: https://arxiv.org/abs/2112.11224

Authors: Wenjin Tao, Haodong Chen, Md Moniruzzaman, Ming C. Leu, Zhaozheng Yi, Ruwen Qin
Affiliations: Department of Mechanical and Aerospace Engineering, Missouri University of Science and Technology, Rolla, MO, USA; Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
Abstract: Human Activity Recognition (HAR) using wearable devices such as smart watches embedded with Inertial Measurement Unit (IMU) sensors has various applications relevant to our daily life, such as workout tracking and health monitoring. In this paper, we propose a novel attention-based approach to human activity recognition using multiple IMU sensors worn at different body locations. Firstly, a sensor-wise feature extraction module is designed to extract the most discriminative features from individual sensors with Convolutional Neural Networks (CNNs). Secondly, an attention-based fusion mechanism is developed to learn the importance of sensors at different body locations and to generate an attentive feature representation. Finally, an inter-sensor feature extraction module is applied to learn the inter-sensor correlations, which are connected to a classifier to output the predicted classes of activities. The proposed approach is evaluated using five public datasets and it outperforms state-of-the-art methods on a wide variety of activity categories.
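The second step described above, learning the importance of each sensor and forming an attentive representation, can be sketched as softmax attention over per-sensor feature vectors. The scoring vector, feature values, and function names below are hypothetical; in the paper the features come from CNNs and the attention parameters are learned, neither of which is modeled here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(sensor_features, scoring_weights):
    """Attention-weighted sum of per-sensor feature vectors.

    sensor_features: one feature vector per IMU sensor.
    scoring_weights: a (hypothetical) vector that scores each sensor's
    feature by dot product; a trained model would learn it.
    """
    scores = [sum(w * f for w, f in zip(scoring_weights, feat))
              for feat in sensor_features]
    weights = softmax(scores)
    dim = len(sensor_features[0])
    fused = [sum(w * feat[i] for w, feat in zip(weights, sensor_features))
             for i in range(dim)]
    return fused, weights

# Three sensors (e.g. wrist, ankle, chest) with 2-D toy features.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fused, weights = fuse(features, scoring_weights=[0.5, 0.5])
```

The attention weights sum to one, so the fused vector stays on the same scale as the inputs while emphasizing the sensor with the highest score.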

【3】 Expansion-Squeeze-Excitation Fusion Network for Elderly Activity Recognition
Link: https://arxiv.org/abs/2112.10992

Authors: Xiangbo Shu, Jiawen Yang, Rui Yan, Yan Song
Affiliation: School of Computer Science and Engineering, Nanjing University of Science and Technology
Abstract: This work focuses on the task of elderly activity recognition, which is a challenging task due to the existence of individual actions and human-object interactions in elderly activities. Thus, we attempt to effectively aggregate the discriminative information of actions and interactions from both RGB videos and skeleton sequences by attentively fusing multi-modal features. Recently, some nonlinear multi-modal fusion approaches are proposed by utilizing nonlinear attention mechanism that is extended from Squeeze-and-Excitation Networks (SENet). Inspired by this, we propose a novel Expansion-Squeeze-Excitation Fusion Network (ESE-FN) to effectively address the problem of elderly activity recognition, which learns modal and channel-wise Expansion-Squeeze-Excitation (ESE) attentions for attentively fusing the multi-modal features in the modal and channel-wise ways. Furthermore, we design a new Multi-modal Loss (ML) to keep the consistency between the single-modal features and the fused multi-modal features by adding the penalty of difference between the minimum prediction losses on single modalities and the prediction loss on the fused modality. Finally, we conduct experiments on a largest-scale elderly activity dataset, i.e., ETRI-Activity3D (including 110,000+ videos, and 50+ categories), to demonstrate that the proposed ESE-FN achieves the best accuracy compared with the state-of-the-art methods. In addition, more extensive experimental results show that the proposed ESE-FN is also comparable to the other methods in terms of normal action recognition task.
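The Multi-modal Loss is described only in words, so the following is one possible reading rather than the paper's formula: the fused-modality loss plus a weighted penalty on its gap to the smallest single-modality loss. The absolute-value form, the `penalty_weight` parameter, and the function name are assumptions.

```python
def multimodal_loss(single_losses, fused_loss, penalty_weight=1.0):
    """Fused loss plus a consistency penalty (assumed form).

    single_losses: prediction losses of each single modality
                   (e.g. RGB and skeleton).
    fused_loss:    prediction loss on the fused representation.
    The penalty is the absolute gap between the best single-modality
    loss and the fused loss; the paper may define it differently.
    """
    gap = abs(min(single_losses) - fused_loss)
    return fused_loss + penalty_weight * gap

# RGB loss 0.7, skeleton loss 0.4, fused loss 0.5 (toy numbers).
loss = multimodal_loss([0.7, 0.4], fused_loss=0.5)
```

Under this reading the penalty vanishes when the fused prediction is exactly as good as the best single modality, which keeps the fused branch consistent with its inputs.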

【4】 Task-Oriented Image Transmission for Scene Classification in Unmanned Aerial Systems
Link: https://arxiv.org/abs/2112.10948

Authors: Xu Kang, Bin Song, Jie Guo, Zhijin Qin, F. Richard Yu
Abstract: The vigorous developments of Internet of Things make it possible to extend its computing and storage capabilities to computing tasks in the aerial system with collaboration of cloud and edge, especially for artificial intelligence (AI) tasks based on deep learning (DL). Collecting a large amount of image/video data, unmanned aerial vehicles (UAVs) can only hand over intelligent analysis tasks to the back-end mobile edge computing (MEC) server due to their limited storage and computing capabilities. How to efficiently transmit the most correlated information for the AI model is a challenging topic. Inspired by the task-oriented communication in recent years, we propose a new aerial image transmission paradigm for the scene classification task. A lightweight model is developed on the front-end UAV for semantic blocks transmission with perception of images and channel conditions. In order to achieve the tradeoff between transmission latency and classification accuracy, deep reinforcement learning (DRL) is used to explore the semantic blocks which have the best contribution to the back-end classifier under various channel conditions. Experimental results show that the proposed method can significantly improve classification accuracy compared to the fixed transmission strategy and traditional content perception methods.

【5】 Structured Semantic Transfer for Multi-Label Recognition with Partial Labels
Link: https://arxiv.org/abs/2112.10941

Authors: Tianshui Chen, Tao Pu, Hefeng Wu, Yuan Xie, Liang Lin
Affiliations: Guangdong University of Technology, Sun Yat-Sen University
Note: Accepted by AAAI 2022
Abstract: Multi-label image recognition is a fundamental yet practical task because real-world images inherently possess multiple semantic labels. However, it is difficult to collect large-scale multi-label annotations due to the complexity of both the input images and output label spaces. To reduce the annotation cost, we propose a structured semantic transfer (SST) framework that enables training multi-label recognition models with partial labels, i.e., merely some labels are known while other labels are missing (also called unknown labels) per image. The framework consists of two complementary transfer modules that explore within-image and cross-image semantic correlations to transfer knowledge of known labels to generate pseudo labels for unknown labels. Specifically, an intra-image semantic transfer module learns image-specific label co-occurrence matrix and maps the known labels to complement unknown labels based on this matrix. Meanwhile, a cross-image transfer module learns category-specific feature similarities and helps complement unknown labels with high similarities. Finally, both known and generated labels are used to train the multi-label recognition models. Extensive experiments on the Microsoft COCO, Visual Genome and Pascal VOC datasets show that the proposed SST framework obtains superior performance over current state-of-the-art algorithms. Codes are available at https://github.com/HCPLab-SYSU/SST-MLR-PL.
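The intra-image transfer idea, mapping known labels to unknown ones through a label co-occurrence matrix, can be sketched with a fixed matrix and a threshold. In the paper the matrix is learned and image-specific; the matrix values, the threshold, and the `complement_labels` helper here are hypothetical simplifications.

```python
def complement_labels(labels, cooccurrence, threshold=0.5):
    """Fill unknown labels with pseudo labels from known positives.

    labels: list with 1 (present), 0 (absent), or None (unknown).
    cooccurrence[i][j]: assumed probability that label j is present
    given that label i is present.
    Unknown labels that gather enough evidence become pseudo positives;
    the rest stay unknown.
    """
    completed = list(labels)
    for j, value in enumerate(labels):
        if value is not None:
            continue  # known labels are never overwritten
        # Strongest evidence from any known-positive label i.
        evidence = max(
            (cooccurrence[i][j] for i, v in enumerate(labels) if v == 1),
            default=0.0,
        )
        if evidence >= threshold:
            completed[j] = 1
    return completed

# Toy label set: [person, bicycle, boat, dog]; only "person" and "dog" known.
labels = [1, None, None, 0]
cooccurrence = [
    [1.0, 0.9, 0.2, 0.1],
    [0.8, 1.0, 0.0, 0.1],
    [0.3, 0.0, 1.0, 0.0],
    [0.4, 0.1, 0.0, 1.0],
]
completed = complement_labels(labels, cooccurrence)
```

The known positive "person" promotes "bicycle" to a pseudo label, while "boat" stays unknown because its co-occurrence evidence falls below the threshold.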

Segmentation | Semantics (7 papers)

【1】 Generalizable Cross-modality Medical Image Segmentation via Style Augmentation and Dual Normalization
Link: https://arxiv.org/abs/2112.11177

Authors: Ziqi Zhou, Lei Qi, Xin Yang, Dong Ni, Yinghuan Shi
Affiliations: Nanjing University, Southeast University, Shenzhen University
Abstract: For medical image segmentation, imagine if a model was only trained using MR images in source domain, how about its performance to directly segment CT images in target domain? This setting, namely generalizable cross-modality segmentation, owning its clinical potential, is much more challenging than other related settings, e.g., domain adaptation. To achieve this goal, we in this paper propose a novel dual-normalization module by leveraging the augmented source-similar and source-dissimilar images during our generalizable segmentation. To be specific, given a single source domain, aiming to simulate the possible appearance change in unseen target domains, we first utilize a nonlinear transformation to augment source-similar and source-dissimilar images. Then, to sufficiently exploit these two types of augmentations, our proposed dual-normalization based model employs a shared backbone yet independent batch normalization layer for separate normalization. Afterwards, we put forward a style-based selection scheme to automatically choose the appropriate path in the test stage. Extensive experiments on three publicly available datasets, i.e., BraTS, Cross-Modality Cardiac and Abdominal Multi-Organ dataset, have demonstrated that our method outperforms other state-of-the-art domain generalization methods.

【2】 Generalized Few-Shot Semantic Segmentation: All You Need is Fine-Tuning
Link: https://arxiv.org/abs/2112.10982

Authors: Josh Myers-Dean, Yinan Zhao, Brian Price, Scott Cohen, Danna Gurari
Affiliations: University of Colorado, Boulder; University of Texas at Austin; Adobe Research
Note: Includes supplementary materials
Abstract: Generalized few-shot semantic segmentation was introduced to move beyond only evaluating few-shot segmentation models on novel classes to include testing their ability to remember base classes. While all approaches currently are based on meta-learning, they perform poorly and saturate in learning after observing only a few shots. We propose the first fine-tuning solution, and demonstrate that it addresses the saturation problem while achieving state-of-art results on two datasets, PASCAL-$5^i$ and COCO-$20^i$. We also show it outperforms existing methods whether fine-tuning multiple final layers or only the final layer. Finally, we present a triplet loss regularization that shows how to redistribute the balance of performance between novel and base categories so that there is a smaller gap between them.
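For reference, the standard triplet loss that the regularization builds on pulls an anchor toward a positive example and pushes it away from a negative one by at least a margin. The sketch below shows only this generic form with Euclidean distance; how the paper selects anchors, positives, and negatives among novel and base classes is not stated in the abstract, and the margin value is an assumption.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss with Euclidean distance.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Well-separated triplet: positive is near, negative is far.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Violated triplet: the roles of the near and far points are swapped.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Only the violated triplet contributes a gradient, so the regularizer acts on embeddings that are still confused rather than on ones that are already well separated.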

【3】 Nonlinear Transform Source-Channel Coding for Semantic Communications Link: https://arxiv.org/abs/2112.10961

Authors: Jincheng Dai, Sixian Wang, Kailin Tan, Zhongwei Si, Xiaoqi Qin, Kai Niu, Ping Zhang Abstract: In this paper, we propose a new class of highly efficient deep joint source-channel coding methods that can closely adapt to the source distribution under the nonlinear transform; these can be collected under the name nonlinear transform source-channel coding (NTSCC). In the considered model, the transmitter first learns a nonlinear analysis transform to map the source data into a latent space, then transmits the latent representation to the receiver via deep joint source-channel coding. Our model incorporates the nonlinear transform as a strong prior to effectively extract the source semantic features and provide side information for source-channel coding. Unlike existing conventional deep joint source-channel coding methods, the proposed NTSCC essentially learns both the source latent representation and an entropy model as the prior on the latent representation. Accordingly, novel adaptive rate transmission and hyperprior-aided codec refinement mechanisms are developed to upgrade deep joint source-channel coding. The whole system design is formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion performance under established perceptual quality metrics. Across simple example sources and test image sources, we find that the proposed NTSCC transmission method generally outperforms both analog transmission using standard deep joint source-channel coding and classical separation-based digital transmission. Notably, the proposed NTSCC method can potentially support future semantic communications due to its strong content-aware ability.

【4】 One Sketch for All: One-Shot Personalized Sketch Segmentation Link: https://arxiv.org/abs/2112.10838

Authors: Anran Qi, Yulia Gryaditskaya, Tao Xiang, Yi-Zhe Song Affiliations: SketchX Lab, CVSSP, University of Surrey Abstract: We present the first one-shot personalized sketch segmentation method. We aim to segment all sketches belonging to the same category, given a single exemplar sketch with a part annotation, while (i) preserving the part semantics embedded in the exemplar, and (ii) being robust to input style and abstraction. We refer to this scenario as personalized. With that, we enable a much-desired personalization capability for downstream fine-grained sketch analysis tasks. To train a robust segmentation module, we deform the exemplar sketch to each of the available sketches of the same category. Our method generalizes to sketches not observed during training. Our central contribution is a sketch-specific hierarchical deformation network. Given a multi-level sketch-stroke encoding obtained via a graph convolutional network, our method estimates a rigid-body transformation from the reference to the exemplar on the upper level. A finer deformation from the exemplar to the globally warped reference sketch is then obtained through stroke-wise deformations on the lower level. Both levels of deformation are guided by mean squared distances between keypoints learned without supervision, ensuring that the stroke semantics are preserved. We evaluate our method against state-of-the-art segmentation and perceptual grouping baselines re-purposed for the one-shot setting, and against two few-shot 3D shape segmentation methods. We show that our method outperforms all the alternatives by more than 10% on average. Ablation studies further demonstrate that our method is robust to personalization: changes in input part semantics and style differences.

【5】 A novel approach for the automated segmentation and volume quantification of cardiac fats on computed tomography Link: https://arxiv.org/abs/2112.11381

Authors: Érick Oliveira Rodrigues, FFC Morais, NAOS Morais, LS Conci, LV Neto, Aura Conci Affiliations: Department of Internal Medicine and Endocrine Unit, Medical School and Hospital Universitário Clementino Fraga Filho, Universidade Federal do Rio de Janeiro (UFRJ), Cidade Universitária, Brazil; Department of Specialized Medicine Comments: Computer Methods and Programs in Biomedicine, 2016 Abstract: Fat deposits surrounding the heart are correlated with several health risk factors such as atherosclerosis, carotid stiffness, coronary artery calcification, atrial fibrillation, and many others. These deposits vary independently of obesity, which reinforces the case for segmenting them directly for further quantification. However, manual segmentation of these fats has not been widely deployed in clinical practice because of the required human workload and the consequent high cost of physicians and technicians. In this work, we propose a unified method for the autonomous segmentation and quantification of two types of cardiac fats. The segmented fats are termed epicardial and mediastinal, and are separated from each other by the pericardium. Much effort was devoted to achieving minimal user intervention. The proposed methodology mainly comprises registration and classification algorithms to perform the desired segmentation. We compare the performance of several classification algorithms on this task, including neural networks, probabilistic models, and decision tree algorithms. Experimental results of the proposed methodology show that the mean accuracy for both epicardial and mediastinal fats is 98.5% (99.5% if the features are normalized), with a mean true positive rate of 98.0%. On average, the Dice similarity index was 97.6%.
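The Dice similarity index reported in this abstract measures overlap between a predicted mask and a reference mask as 2|A∩B| / (|A| + |B|). A minimal sketch on flattened binary masks follows; the function name `dice_index` and the convention of returning 1.0 for two empty masks are assumptions made here for illustration.

```python
def dice_index(pred, target):
    """Dice similarity index between two flattened binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total

pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_index(pred, target)  # 2 shared pixels, 3 + 3 foreground pixels
```

Here two pixels overlap and each mask has three foreground pixels, so the index is 4/6, roughly 0.667.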

【6】 RC-Net: A Convolutional Neural Network for Retinal Vessel Segmentation Link: https://arxiv.org/abs/2112.11078

Authors: Tariq M Khan, Antonio Robles-Kelly, Syed S. Naqvi Affiliations: School of IT, Deakin University, Waurn Ponds, VIC, Australia; Dept. of Electrical and Computer Eng., COMSATS University Islamabad, Islamabad, Pakistan Abstract: Over recent years, increasingly complex approaches based on sophisticated convolutional neural network architectures have been slowly pushing performance on well-established benchmark datasets. In this paper, we take a step back to examine the real need for such complexity. We present RC-Net, a fully convolutional network in which the number of filters per layer is optimized to reduce feature overlap and complexity. We also use skip connections, and keep spatial information loss to a minimum by keeping the number of pooling operations in the network to a minimum. Two publicly available retinal vessel segmentation datasets were used in our experiments. In these experiments, RC-Net is quite competitive, outperforming alternative vessel segmentation methods while using two or even three orders of magnitude fewer trainable parameters.

【7】 Leveraging Image Complexity in Macro-Level Neural Network Design for Medical Image Segmentation Link: https://arxiv.org/abs/2112.11065

Authors: Tariq M. Khan, Syed S. Naqvi, Erik Meijering Abstract: Recent progress in encoder-decoder neural network architecture design has led to significant performance improvements in a wide range of medical image segmentation tasks. However, state-of-the-art networks for a given task may be too computationally demanding to run on affordable hardware, and thus users often resort to practical workarounds by modifying various macro-level design aspects. Two common examples are downsampling the input images and reducing the network depth to meet computer memory constraints. In this paper we investigate the effects of these changes on segmentation performance and show that image complexity can be used as a guideline in choosing what is best for a given dataset. We consider four statistical measures to quantify image complexity and evaluate their suitability on ten different public datasets. For the purpose of our experiments we also propose two new encoder-decoder architectures representing shallow and deep networks that are more memory efficient than currently popular networks. Our results suggest that median frequency is the best complexity measure for deciding on an acceptable input downsampling factor and network depth. For high-complexity datasets, a shallow network running on the original images may yield better segmentation results than a deep network running on downsampled images, whereas the opposite may be the case for low-complexity images.

Zero/Few Shot|Transfer|Domain Adaptation|Adaptive (3 papers)

【1】 Geometry-Aware Unsupervised Domain Adaptation Link: https://arxiv.org/abs/2112.11041

Authors: You-Wei Luo, Chuan-Xian Ren, Zi-Ying Chen Affiliations: School of Mathematics, Sun Yat-Sen University Abstract: Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain in the presence of dataset shift. Most existing methods cannot address domain alignment and class discrimination well, which may distort the intrinsic data structure for downstream tasks (e.g., classification). To this end, we propose a novel geometry-aware model that learns transferability and discriminability simultaneously via nuclear norm optimization. We introduce domain coherence and class orthogonality for UDA from the perspective of subspace geometry. Domain coherence ensures the model has a larger capacity for learning separable representations, and class orthogonality minimizes the correlation between clusters to alleviate misalignment. The two are therefore consistent and can benefit from each other. In addition, we provide theoretical insight into the norm-based learning literature in UDA, which ensures the interpretability of our model. We show that the norms of domains and clusters are expected to be larger and smaller, respectively, to enhance transferability and discriminability. Extensive experimental results on standard UDA datasets demonstrate the effectiveness of our theory and model.

【2】 Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding Link: https://arxiv.org/abs/2112.10985

Authors: Ziang Li, Kailun Wu, Yiwen Guo, Changshui Zhang Affiliations: Institute for Artificial Intelligence, Tsinghua University (THUAI); Department of Automation, Tsinghua University Abstract: The learned iterative shrinkage thresholding algorithm (LISTA) introduces deep unfolding models with learnable thresholds in some shrinkage functions for sparse coding. Drawing on some theoretical insights, we advocate an error-based thresholding (EBT) mechanism for LISTA, which leverages a function of the layer-wise reconstruction error to suggest an appropriate threshold value for each observation on each layer. We show that the EBT mechanism disentangles the learnable parameters in the shrinkage functions from the reconstruction errors, making them more adaptive to the various observations. With rigorous theoretical analyses, we show that the proposed EBT can lead to faster convergence on the basis of LISTA and its variants, in addition to its higher adaptivity. Extensive experimental results confirm our theoretical analyses and verify the effectiveness of our methods.
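The idea of tying the shrinkage threshold to the current reconstruction error can be sketched with a plain, unlearned ISTA step. This is an illustration under stated assumptions, not the paper's model: the threshold is taken as `rho` times the Euclidean norm of the residual `A x - y`, where `rho` is a hypothetical scalar that would be learned per layer in a LISTA-style network, and the helper names are invented for this example.

```python
import math

def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def residual(A, x, y):
    """Reconstruction residual A x - y."""
    return [ax - yi for ax, yi in zip(matvec(A, x), y)]

def error_based_threshold(A, x, y, rho):
    """Assumed form: threshold proportional to the residual norm."""
    return rho * math.sqrt(sum(r * r for r in residual(A, x, y)))

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping at zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)

def ebt_ista_step(A, x, y, step, rho):
    """One gradient step on 0.5 * ||A x - y||^2 followed by shrinkage."""
    r = residual(A, x, y)
    threshold = error_based_threshold(A, x, y, rho)
    gradient = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]
    return [soft_threshold(xj - step * gj, threshold) for xj, gj in zip(x, gradient)]

A = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.1]
x0 = [0.0, 0.0]
x1 = ebt_ista_step(A, x0, y, step=1.0, rho=0.2)
x2 = ebt_ista_step(A, x1, y, step=1.0, rho=0.2)
```

In this toy run the threshold is large on the first step, when the estimate is far from the observation, and the small coefficient is zeroed out. On the second step the residual has shrunk, so the threshold drops and the estimate moves closer to `y`.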

【3】 Translational Concept Embedding for Generalized Compositional Zero-shot Learning Link: https://arxiv.org/abs/2112.10871

Authors: He Huang, Wei Tang, Jiawei Zhang, Philip S. Yu Affiliations: Department of Computer Science, University of Illinois at Chicago, Chicago, USA; Florida State University, Tallahassee, USA Abstract: Generalized compositional zero-shot learning aims to learn composed concepts of attribute-object pairs in a zero-shot fashion, where a model is trained on a set of seen concepts and tested on a combined set of seen and unseen concepts. This task is very challenging because of not only the gap between seen and unseen concepts but also the contextual dependency between attributes and objects. This paper introduces a new approach, termed translational concept embedding, to address these two difficulties in a unified framework. It models the effect of applying an attribute to an object as adding a translational attribute feature to an object prototype. We explicitly take into account the contextual dependency between attributes and objects by generating translational attribute features conditioned on the object prototypes. Furthermore, we design a ratio variance constraint loss to promote the model's generalization ability on unseen concepts. It regularizes the distances between concepts by utilizing knowledge from their pretrained word embeddings. We evaluate the performance of our model under both the unbiased and biased concept classification tasks, and show that our model achieves a good balance in predicting unseen and seen concepts.

Semi-/Weakly-/Unsupervised|Active Learning|Uncertainty (2 papers)

【1】 Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects Link: https://arxiv.org/abs/2112.11347

Authors: Atsuhiro Noguchi, Umar Iqbal, Jonathan Tremblay, Tatsuya Harada, Orazio Gallo Affiliations: NVIDIA, The University of Tokyo, RIKEN Comments: 15 pages, Project page: this https URL Abstract: Rendering articulated objects while controlling their poses is critical to applications such as virtual reality or animation for movies. Manipulating the pose of an object, however, requires an understanding of its underlying structure, that is, its joints and how they interact with each other. Unfortunately, assuming the structure to be known, as existing methods do, precludes the ability to work on new object categories. We propose to learn both the appearance and the structure of previously unseen articulated objects by observing them move from multiple views, with no additional supervision such as joint annotations or information about the structure. Our insight is that adjacent parts that move relative to each other must be connected by a joint. To leverage this observation, we model the object parts in 3D as ellipsoids, which allows us to identify joints. We combine this explicit representation with an implicit one that compensates for the approximation introduced. We show that our method works for different structures, from quadrupeds, to single-arm robots, to humans.

【2】 ACGNet: Action Complement Graph Network for Weakly-supervised Temporal Action Localization Link: https://arxiv.org/abs/2112.10977

Authors: Zichen Yang, Jie Qin, Di Huang Affiliations: State Key Laboratory of Software Development Environment, Beihang University, Beijing, China; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China Comments: Accepted to AAAI 2022 Abstract: Weakly-supervised temporal action localization (WTAL) in untrimmed videos has emerged as a practical but challenging task since only video-level labels are available. Existing approaches typically leverage off-the-shelf segment-level features, which suffer from spatial incompleteness and temporal incoherence, thus limiting their performance. In this paper, we tackle this problem from a new perspective by enhancing segment-level representations with a simple yet effective graph convolutional network, namely the action complement graph network (ACGNet). It enables the current video segment to perceive spatial-temporal dependencies from other segments that potentially convey complementary clues, implicitly mitigating the negative effects caused by the two issues above. By this means, the segment-level features become more discriminative and robust to spatial-temporal variations, contributing to higher localization accuracy. More importantly, the proposed ACGNet works as a universal module that can be flexibly plugged into different WTAL frameworks while maintaining the end-to-end training fashion. Extensive experiments are conducted on the THUMOS'14 and ActivityNet1.2 benchmarks, where the state-of-the-art results clearly demonstrate the superiority of the proposed approach.

Temporal|Action Recognition|Pose|Video|Motion Estimation (4 papers)

【1】 Implicit Neural Video Compression Link: https://arxiv.org/abs/2112.11312

Authors: Yunfan Zhang, Ties van Rozendaal, Johann Brehmer, Markus Nagel, Taco Cohen Affiliations: Qualcomm AI Research Abstract: We propose a method to compress full-resolution video sequences with implicit neural representations. Each frame is represented as a neural network that maps coordinate positions to pixel values. We use a separate implicit network to modulate the coordinate inputs, which enables efficient motion compensation between frames. Together with a small residual network, this allows us to efficiently compress P-frames relative to the previous frame. We further lower the bitrate by storing the network weights with learned integer quantization. Our method, which we call implicit pixel flow (IPF), offers several simplifications over established neural video codecs: it does not require the receiver to have access to a pretrained neural network, does not use expensive interpolation-based warping operations, and does not require a separate training dataset. We demonstrate the feasibility of neural implicit compression on image and video data.

【2】 PONet: Robust 3D Human Pose Estimation via Learning Orientations Only Link: https://arxiv.org/abs/2112.11153

Authors: Jue Wang, Shaoli Huang, Xinchao Wang, Dacheng Tao Abstract: Conventional 3D human pose estimation relies on first detecting 2D body keypoints and then solving the 2D to 3D correspondence problem. Despite the promising results, this learning paradigm is highly dependent on the quality of the 2D keypoint detector, which is inevitably fragile to occlusions and out-of-image absences. In this paper, we propose a novel Pose Orientation Net (PONet) that is able to robustly estimate 3D pose by learning orientations only, hence bypassing the error-prone keypoint detector in the absence of image evidence. For images with partially invisible limbs, PONet estimates the 3D orientation of these limbs by taking advantage of the local image evidence to recover the 3D pose. Moreover, PONet is able to infer full 3D poses even from images with completely invisible limbs, by exploiting the orientation correlation between visible limbs to complement the estimated poses, further improving the robustness of 3D pose estimation. We evaluate our method on multiple datasets, including Human3.6M, MPII, MPI-INF-3DHP, and 3DPW. Our method achieves results on par with state-of-the-art techniques in ideal settings, yet eliminates the dependency on keypoint detectors and the corresponding computation burden. In highly challenging scenarios, such as truncation and erasing, our method performs very robustly and yields much better results than the state of the art, demonstrating its potential for real-world applications.

【3】 Continuous-Time Video Generation via Learning Motion Dynamics with Neural ODE Link: https://arxiv.org/abs/2112.10960

Authors: Kangyeol Kim, Sunghyun Park, Junsoo Lee, Joonseok Lee, Sookyung Kim, Jaegul Choo, Edward Choi Affiliations: KAIST AI, Korea; Kakao Enterprise; Naver Webtoon; Seoul National University; Google Research, United States; Xerox PARC; Letsur Inc. Comments: 24 pages; Accepted to BMVC 2021 Abstract: In order to perform unconditional video generation, we must learn the distribution of real-world videos. In an effort to synthesize high-quality videos, various studies have attempted to learn a mapping function between noise and videos, including recent efforts to separate the motion distribution from the appearance distribution. Previous methods, however, learn motion dynamics in discretized, fixed-interval timesteps, which is contrary to the continuous nature of the motion of a physical body. In this paper, we propose a novel video generation approach that learns separate distributions for motion and appearance, the former modeled by a neural ODE to learn natural motion dynamics. Specifically, we employ a two-stage approach in which the first stage converts a noise vector to a sequence of keypoints at arbitrary frame rates, and the second stage synthesizes videos based on the given keypoint sequence and the appearance noise vector. Our model not only quantitatively outperforms recent baselines for video generation, but also demonstrates versatile functionality such as dynamic frame rate manipulation and motion transfer between two datasets, thus opening new doors to diverse video generation applications.
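The benefit of continuous-time dynamics, namely that a state can be sampled at arbitrary timestamps rather than at a fixed frame interval, can be sketched with a simple Euler integrator. Everything here is a stand-in: in a neural ODE the function `dynamics` would be a trained network and the solver would usually be adaptive, whereas this example uses a hand-written decay rule and hypothetical helper names `integrate` and `sample_trajectory`.

```python
def integrate(dynamics, state, t_start, t_end, max_step=0.01):
    """Fixed-step Euler integration of d(state)/dt = dynamics(state, t)."""
    t = t_start
    state = list(state)
    while t < t_end - 1e-12:
        h = min(max_step, t_end - t)
        derivative = dynamics(state, t)
        state = [s + h * d for s, d in zip(state, derivative)]
        t += h
    return state

def sample_trajectory(dynamics, initial_state, timestamps):
    """Return the state at each timestamp (assumed increasing, starting after t = 0)."""
    states = []
    state, t_prev = list(initial_state), 0.0
    for t in timestamps:
        state = integrate(dynamics, state, t_prev, t)
        states.append(state)
        t_prev = t
    return states

def decay(state, t):
    """Placeholder dynamics dx/dt = -x; a learned network would go here."""
    return [-s for s in state]

# The same dynamics sampled at two different "frame rates".
coarse = sample_trajectory(decay, [1.0, 2.0], [0.5, 1.0])
fine = sample_trajectory(decay, [1.0, 2.0], [0.25, 0.5, 0.75, 1.0])
```

Both calls describe the same underlying trajectory, so the states at t = 1.0 agree closely even though one sequence has two samples and the other has four.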

【4】 Spatiotemporal Motion Synchronization for Snowboard Big Air Link: https://arxiv.org/abs/2112.10909

Authors: Seiji Matsumura, Dan Mikami, Naoki Saijo, Makio Kashino Affiliations: NTT Communication Science Laboratories, Morinosato Wakamiya, Atsugi-shi, Kanagawa, Japan; Department of Computer Science, Kogakuin University, Nishi-Shinjuku, Shinjuku-ku, Tokyo, Japan Abstract: During training for snowboard big air, one of the most popular winter sports, athletes and coaches extensively shoot and check their jump attempts using a single camera or smartphone. However, when videos are watched sequentially, it is difficult to compare the precise difference in performance between two trials. Therefore, a side-by-side display or overlay of two videos may be helpful for training. To accomplish this, the spatial and temporal alignment of multiple performances must be ensured. In this study, we propose a conventional but plausible solution using existing image processing techniques for snowboard big air training. We conducted interviews with expert snowboarders, who stated that the spatiotemporally aligned videos enabled them to precisely identify slight differences in body movements. The results suggest that the proposed method can be used during snowboard big air training.

Medical (2 papers)

【1】 fMRI Neurofeedback Learning Patterns are Predictive of Personal and Clinical Traits Link: https://arxiv.org/abs/2112.11014

Authors: Rotem Leibovitz, Jhonathan Osin, Lior Wolf, Guy Gurevitch, Talma Hendler Affiliations: School of Computer Science, Tel Aviv University; Sagol Brain Institute, Tel Aviv University; Sagol School of Neuroscience, Tel Aviv University Comments: Under review for MIDL Abstract: We obtain a personal signature of a person's learning progress in a self-neuromodulation task, guided by functional MRI (fMRI). The signature is based on predicting the activity of the amygdala in a second neurofeedback session, given a similar fMRI-derived brain state in the first session. The prediction is made by a deep neural network, which is trained on the entire training cohort of patients. This signal, which is indicative of a person's progress in performing the task of amygdala modulation, is aggregated across multiple prototypical brain states and then classified by a linear classifier into various personal and clinical indications. The predictive power of the obtained signature is stronger than that of previous approaches for obtaining a personal signature from fMRI neurofeedback, and it provides an indication that a person's learning pattern may be used as a diagnostic tool. Our code has been made available, and data would be shared, subject to ethical approvals.

【2】 HarmoFL: Harmonizing Local and Global Drifts in Federated Learning on Heterogeneous Medical Images Link: https://arxiv.org/abs/2112.10775

Authors: Meirui Jiang, Zirui Wang, Qi Dou Affiliations: Department of Computer Science and Engineering, The Chinese University of Hong Kong; School of Biological Science and Medical Engineering, Beihang University Abstract: Having multiple medical institutions collaboratively train a model using federated learning (FL) has become a promising solution for maximizing the potential of data-driven models, yet non-independent and identically distributed (non-iid) data in medical images remains an outstanding challenge in real-world practice. The feature heterogeneity caused by diverse scanners or protocols introduces a drift in the learning process, in both local (client) and global (server) optimization, which harms convergence as well as model performance. Many previous works have attempted to address the non-iid issue by tackling the drift locally or globally, but how to jointly solve the two essentially coupled drifts is still unclear. In this work, we concentrate on handling both local and global drifts and introduce a new harmonizing framework called HarmoFL. First, we propose to mitigate the local update drift by normalizing the amplitudes of images transformed into the frequency domain to mimic a unified imaging setting, in order to generate a harmonized feature space across local clients. Second, based on the harmonized features, we design a client weight perturbation that guides each local model to reach a flat optimum, where a neighborhood of the local optimal solution has a uniformly low loss. Without any extra communication cost, the perturbation helps the global model optimize towards a converged optimal solution by aggregating several local flat optima. We have theoretically analyzed the proposed method and conducted extensive experiments on three medical image classification and segmentation tasks, showing that HarmoFL outperforms a set of recent state-of-the-art methods with promising convergence behavior.
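The amplitude normalization step mentioned in this abstract can be sketched in one dimension. The version below is an assumption-laden illustration: it replaces each signal's amplitude spectrum with the batch average while keeping its own phase, uses a naive O(n²) DFT so that it needs only the standard library, and works on 1-D lists where the paper works on images. The names `dft`, `idft`, and `normalize_amplitudes` are invented for this example.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def normalize_amplitudes(signals):
    """Give every signal the batch-average amplitude spectrum, keeping its own phase."""
    spectra = [dft(s) for s in signals]
    n = len(signals[0])
    mean_amp = [sum(abs(spec[k]) for spec in spectra) / len(spectra) for k in range(n)]
    harmonized = []
    for spec in spectra:
        new_spec = [cmath.rect(mean_amp[k], cmath.phase(spec[k])) for k in range(n)]
        # For real inputs the result is real up to rounding, so keep the real part.
        harmonized.append([value.real for value in idft(new_spec)])
    return harmonized

signals = [[1.0, 2.0, 0.5, 3.0], [4.0, 1.0, 2.0, 0.0]]
harmonized = normalize_amplitudes(signals)
```

After this step the two signals share one amplitude spectrum and differ only in phase, which is the sense in which the sketch mimics a unified acquisition setting. A practical version would use a 2-D FFT library routine on image batches.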

GAN|Adversarial|Attack|Generation (4 papers)

【1】 GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping Link: https://arxiv.org/abs/2112.11454

Authors: Omid Taheri, Vasileios Choutas, Michael J. Black, Dimitrios Tzionas Affiliations: Max Planck Institute for Intelligent Systems, Tübingen, Germany Abstract: Generating digital humans that move realistically has many applications and is widely studied, but existing methods focus on the major limbs of the body, ignoring the hands and head. Hands have been studied separately, but the focus has been on generating realistic static grasps of objects. To synthesize virtual characters that interact with the world, we need to generate full-body motions and realistic hand grasps simultaneously. Both sub-problems are challenging on their own and, together, the state space of poses is significantly larger, the scales of hand and body motions differ, and the whole-body posture and the hand grasp must agree, satisfy physical constraints, and be plausible. Additionally, the head is involved because the avatar must look at the object to interact with it. For the first time, we address the problem of generating full-body, hand, and head motions of an avatar grasping an unknown object. As input, our method, called GOAL, takes a 3D object, its position, and a starting 3D body pose and shape. GOAL outputs a sequence of whole-body poses using two novel networks. First, GNet generates a goal whole-body grasp with a realistic body, head, arm, and hand pose, as well as hand-object contact. Second, MNet generates the motion between the starting and goal pose. This is challenging, as it requires the avatar to walk towards the object with foot-ground contact, orient the head towards it, reach out, and grasp it with a realistic hand pose and hand-object contact. To achieve this, the networks exploit a representation that combines SMPL-X body parameters and 3D vertex offsets. We train and evaluate GOAL, both qualitatively and quantitatively, on the GRAB dataset. Results show that GOAL generalizes well to unseen objects, outperforming baselines. GOAL takes a step towards synthesizing realistic full-body object grasping.

【2】 StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation
Link: https://arxiv.org/abs/2112.11427

Authors: Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, Ira Kemelmacher-Shlizerman
Affiliations: University of Washington; Adobe Research; Stanford University
Notes: Project Webpage: this https URL
Abstract: We introduce a high-resolution, 3D-consistent image and shape generation technique which we call StyleSDF. Our method is trained on single-view RGB data only, and stands on the shoulders of StyleGAN2 for image generation, while solving two main challenges in 3D-aware GANs: 1) high-resolution, view-consistent generation of the RGB images, and 2) detailed 3D shape. We achieve this by merging an SDF-based 3D representation with a style-based 2D generator. Our 3D implicit network renders low-resolution feature maps, from which the style-based network generates view-consistent, 1024x1024 images. Notably, our SDF-based 3D modeling defines detailed 3D surfaces, leading to consistent volume rendering. Our method shows higher quality results compared to the state of the art in terms of visual and geometric quality.

【3】 Generating Photo-realistic Images from LiDAR Point Clouds with Generative Adversarial Networks
Link: https://arxiv.org/abs/2112.11245

Authors: Nuriel Shalom Mor
Affiliations: Darca and Bnei Akiva Schools, Israel; Jonathan Rabinowitz, Ilya Fridburg, Roi Beker, and Liran Zuriel, AI Program Graduates, Hadash Darca School, Israel; Yoav Asnapi and Eliyahu Aicho, AI Program Students, Harel Bnei Akiva School, Israel
Notes: 11 pages, 4 figures
Abstract: We examined the feasibility of using generative adversarial networks (GANs) to generate photo-realistic images from LiDAR point clouds. For this purpose, we created a dataset of point cloud image pairs and trained the GAN to predict photo-realistic images from LiDAR point clouds containing reflectance and distance information. Our models learned how to predict realistic-looking images from just point cloud data, even images with black cars. Black cars are difficult to detect directly from point clouds because of their low level of reflectivity. This approach might be used in the future to perform visual object recognition on photo-realistic images generated from LiDAR point clouds. In addition to the conventional LiDAR system, a second system that generates photo-realistic images from LiDAR point clouds would run simultaneously for visual object recognition in real time. In this way, we might preserve the supremacy of LiDAR and benefit from using photo-realistic images for visual object recognition without the usage of any camera. In addition, this approach could be used to colorize point clouds without the usage of any camera images.

【4】 Pixel-Stega: Generative Image Steganography Based on Autoregressive Models
Link: https://arxiv.org/abs/2112.10945

Authors: Siyu Zhang, Zhongliang Yang, Haoqin Tu, Jinshuai Yang, Yongfeng Huang
Affiliations: University of Chinese Academy of Sciences
Abstract: In this letter, we explore generative image steganography based on autoregressive models. We propose Pixel-Stega, which implements pixel-level information hiding with autoregressive models and an arithmetic coding algorithm. First, one of the autoregressive models, PixelCNN++, is used to produce the explicit conditional probability distribution of each pixel. Second, secret messages are encoded into the selection of pixels through steganographic sampling (stegosampling) based on arithmetic coding. We carried out qualitative and quantitative assessment on gray-scale and colour image datasets. Experimental results show that Pixel-Stega is able to embed secret messages adaptively according to the entropy of the pixels, achieving both high embedding capacity (up to 4.3 bpp) and nearly perfect imperceptibility (about 50% detection accuracy).

Attention (1 paper)

【1】 Learned Queries for Efficient Local Attention
Link: https://arxiv.org/abs/2112.11435

Authors: Moab Arar, Ariel Shamir, Amit H. Bermano
Affiliations: Tel-Aviv University; Reichman University
Abstract: Vision Transformers (ViT) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models locally employ self-attention on non-interleaving windows. This relaxation reduces the complexity to be linear in the input size; however, it limits the cross-window interaction, hurting the model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like convolutions. The key idea behind QnA is to introduce learned queries, which allow fast and efficient implementation. We verify the effectiveness of our layer by incorporating it into a hierarchical vision transformer model. We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models. Finally, our layer scales especially well with window size, requiring up to 10x less memory while being up to 5x faster than existing methods.

Face | Crowd Counting (1 paper)

【1】 Learning Human Motion Prediction via Stochastic Differential Equations
Link: https://arxiv.org/abs/2112.11124

Authors: Kedi Lyu, Zhenguang Liu, Shuang Wu, Haipeng Chen, Xuhong Zhang, Yuyu Yin
Affiliations: Jilin University, Changchun, Jilin, China; Zhejiang University, Hangzhou, Zhejiang, China; Nanyang Technological University, Singapore, Singapore; Hangzhou Dianzi University
Notes: 9 pages, 6 figures
Abstract: Human motion understanding and prediction is an integral aspect of our pursuit of machine intelligence and human-machine interaction systems. Current methods typically pursue a kinematics modeling approach, relying heavily upon prior anatomical knowledge and constraints. However, such an approach is hard to generalize to different skeletal model representations, and also tends to be inadequate in accounting for the dynamic range and complexity of motion, thus hindering predictive accuracy. In this work, we propose a novel approach to modeling the motion prediction problem based on stochastic differential equations and path integrals. The motion profile of each skeletal joint is formulated as a basic stochastic variable and modeled with the Langevin equation. We develop a strategy of employing GANs to simulate path integrals that amounts to optimizing over possible future paths. We conduct experiments on two large benchmark datasets, Human 3.6M and CMU MoCap. Notably, our approach achieves a 12.48% accuracy improvement on average over current state-of-the-art methods.
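
As a rough illustration of what modeling a joint coordinate with a Langevin equation can look like, the sketch below integrates a one-dimensional overdamped Langevin equation, dx = -k (x - target) dt + sigma dW, with the Euler-Maruyama scheme. This is not the authors' formulation: the drift term, the parameters `k`, `sigma`, `target`, and the function name `simulate_langevin` are all assumptions made for this example.

```python
import math
import random


def simulate_langevin(x0, target, k=1.0, sigma=0.1, dt=0.01, steps=1000, seed=0):
    """Euler-Maruyama integration of dx = -k (x - target) dt + sigma dW.

    Hypothetical 1D illustration: x stands for a single joint coordinate and
    the drift pulls it toward `target`. None of these names come from the paper.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    x = x0
    path = [x]
    for _ in range(steps):
        drift = -k * (x - target)
        # Brownian increment: standard normal scaled by sqrt(dt)
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + drift * dt + noise
        path.append(x)
    return path
```

Sampling several paths with different seeds gives a set of possible future trajectories for the same starting state, which is the kind of object a path-integral view reasons over. With `sigma=0.0` the noise vanishes and the path decays deterministically toward `target`.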

Distillation | Knowledge Extraction (1 paper)

【1】 Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix
Link: https://arxiv.org/abs/2112.11447

Authors: Peng Liu
Affiliations: Yunnan University
Notes: 10 pages
Abstract: In multi-modality knowledge distillation research, existing methods mainly focus on learning only the teacher's final output. Thus, there are still deep differences between the teacher network and the student network. It is necessary to force the student network to learn the modality relationship information of the teacher network. To transfer knowledge from teacher to student effectively, a novel modality relation distillation paradigm is adopted that models the relationship information among different modalities, that is, learning the teacher's modality-level Gram matrix.
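
A minimal sketch of a modality-level Gram matrix and a matching loss follows. The abstract does not specify how the features or the loss are defined, so this example assumes one feature vector per modality and a mean squared difference between the teacher's and the student's Gram matrices; the function names are hypothetical.

```python
def gram_matrix(features):
    """Return G where G[i][j] is the dot product of the features of modality i and j.

    `features` is a list with one feature vector (list of floats) per modality.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]


def gram_distillation_loss(teacher_feats, student_feats):
    """Mean squared difference between teacher and student Gram matrices.

    Assumed loss for illustration; both networks must see the same number of modalities.
    """
    g_t = gram_matrix(teacher_feats)
    g_s = gram_matrix(student_feats)
    m = len(g_t)
    return sum((g_t[i][j] - g_s[i][j]) ** 2 for i in range(m) for j in range(m)) / (m * m)
```

Because each Gram matrix has one row and column per modality, the teacher and student feature vectors may have different dimensions; only the number of modalities has to agree.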

Super-Resolution | Denoising | Deblurring | Dehazing (2 papers)

【1】 Can We Use Neural Regularization to Solve Depth Super-Resolution?
Link: https://arxiv.org/abs/2112.11085

Authors: Milena Gazdieva, Oleg Voynov, Alexey Artemov, Youyi Zheng, Luiz Velho, Evgeny Burnaev
Affiliations: Skolkovo Institute of Science and Technology, Moscow, Russia; State Key Lab, Zhejiang University, Hangzhou, China; Instituto Nacional de Matemática Pura e Aplicada, Rio de Janeiro, Brazil
Notes: 9 pages
Abstract: Depth maps captured with commodity sensors often require super-resolution to be used in applications. In this work we study a super-resolution approach based on a variational problem statement with Tikhonov regularization where the regularizer is parametrized with a deep neural network. This approach was previously applied successfully in photoacoustic tomography. We experimentally show that its application to depth map super-resolution is difficult, and suggest possible reasons why.

【2】 Point spread function estimation for blind image deblurring problems based on framelet transform
Link: https://arxiv.org/abs/2112.11004

Authors: Reza Parvaz
Affiliations: Department of Mathematics, University of Mohaghegh Ardabili, Ardabil, Iran
Abstract: One of the most important problems in image processing is approximating an image that has been degraded by a blurring process. Such problems are divided into non-blind and blind problems. The second type is computationally more complex than the first because both the original image and the point spread function are unknown. In the present paper, a coarse-to-fine iterative algorithm based on $l_0-\alpha l_1$ regularization and the framelet transform is introduced to estimate the point spread function. The framelet transform improves the restored kernel by decomposing the kernel into different frequencies. In addition, the proposed model uses a fractional gradient operator instead of the ordinary gradient operator. The proposed method is investigated on different kinds of images, such as text, face, and natural images. The output of the proposed method reflects the effectiveness of the proposed algorithm in restoring images from blind problems.

Point Cloud | SLAM | Radar | Laser | Depth and RGB-D (4 papers)

【1】 Deep Learning Based 3D Point Cloud Regression for Estimating Forest Biomass
Link: https://arxiv.org/abs/2112.11335

Authors: Stefan Oehmcke, Lei Li, Jaime Revenga, Thomas Nord-Larsen, Katerina Trepekli, Fabian Gieseke, Christian Igel
Affiliations: University of Copenhagen; University of Münster
Abstract: Knowledge of forest biomass stocks and their development is important for implementing effective climate change mitigation measures. It is needed for studying the processes driving af-, re-, and deforestation and is a prerequisite for carbon-accounting. Remote sensing using airborne LiDAR can be used to measure vegetation biomass at large scale. We present deep learning systems for predicting wood volume, above-ground biomass (AGB), and subsequently carbon directly from 3D LiDAR point cloud data. We devise different neural network architectures for point cloud regression and evaluate them on remote sensing data of areas for which AGB estimates have been obtained from field measurements in a national forest inventory. Our adaptation of Minkowski convolutional neural networks for regression gave the best results. The deep neural networks produced significantly more accurate wood volume, AGB, and carbon estimates compared to state-of-the-art approaches operating on basic statistics of the point clouds, and we expect this finding to have a strong impact on LiDAR-based analyses of terrestrial ecosystem dynamics.

【2】 High-Fidelity Point Cloud Completion with Low-Resolution Recovery and Noise-Aware Upsampling
Link: https://arxiv.org/abs/2112.11271

Authors: Ren-Wu Li, Bo Wang, Chun-Peng Li, Ling-Xiao Zhang, Lin Gao
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tencent America
Abstract: Completing an unordered partial point cloud is a challenging task. Existing approaches that rely on decoding a latent feature to recover the complete shape often lead to the completed point cloud being over-smoothed, losing details, and noisy. Instead of decoding a whole shape, we propose to decode and refine a low-resolution (low-res) point cloud first, and then perform a patch-wise noise-aware upsampling rather than interpolating the whole sparse point cloud at once, which tends to lose details. Because the initially decoded low-res point cloud may lack details, we propose an iterative refinement to recover the geometric details and a symmetrization process to preserve the trustworthy information from the input partial point cloud. After obtaining a sparse and complete point cloud, we propose a patch-wise upsampling strategy. Unlike decoding a whole shape, patch-based upsampling allows better recovery of fine details; however, existing upsampling methods are not applicable to the completion task due to the data discrepancy (i.e., the input sparse data here is not from the ground truth). Therefore, we propose a patch extraction approach to generate training patch pairs between the sparse and ground-truth point clouds, and an outlier removal step to suppress the noisy points from the sparse point cloud. Together with the low-res recovery, our whole method is able to achieve high-fidelity point cloud completion. Comprehensive evaluations are provided to demonstrate the effectiveness of the proposed method and its individual components.

【3】 PointCaps: Raw Point Cloud Processing using Capsule Networks with Euclidean Distance Routing
Link: https://arxiv.org/abs/2112.11258

Authors: Dishanika Denipitiyage, Vinoj Jayasundara, Ranga Rodrigo, Chamira U. S. Edussooriya
Affiliations: Department of Electronic and Telecommunication Engineering, University of Moratuwa, Sri Lanka; Department of Electrical and Computer Engineering, Florida International University, Miami, FL, USA
Abstract: Raw point cloud processing using capsule networks is widely adopted in classification, reconstruction, and segmentation due to its ability to preserve spatial agreement of the input data. However, most of the existing capsule-based network approaches are computationally heavy and fail at representing the entire point cloud as a single capsule. We address these limitations in existing capsule-network-based approaches by proposing PointCaps, a novel convolutional capsule architecture with parameter sharing. Along with PointCaps, we propose a novel Euclidean distance routing algorithm and a class-independent latent representation. The latent representation captures physically interpretable geometric parameters of the point cloud; with dynamic Euclidean routing, PointCaps represents the spatial (point-to-part) relationships of points well. PointCaps has a significantly lower number of parameters and requires a significantly lower number of FLOPs while achieving better reconstruction with comparable classification and segmentation accuracy for raw point clouds compared to state-of-the-art capsule networks.

【4】 Efficient Registration of Forest Point Clouds by Global Matching of Relative Stem Positions
Link: https://arxiv.org/abs/2112.11121

Authors: Xufei Wang, Zexin Yang, Xiaojun Cheng, Jantien Stoter, Wenbin Xu, Zhenlun Wu, Liangliang Nan
Abstract: Registering point clouds of forest environments is an essential prerequisite for LiDAR applications in precision forestry. State-of-the-art methods for forest point cloud registration require the extraction of individual tree attributes, and they have an efficiency bottleneck when dealing with point clouds of real-world forests with dense trees. We propose an automatic, robust, and efficient method for the registration of forest point clouds. Our approach first locates tree stems from raw point clouds and then matches the stems based on their relative spatial relationship to determine the registration transformation. In contrast to existing methods, our algorithm requires no extra individual tree attributes and has linear complexity in the number of trees in the environment, allowing it to align point clouds of large forest environments. Extensive experiments have revealed that our method is superior to the state-of-the-art methods regarding registration accuracy and robustness, and it significantly outperforms existing techniques in terms of efficiency. In addition, we introduce a new benchmark dataset that complements the very few existing open datasets for the development and evaluation of registration methods for forest point clouds.

Multimodal (1 paper)

【1】 Hateful Memes Challenge: An Enhanced Multimodal Framework
Link: https://arxiv.org/abs/2112.11244

Authors: Aijing Gao, Bingjun Wang, Jiaqi Yin, Yating Tian
Affiliations: Georgia Institute of Technology
Abstract: The Hateful Memes Challenge proposed by Facebook AI has attracted contestants around the world. The challenge focuses on detecting hateful speech in multimodal memes. Various state-of-the-art deep learning models have been applied to this problem, and the performance on the challenge's leaderboard has been constantly improving. In this paper, we enhance the hateful meme detection framework, including utilizing Detectron for feature extraction, exploring different setups of VisualBERT and UNITER models with different loss functions, researching the association between the hateful memes and the sensitive text features, and finally building an ensemble method to boost model performance. The AUROC of our fine-tuned VisualBERT, UNITER, and ensemble method reaches 0.765, 0.790, and 0.803 on the challenge's test set, respectively, which beats the baseline models. Our code is available at https://github.com/yatingtian/hateful-meme
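
The abstract reports AUROC for single models and for an ensemble but does not say how the ensemble combines its members. The sketch below assumes the simplest option, an equal-weight average of per-example scores, and computes AUROC directly from its definition as the probability that a positive example outscores a negative one. The function names and the toy scores are illustrative only and do not come from the paper or its repository.

```python
def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_ensemble(score_lists):
    """Average per-example scores across models (equal weights are an assumption)."""
    return [sum(col) / len(col) for col in zip(*score_lists)]


# Toy data: two hypothetical models that err on different examples.
labels = [1, 0, 1, 0]
model_a = [0.9, 0.8, 0.4, 0.3]
model_b = [0.6, 0.1, 0.7, 0.65]
ensemble = average_ensemble([model_a, model_b])
```

In this toy case each model ranks one positive below one negative, while the averaged scores rank every positive above every negative, which is the usual motivation for ensembling models whose errors differ. The pairwise loop is quadratic in the number of examples, so a rank-based implementation would be preferable for a real test set.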

3D | 3D Reconstruction and Related (1 paper)

【1】 Cloud Sphere: A 3D Shape Representation via Progressive Deformation
Link: https://arxiv.org/abs/2112.11133

Authors: Zongji Wang, Yunfei Liu, Feng Lu
Affiliations: Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences; State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering
Notes: This paper was submitted first in CVPR 2021 (paper id: 2255), and then was submitted in CVM 2022 (id: 160)
Abstract: In the area of 3D shape analysis, the geometric properties of a shape have long been studied. Instead of directly extracting representative features using expert-designed descriptors or end-to-end deep neural networks, this paper is dedicated to discovering distinctive information from the shape formation process. Concretely, a spherical point cloud serving as the template is progressively deformed to fit the target shape in a coarse-to-fine manner. During the shape formation process, several checkpoints are inserted to facilitate recording and investigating the intermediate stages. For each stage, the offset field is evaluated as a stage-aware description. The summation of the offsets throughout the shape formation process can completely define the target shape in terms of geometry. From this perspective, one can derive the point-wise shape correspondence from the template inexpensively, which benefits various graphics applications. In this paper, the Progressive Deformation-based Auto-Encoder (PDAE) is proposed to learn the stage-aware description through a coarse-to-fine shape fitting task. Experimental results show that the proposed PDAE has the ability to reconstruct 3D shapes with high fidelity, and consistent topology is preserved in the multi-stage deformation process. Additional applications based on the stage-aware description are performed, demonstrating its universality.

Other Neural Networks | Deep Learning | Models | Modeling (5 papers)

【1】 Max-Margin Contrastive Learning
Link: https://arxiv.org/abs/2112.11450

Authors: Anshul Shah, Suvrit Sra, Rama Chellappa, Anoop Cherian
Affiliations: Johns Hopkins University, Baltimore, MD; Massachusetts Institute of Technology, Cambridge, MA; Mitsubishi Electric Research Labs, Cambridge, MA
Notes: Accepted at AAAI 2022
Abstract: Standard contrastive learning approaches usually require a large number of negatives for effective unsupervised learning and often exhibit slow convergence. We suspect this behavior is due to the suboptimal selection of negatives used for offering contrast to the positives. We counter this difficulty by taking inspiration from support vector machines (SVMs) to present max-margin contrastive learning (MMCL). Our approach selects negatives as the sparse support vectors obtained via a quadratic optimization problem, and contrastiveness is enforced by maximizing the decision margin. As SVM optimization can be computationally demanding, especially in an end-to-end setting, we present simplifications that alleviate the computational burden. We validate our approach on standard vision benchmark datasets, demonstrating better performance in unsupervised representation learning over the state of the art, while having better empirical convergence properties.

【2】 PrimSeq: a deep learning-based pipeline to quantitate rehabilitation training
Link: https://arxiv.org/abs/2112.11330

Authors: Avinash Parnandi, Aakash Kaku, Anita Venkatesan, Natasha Pandit, Audre Wirtanen, Haresh Rajamohan, Kannan Venkataramanan, Dawn Nilsen, Carlos Fernandez-Granda, Heidi Schambra
Affiliations: Department of Neurology, New York University Langone Health, New York, USA; Center for Data Science, New York University, New York, USA; Department of Rehabilitation and Regenerative Medicine, Columbia University, New York
Abstract: Stroke rehabilitation seeks to increase neuroplasticity through the repeated practice of functional motions, but may have minimal impact on recovery because of insufficient repetitions. The optimal training content and quantity are currently unknown because no practical tools exist to measure them. Here, we present PrimSeq, a pipeline to classify and count functional motions trained in stroke rehabilitation. Our approach integrates wearable sensors to capture upper-body motion, a deep learning model to predict motion sequences, and an algorithm to tally motions. The trained model accurately decomposes rehabilitation activities into component functional motions, outperforming competitive machine learning methods. PrimSeq furthermore quantifies these motions at a fraction of the time and labor costs of human experts. We demonstrate the capabilities of PrimSeq in previously unseen stroke patients with a range of upper extremity motor impairment. We expect that these advances will support the rigorous measurement required for quantitative dosing trials in stroke rehabilitation.

【3】 Image quality enhancement of embedded holograms in holographic information hiding using deep neural networks
Link: https://arxiv.org/abs/2112.11246

Authors: Tomoyoshi Shimobaba, Sota Oshima, Takashi Kakue, Tomoyoshi Ito
Affiliations: Graduate School of Engineering, Chiba University, Yayoi-cho, Inage-ku, Chiba, Japan
Abstract: Holographic information hiding is a technique for embedding holograms or images into another hologram, used for copyright protection and steganography of holograms. Using deep neural networks, we offer a way to improve the visual quality of embedded holograms. The brightness of an embedded hologram is set to a fraction of that of the host hologram, resulting in a barely damaged reconstructed image of the host hologram. However, the embedded hologram's reconstructed image is difficult to perceive because it is darker than the reconstructed host image. In this study, we use deep neural networks to restore the darkened image.

【4】 Mapping industrial poultry operations at scale with deep learning and aerial imagery
Link: https://arxiv.org/abs/2112.10988

Authors: Caleb Robinson, Ben Chugg, Brandon Anderson, Juan M. Lavista Ferres, Daniel E. Ho
Affiliations: Microsoft AI for Good Research Lab, Redmond, WA; Stanford RegLab, Stanford, CA
Abstract: Concentrated Animal Feeding Operations (CAFOs) pose serious risks to air, water, and public health, but have proven to be challenging to regulate. The U.S. Government Accountability Office notes that a basic challenge is the lack of comprehensive location information on CAFOs. We use the USDA's National Agricultural Imagery Program (NAIP) 1m/pixel aerial imagery to detect poultry CAFOs across the continental United States. We train convolutional neural network (CNN) models to identify individual poultry barns and apply the best performing model to over 42 TB of imagery to create the first national, open-source dataset of poultry CAFOs. We validate the model predictions against a held-out validation set of poultry CAFO facility locations from 10 hand-labeled counties in California and demonstrate that this approach has significant potential to fill gaps in environmental monitoring.

【5】 Encoding Hierarchical Information in Neural Networks helps in Subpopulation Shift
Link: https://arxiv.org/abs/2112.10844

Authors: Amitangshu Mukherjee, Isha Garg, Kaushik Roy Affiliation: Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA Comments: 14 pages, 2 figures Abstract: Over the past decade, deep neural networks have proven to be adept in image classification tasks, often surpassing humans in terms of accuracy. However, standard neural networks often fail to understand the concept of hierarchical structures and dependencies among different classes for vision related tasks. Humans, on the other hand, seem to learn categories conceptually, progressively growing from understanding high-level concepts down to granular levels of categories. One of the issues arising from the inability of neural networks to encode such dependencies within their learned structure is that of subpopulation shift, where models are queried with novel unseen classes taken from a shifted population of the training set categories. Since the neural network treats each class as independent from all others, it struggles to categorize shifting populations that are dependent at higher levels of the hierarchy. In this work, we study the aforementioned problems through the lens of a novel conditional supervised training framework. We tackle subpopulation shift by a structured learning procedure that incorporates hierarchical information conditionally through labels. Furthermore, we introduce a notion of graphical distance to model the catastrophic effect of mispredictions. We show that learning in this structured hierarchical manner results in networks that are more robust against subpopulation shifts, with an improvement of around 2% in terms of accuracy and around 8.5% in terms of graphical distance over standard models on subpopulation shift benchmarks.
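The abstract does not define its "graphical distance", so the following is only a rough sketch of one plausible reading: the number of edges on the path between the predicted and true labels in a class hierarchy tree. The toy hierarchy `PARENT` and the function names are hypothetical and are not taken from the paper.

```python
# Hypothetical toy hierarchy (child -> parent); not from the paper.
PARENT = {
    "animal": None,
    "dog": "animal",
    "cat": "animal",
    "poodle": "dog",
    "beagle": "dog",
    "siamese": "cat",
}


def ancestors(label):
    """Return the path from `label` up to the root, starting with `label`."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def graph_distance(a, b):
    """Number of edges on the tree path between labels `a` and `b`."""
    path_a = ancestors(a)
    depth_in_a = {node: i for i, node in enumerate(path_a)}
    for steps_from_b, node in enumerate(ancestors(b)):
        if node in depth_in_a:  # first shared node is the lowest common ancestor
            return depth_in_a[node] + steps_from_b
    raise ValueError("labels are not in the same tree")
```

Under this reading, confusing two dog breeds costs less than confusing a dog breed with a cat breed, which is the kind of graded penalty a flat accuracy score cannot express.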

Other (9 papers)

【1】 ADJUST: A Dictionary-Based Joint Reconstruction and Unmixing Method for Spectral Tomography Link: https://arxiv.org/abs/2112.11406

Authors: Mathé T. Zeegers, Ajinkya Kadu, Tristan van Leeuwen, Kees Joost Batenburg Affiliations: Centrum Wiskunde & Informatica, Science Park , XG Amsterdam, The Netherlands, University of Antwerp, Groenenborgerlaan , Antwerp, Belgium, Mathematical Institute, Utrecht University, Budapestlaan , CD Utrecht, The Netherlands Comments: This paper is under consideration at Inverse Problems. 28 pages, 16 figures Abstract: Advances in multi-spectral detectors are causing a paradigm shift in X-ray Computed Tomography (CT). Spectral information acquired from these detectors can be used to extract volumetric material composition maps of the object of interest. If the materials and their spectral responses are known a priori, the image reconstruction step is rather straightforward. If they are not known, however, the maps as well as the responses need to be estimated jointly. A conventional workflow in spectral CT involves performing volume reconstruction followed by material decomposition, or vice versa. However, these methods inherently suffer from the ill-posedness of the joint reconstruction problem. To resolve this issue, we propose 'A Dictionary-based Joint reconstruction and Unmixing method for Spectral Tomography' (ADJUST). Our formulation relies on forming a dictionary of spectral signatures of materials common in CT and prior knowledge of the number of materials present in an object. In particular, we decompose the spectral volume linearly in terms of spatial material maps, a spectral dictionary, and the indicator of materials for the dictionary elements. We propose a memory-efficient accelerated alternating proximal gradient method to find an approximate solution to the resulting bi-convex problem. From numerical demonstrations on several synthetic phantoms, we observe that ADJUST performs exceedingly well when compared to other state-of-the-art methods. Additionally, we address the robustness of ADJUST against limited measurement patterns.

【2】 Shape from Polarization for Complex Scenes in the Wild Link: https://arxiv.org/abs/2112.11377

Authors: Chenyang Lei, Chenyang Qi, Jiaxin Xie, Na Fan, Vladlen Koltun, Qifeng Chen Affiliations: HKUST, Apple Abstract: We present a new data-driven approach with physics-based priors to scene-level normal estimation from a single polarization image. Existing shape from polarization (SfP) works mainly focus on estimating the normal of a single object rather than complex scenes in the wild. A key barrier to high-quality scene-level SfP is the lack of real-world SfP data in complex scenes. Hence, we contribute the first real-world scene-level SfP dataset with paired input polarization images and ground-truth normal maps. Then we propose a learning-based framework with a multi-head self-attention module and viewing encoding, which is designed to handle increasing polarization ambiguities caused by complex materials and non-orthographic projection in scene-level SfP. Our trained model can be generalized to far-field outdoor scenes as the relationship between polarized light and surface normals is not affected by distance. Experimental results demonstrate that our approach significantly outperforms existing SfP models on two datasets. Our dataset and source code will be publicly available at https://github.com/ChenyangLEI/sfp-wild.

【3】 Transferable End-to-end Room Layout Estimation via Implicit Encoding Link: https://arxiv.org/abs/2112.11340

Authors: Hao Zhao, Rene Ranftl, Yurong Chen, Hongbin Zha Affiliations: Peking University, Intel Labs Comments: Project: this https URL Abstract: We study the problem of estimating room layouts from a single panorama image. Most former works have two stages: feature extraction and parametric model fitting. Here we propose an end-to-end method that directly predicts parametric layouts from an input panorama image. It exploits an implicit encoding procedure that embeds parametric layouts into a latent space. Then learning a mapping from images to this latent space makes end-to-end room layout estimation possible. However, end-to-end methods have several notorious drawbacks despite many intriguing properties. A widely raised criticism is that they are troubled with dataset bias and do not transfer to unfamiliar domains. Our study echoes this common belief. To this end, we propose to use semantic boundary prediction maps as an intermediate domain. It brings a significant performance boost on four benchmarks (Structured3D, PanoContext, S3DIS, and Matterport3D), notably in the zero-shot transfer setting. Code, data, and models will be released.

【4】 Multispectral image fusion by super pixel statistics Link: https://arxiv.org/abs/2112.11329

Author: Nati Ofir Abstract: Multispectral image fusion is a fundamental problem of remote sensing and image processing. This problem is addressed by both classic and deep learning approaches. This paper is focused on the classic solutions and introduces a novel approach to this family. The proposed method carries out multispectral image fusion based on the content of the fused images. It relies on analysis based on the level of information on segmented superpixels in the fused inputs. Specifically, I address the task of visible color RGB to Near-Infrared (NIR) fusion. The RGB image captures the color of the scene while the NIR captures details and sees beyond haze and clouds. Since each channel senses different information of the scene, their fusion is challenging and interesting. The proposed method is designed to produce a fusion that contains the advantages of each spectrum. The experiments in this manuscript show that the proposed method is visually informative compared with other classic fusion methods and can run quickly on embedded devices with no need for heavy computation resources.

【5】 Improving Robustness with Image Filtering Link: https://arxiv.org/abs/2112.11235

Authors: Matteo Terzi, Mattia Carletti, Gian Antonio Susto Affiliation: University of Padova Abstract: Adversarial robustness is one of the most challenging problems in Deep Learning and Computer Vision research. All the state-of-the-art techniques require a time-consuming procedure that creates cleverly perturbed images. Due to its cost, many solutions have been proposed to avoid Adversarial Training. However, all these attempts proved ineffective as the attacker manages to exploit spurious correlations among pixels to trigger brittle features implicitly learned by the model. This paper first introduces a new image filtering scheme called Image-Graph Extractor (IGE) that extracts the fundamental nodes of an image and their connections through a graph structure. By leveraging the IGE representation, we build a new defense method, Filtering As a Defense, that does not allow the attacker to entangle pixels to create malicious patterns. Moreover, we show that data augmentation with filtered images effectively improves the model's robustness to data corruption. We validate our techniques on CIFAR-10, CIFAR-100, and ImageNet.

【6】 RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality Link: https://arxiv.org/abs/2112.11081

Authors: Xiaohan Ding, Honghao Chen, Xiangyu Zhang, Jungong Han, Guiguang Ding Affiliations: Beijing National Research Center for Information Science and Technology (BNRist), School of Software, Tsinghua University, Beijing, China, Institute of Automation, Chinese Academy of Sciences, MEGVII Technology Comments: The code and models are available at this https URL. arXiv admin note: text overlap with arXiv:2105.01883 Abstract: Compared to convolutional layers, fully-connected (FC) layers are better at modeling the long-range dependencies but worse at capturing the local patterns, hence usually less favored for image recognition. In this paper, we propose a methodology, Locality Injection, to incorporate local priors into an FC layer via merging the trained parameters of a parallel conv kernel into the FC kernel. Locality Injection can be viewed as a novel Structural Re-parameterization method since it equivalently converts the structures via transforming the parameters. Based on that, we propose a multi-layer-perceptron (MLP) block named RepMLP Block, which uses three FC layers to extract features, and a novel architecture named RepMLPNet. The hierarchical design distinguishes RepMLPNet from the other concurrently proposed vision MLPs. As it produces feature maps of different levels, it qualifies as a backbone model for downstream tasks like semantic segmentation. Our results reveal that 1) Locality Injection is a general methodology for MLP models; 2) RepMLPNet has favorable accuracy-efficiency trade-off compared to the other MLPs; 3) RepMLPNet is the first MLP that seamlessly transfers to Cityscapes semantic segmentation. The code and models are available at https://github.com/DingXiaoH/RepMLP.
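The idea of merging a parallel conv kernel into an FC kernel can be illustrated with a small NumPy sketch. It relies only on the fact that a convolution is a linear map, so an equivalent matrix can be recovered by convolving one-hot basis images. This is a simplified single-channel illustration under my own assumptions (no batch norm, no channels or groups, hypothetical helper names), not the RepMLP implementation.

```python
import numpy as np


def conv2d_same(img, kernel):
    """Single-channel 2D cross-correlation with zero padding ('same' output size)."""
    k = kernel.shape[0]
    padded = np.pad(img, k // 2)
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out


def conv_to_fc(kernel, h, w):
    """Build the (h*w, h*w) matrix M such that M @ x.ravel() equals the conv output."""
    n = h * w
    matrix = np.zeros((n, n))
    for idx in range(n):
        basis = np.zeros(n)
        basis[idx] = 1.0  # one-hot image: its conv output is column idx of M
        matrix[:, idx] = conv2d_same(basis.reshape(h, w), kernel).ravel()
    return matrix


rng = np.random.default_rng(0)
h, w = 4, 5
fc_weight = rng.normal(size=(h * w, h * w))
conv_kernel = rng.normal(size=(3, 3))
x = rng.normal(size=(h, w))

# Training-time structure: an FC branch plus a parallel conv branch.
two_branch = (fc_weight @ x.ravel()).reshape(h, w) + conv2d_same(x, conv_kernel)

# Inference-time structure: a single FC layer with the conv merged into its weights.
merged_weight = fc_weight + conv_to_fc(conv_kernel, h, w)
single_branch = (merged_weight @ x.ravel()).reshape(h, w)
```

Both structures produce the same output up to floating-point error, which is what "equivalently converts the structures via transforming the parameters" means in this simplified setting.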

【7】 A Theoretical View of Linear Backpropagation and Its Convergence Link: https://arxiv.org/abs/2112.11018

Authors: Ziang Li, Yiwen Guo, Haodi Liu, Changshui Zhang Affiliations: Institute for Artificial Intelligence, Tsinghua University (THUAI), Department of Automation, Tsinghua University Abstract: Backpropagation is widely used for calculating gradients in deep neural networks (DNNs). Applied often along with stochastic gradient descent (SGD) or its variants, backpropagation is considered as a de-facto choice in a variety of machine learning tasks including DNN training and adversarial attack/defense. Recently, a linear variant of BP named LinBP was introduced for generating more transferable adversarial examples for black-box adversarial attacks, by Guo et al. Yet, it has not been theoretically studied and the convergence analysis of such a method is lacking. This paper serves as a complement and somewhat an extension to Guo et al.'s paper, by providing theoretical analyses on LinBP in neural-network-involved learning tasks including adversarial attack and model training. We demonstrate that, somewhat surprisingly, LinBP can lead to faster convergence in these tasks in the same hyper-parameter settings, compared to BP. We confirm our theoretical results with extensive experiments.
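The abstract does not spell out how LinBP differs from standard BP. The sketch below assumes the common description of LinBP: the forward pass is unchanged, but the backward pass treats the ReLU as an identity, skipping its derivative mask. The two-layer network and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def forward(x, w1, w2):
    """Two-layer network: y = w2 . relu(w1 @ x). Returns pre-activations and output."""
    z1 = w1 @ x
    a1 = np.maximum(z1, 0.0)
    return z1, float(w2 @ a1)


def grad_bp(x, w1, w2):
    """Standard backpropagation: gradient of y w.r.t. x, including the ReLU mask."""
    z1, _ = forward(x, w1, w2)
    relu_mask = (z1 > 0).astype(float)
    return w1.T @ (w2 * relu_mask)


def grad_linbp(x, w1, w2):
    """Assumed LinBP: same forward pass, but ReLU is treated as identity going backward."""
    forward(x, w1, w2)  # the forward computation itself is unchanged
    return w1.T @ w2


w1 = np.eye(2)
w2 = np.array([2.0, 3.0])
x = np.array([1.0, -1.0])
g_bp = grad_bp(x, w1, w2)        # the second unit is inactive, so its path is zeroed
g_linbp = grad_linbp(x, w1, w2)  # both paths contribute
```

When every pre-activation is positive the two gradients coincide; they differ only where the ReLU would have blocked the gradient.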

【8】 Generalizing Interactive Backpropagating Refinement for Dense Prediction Link: https://arxiv.org/abs/2112.10969

Authors: Fanqing Lin, Brian Price, Tony Martinez Affiliations: Brigham Young University, Adobe Research Abstract: As deep neural networks become the state-of-the-art approach in the field of computer vision for dense prediction tasks, many methods have been developed for automatic estimation of the target outputs given the visual inputs. Although the estimation accuracy of the proposed automatic methods continues to improve, interactive refinement is oftentimes necessary for further correction. Recently, feature backpropagating refinement scheme (f-BRS) has been proposed for the task of interactive segmentation, which enables efficient optimization of a small set of auxiliary variables inserted into the pretrained network to produce object segmentation that better aligns with user inputs. However, the proposed auxiliary variables only contain channel-wise scale and bias, limiting the optimization to global refinement only. In this work, in order to generalize backpropagating refinement for a wide range of dense prediction tasks, we introduce a set of G-BRS (Generalized Backpropagating Refinement Scheme) layers that enable both global and localized refinement for the following tasks: interactive segmentation, semantic segmentation, image matting and monocular depth estimation. Experiments on SBD, Cityscapes, Mapillary Vista, Composition-1k and NYU-Depth-V2 show that our method can successfully generalize and significantly improve performance of existing pretrained state-of-the-art models with only a few clicks.

【9】 DRPN: Making CNN Dynamically Handle Scale Variation Link: https://arxiv.org/abs/2112.10963

Authors: Jingchao Peng, Haitao Zhao, Zhengwei Hu, Yi Zhuang, Bofan Wang Affiliations: East China University of Science and Technology, Automation Department, School of Information Science and Engineering Abstract: Based on our observations of infrared targets, serious scale variation within sequence frames occurs frequently. In this paper, we propose a dynamic re-parameterization network (DRPN) to deal with the scale variation and balance the detection precision between small targets and large targets in infrared datasets. DRPN adopts multiple branches with different sizes of convolution kernels and a dynamic convolution strategy. Multiple branches with different sizes of convolution kernels have different sizes of receptive fields. The dynamic convolution strategy makes DRPN adaptively weight multiple branches. DRPN can dynamically adjust the receptive field according to the scale variation of the target. Besides, in order to maintain effective inference in the test phase, the multi-branch structure is further converted to a single-branch structure via the re-parameterization technique after training. Extensive experiments on FLIR, KAIST, and InfraPlane datasets demonstrate the effectiveness of our proposed DRPN. The experimental results show that detectors using the proposed DRPN as the basic structure rather than SKNet or TridentNet obtained the best performances.
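The abstract does not say how the weighted multi-branch structure is collapsed into one branch. A common way to merge parallel convolutions of different kernel sizes, shown here only as an assumption, is to zero-pad each smaller kernel to the largest size and sum them with their branch weights. The sketch treats the branch weights as fixed at conversion time; `merge_branches` and the example weights are hypothetical.

```python
import numpy as np


def conv2d_same(img, kernel):
    """Single-channel 2D cross-correlation with zero padding ('same' output size)."""
    k = kernel.shape[0]
    padded = np.pad(img, k // 2)
    h, w = img.shape
    return np.array([[np.sum(padded[i:i + k, j:j + k] * kernel) for j in range(w)]
                     for i in range(h)])


def merge_branches(kernels, weights, size):
    """Zero-pad each odd-sized kernel to (size, size) and return the weighted sum."""
    merged = np.zeros((size, size))
    for kernel, weight in zip(kernels, weights):
        pad = (size - kernel.shape[0]) // 2
        merged += weight * np.pad(kernel, pad)
    return merged


rng = np.random.default_rng(1)
kernels = [rng.normal(size=(k, k)) for k in (1, 3, 5)]
branch_weights = [0.2, 0.5, 0.3]  # assumed fixed branch weights, for illustration only
x = rng.normal(size=(6, 7))

# Multi-branch output: weighted sum of three convolutions with different receptive fields.
multi_branch = sum(wt * conv2d_same(x, k) for wt, k in zip(branch_weights, kernels))

# Single-branch output: one 5x5 convolution with the merged kernel.
single_branch = conv2d_same(x, merge_branches(kernels, branch_weights, size=5))
```

The two outputs agree up to floating-point error, so inference needs only one convolution regardless of how many branches were used in training.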

Machine translation; for reference only.

This article is part of the Tencent Cloud self-media syndication program and was shared from a WeChat official account.

Originally published: 2021-12-22. In case of infringement, please contact cloudcommunity@tencent.com for removal.
