
Computer Vision arXiv Daily Digest [7.8]

By: arXiv每日学术速递 (arXiv Daily Academic Digest, WeChat official account)
Published 2021-07-27 10:42:07

cs.CV: 62 papers today

Transformer (6 papers)

【1】 Long Short-Term Transformer for Online Action Detection

Authors: Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu, Stefano Soatto
Affiliations: Amazon AWS AI
Note: Technical report
Link: https://arxiv.org/abs/2107.03377
Abstract: In this paper, we present Long Short-term TRansformer (LSTR), a new temporal modeling algorithm for online action detection, which employs a long- and short-term memory mechanism able to model prolonged sequence data. It consists of an LSTR encoder that dynamically exploits coarse-scale historical information from an extensively long time window (e.g., 2048 long-range frames of up to 8 minutes), together with an LSTR decoder that focuses on a short time window (e.g., 32 short-range frames of 8 seconds) to model the fine-scale characterization of the ongoing event. Compared to prior work, LSTR provides an effective and efficient method to model long videos with less heuristic algorithm design. LSTR achieves significantly improved results over existing state-of-the-art approaches on standard online action detection benchmarks: THUMOS'14, TVSeries, and HACS Segment. Extensive empirical analysis validates the setup of the long- and short-term memories and the design choices of LSTR.
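
A minimal PyTorch sketch may make the long- and short-term memory idea concrete. This is not the authors' code; the latent-token count, feature dimension, and class count are illustrative assumptions. A set of learned latent tokens cross-attends over the long history to compress it, and the short-term window then attends to that compressed memory:

```python
import torch
import torch.nn as nn

class LSTRSketch(nn.Module):
    """Illustrative long/short-term memory transformer (not the official LSTR)."""
    def __init__(self, dim=256, n_latents=16, n_heads=8, n_classes=21):
        super().__init__()
        # Learnable latent tokens that summarize the long-term memory.
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        # Encoder: latents attend over the long-range frame features.
        self.enc_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Decoder: short-range frames attend over the compressed memory.
        self.dec_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, long_mem, short_mem):
        # long_mem: (B, 2048, dim) ~ up to 8 minutes of frame features
        # short_mem: (B, 32, dim)  ~ 8 seconds of recent frames
        q = self.latents.unsqueeze(0).expand(long_mem.size(0), -1, -1)
        compressed, _ = self.enc_attn(q, long_mem, long_mem)   # (B, 16, dim)
        fused, _ = self.dec_attn(short_mem, compressed, compressed)
        return self.cls(fused)  # per-frame action logits for the short window

x_long, x_short = torch.randn(2, 2048, 256), torch.randn(2, 32, 256)
print(LSTRSketch()(x_long, x_short).shape)  # torch.Size([2, 32, 21])
```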

【2】 Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World

Authors: Jiaming Zhang, Kailun Yang, Angela Constantinescu, Kunyu Peng, Karin Müller, Rainer Stiefelhagen
Affiliations: Karlsruhe Institute of Technology
Note: 8 figures, 6 tables
Link: https://arxiv.org/abs/2107.03172
Abstract: Common fully glazed facades and transparent objects present architectural barriers and impede the mobility of people with low vision or blindness; for instance, a path detected behind a glass door is inaccessible unless it is correctly perceived and reacted to. However, segmenting these safety-critical objects is rarely covered by conventional assistive technologies. To tackle this issue, we construct a wearable system with a novel dual-head Transformer for Transparency (Trans4Trans) model, which is capable of segmenting general and transparent objects and performing real-time wayfinding to assist people walking alone more safely. In particular, both decoders created by our proposed Transformer Parsing Module (TPM) enable effective joint learning from different datasets. Besides, the efficient Trans4Trans model, composed of a symmetric transformer-based encoder and decoder, requires little computational expense and is readily deployed on portable GPUs. Our Trans4Trans model outperforms state-of-the-art methods on the test sets of the Stanford2D3D and Trans10K-v2 datasets, obtaining mIoU of 45.13% and 75.14%, respectively. Through various pre-tests and a user study conducted in indoor and outdoor scenarios, the usability and reliability of our assistive system have been extensively verified.

【3】 Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition

Authors: Mouath Aouayeb, Wassim Hamidouche, Catherine Soladie, Kidiyo Kpalma, Renaud Seguier
Affiliations: National Institute for Applied Sciences of Rennes; CentraleSupélec - Rennes Campus, Rennes, France
Link: https://arxiv.org/abs/2107.03107
Abstract: As various databases of facial expressions have been made accessible over the last few decades, the Facial Expression Recognition (FER) task has attracted a lot of interest. The multiple sources of the available databases raise several challenges for the facial recognition task. These challenges are usually addressed by Convolutional Neural Network (CNN) architectures. Different from CNN models, a Transformer model based on the attention mechanism has recently been presented to address vision tasks. One of the major issues with Transformers is the need for large training data, while most FER databases are limited compared to other vision applications. Therefore, we propose in this paper to learn a vision Transformer jointly with a Squeeze and Excitation (SE) block for the FER task. The proposed method is evaluated on different publicly available FER databases including CK+, JAFFE, RAF-DB and SFEW. Experiments demonstrate that our model outperforms state-of-the-art methods on CK+ and SFEW and achieves competitive results on JAFFE and RAF-DB.
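
The ViT-plus-SE pairing is easy to sketch. Below is a minimal, hypothetical version (token shapes follow ViT-B/16 conventions and the 7-class head is an assumption; the paper's exact placement of the SE block may differ): a squeeze-and-excitation block gates the channels of the token features before classification.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation applied to transformer token features."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, tokens):              # tokens: (B, N, dim)
        gate = self.fc(tokens.mean(dim=1))  # squeeze over tokens, excite channels
        return tokens * gate.unsqueeze(1)   # channel-wise re-weighting

tokens = torch.randn(4, 197, 768)          # e.g., ViT-B/16 CLS + patch tokens
refined = SEBlock(768)(tokens)
logits = nn.Linear(768, 7)(refined[:, 0])  # 7 expression classes from CLS token
print(logits.shape)                        # torch.Size([4, 7])
```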

【4】 SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers

Authors: Danfeng Hong, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang, Antonio Plaza, Jocelyn Chanussot
Affiliations: Key Laboratory of Digital Earth Science
Link: https://arxiv.org/abs/2107.02988
Abstract: Hyperspectral (HS) images are characterized by approximately contiguous spectral information, enabling the fine identification of materials by capturing subtle spectral discrepancies. Owing to their excellent locally contextual modeling ability, convolutional neural networks (CNNs) have proven to be a powerful feature extractor in HS image classification. However, CNNs fail to mine and represent the sequence attributes of spectral signatures well due to the limitations of their inherent network backbone. To solve this issue, we rethink HS image classification from a sequential perspective with transformers, and propose a novel backbone network called SpectralFormer. Beyond band-wise representations in classic transformers, SpectralFormer is capable of learning spectrally local sequence information from neighboring bands of HS images, yielding group-wise spectral embeddings. More significantly, to reduce the possibility of losing valuable information in the layer-wise propagation process, we devise a cross-layer skip connection to convey memory-like components from shallow to deep layers by adaptively learning to fuse "soft" residuals across layers. It is worth noting that the proposed SpectralFormer is a highly flexible backbone network, applicable to both pixel- and patch-wise inputs. We evaluate the classification performance of the proposed SpectralFormer on three HS datasets by conducting extensive experiments, showing its superiority over classic transformers and achieving a significant improvement in comparison with state-of-the-art backbone networks. The code of this work will be available at https://sites.google.com/view/danfeng-hong for the sake of reproducibility.
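
A rough sketch of the two ingredients, under assumed shapes (200 bands, groups of 5) rather than the paper's actual configuration: a strided 1-D convolution over the band axis produces group-wise spectral embeddings, and a learned gate fuses a "soft" shallow-layer residual into a deep layer.

```python
import torch
import torch.nn as nn

class GroupSpectralEmbed(nn.Module):
    """Embed groups of neighboring spectral bands as tokens (illustrative)."""
    def __init__(self, n_bands=200, group=5, dim=64):
        super().__init__()
        # A 1-D conv over the band axis pools each local band neighborhood.
        self.proj = nn.Conv1d(1, dim, kernel_size=group, stride=group)

    def forward(self, spectra):                  # spectra: (B, n_bands) per pixel
        x = self.proj(spectra.unsqueeze(1))      # (B, dim, n_bands // group)
        return x.transpose(1, 2)                 # (B, n_tokens, dim)

class CrossLayerFusion(nn.Module):
    """Adaptively fuse a shallow-layer 'soft' residual into a deep layer."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, deep, shallow):
        g = self.gate(torch.cat([deep, shallow], dim=-1))
        return deep + g * shallow                # learned soft skip connection

tokens = GroupSpectralEmbed()(torch.randn(8, 200))
fused = CrossLayerFusion(64)(tokens, tokens)     # shallow == deep here for demo
print(fused.shape)                               # torch.Size([8, 40, 64])
```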

【5】 GLiT: Neural Architecture Search for Global and Local Image Transformer

Authors: Boyu Chen, Peixia Li, Chuming Li, Baopu Li, Lei Bai, Chen Lin, Ming Sun, Junjie Yan, Wanli Ouyang
Affiliations: The University of Sydney; SenseTime Group Limited; BAIDU USA LLC; University of Oxford
Link: https://arxiv.org/abs/2107.02960
Abstract: We introduce the first Neural Architecture Search (NAS) method to find a better transformer architecture for image recognition. Recently, transformers without CNN-based backbones have been found to achieve impressive performance for image recognition. However, the transformer is designed for NLP tasks and thus could be sub-optimal when directly used for image recognition. In order to improve the visual representation ability of transformers, we propose a new search space and searching algorithm. Specifically, we introduce a locality module that models the local correlations in images explicitly with lower computational cost. With the locality module, our search space is defined to let the search algorithm freely trade off between global and local information, as well as optimizing the low-level design choice in each module. To tackle the problem caused by the huge search space, a hierarchical neural architecture search method is proposed to search the optimal vision transformer at two levels separately with an evolutionary algorithm. Extensive experiments on the ImageNet dataset demonstrate that our method can find more discriminative and efficient transformer variants than the ResNet family (e.g., ResNet101) and the baseline ViT for image classification.

【6】 Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries

Authors: Xinghua Ma, Gongning Luo, Wei Wang, Kuanquan Wang
Affiliations: Harbin Institute of Technology, Harbin, China
Link: https://arxiv.org/abs/2107.03035
Abstract: Coronary artery disease (CAD) has long posed a leading threat to the lives of cardiovascular disease patients worldwide. Therefore, automated diagnosis of CAD has indispensable significance in clinical medicine. However, the complexity of coronary artery plaques that cause CAD makes the automatic detection of coronary artery stenosis in Coronary CT angiography (CCTA) a difficult task. In this paper, we propose a Transformer network (TR-Net) for the automatic detection of significant stenosis (i.e., luminal narrowing > 50%) while practically completing the computer-assisted diagnosis of CAD. The proposed TR-Net introduces a novel Transformer and tightly combines convolutional layers and Transformer encoders, allowing their advantages to be demonstrated in the task. By analyzing semantic information sequences, TR-Net can fully understand the relationship between image information at each position of a multiplanar reformatted (MPR) image, and accurately detect significant stenosis based on both local and global information. We evaluate our TR-Net on a dataset of 76 patients annotated by experienced radiologists. Experimental results illustrate that our TR-Net achieves better results in ACC (0.92), Spec (0.96), PPV (0.84), F1 (0.79) and MCC (0.74) compared with state-of-the-art methods. The source code is publicly available at https://github.com/XinghuaMa/TR-Net.

Detection (4 papers)

【1】 Video-Based Camera Localization Using Anchor View Detection and Recursive 3D Reconstruction

Authors: Hajime Taira, Koki Onbe, Naoyuki Miyashita, Masatoshi Okutomi
Affiliations: Tokyo Institute of Technology; Olympus Corporation
Note: Accepted to the 17th International Conference on Machine Vision Applications (MVA 2021)
Link: https://arxiv.org/abs/2107.03068
Abstract: In this paper we introduce a new camera localization strategy designed for image sequences captured in challenging industrial situations such as industrial parts inspection. To deal with peculiar appearances that hurt the standard 3D reconstruction pipeline, we exploit prior knowledge of the scene by selecting key frames in the sequence (called anchors) that are roughly connected to a certain location. Our method then seeks the location of each frame in time order, while recursively updating an augmented 3D model which can provide the current camera location and surrounding 3D structure. In an experiment on a practical industrial situation, our method can localize over 99% of frames in the input sequence, whereas standard localization methods fail to reconstruct a complete camera trajectory.

【2】 A convolutional neural network for teeth margin detection on 3-dimensional dental meshes

Authors: Hu Chen, Hong Li, Bifu Hu, Kenan Ma, Yuchun Sun
Affiliations: Center of Digital Dentistry, Department of Prosthodontics, Peking University; National Engineering Laboratory for Digital and Material Technology of Stomatology; Research Center of Engineering and Technology for Digital Dentistry
Note: 11 pages, 4 figures
Link: https://arxiv.org/abs/2107.03030
Abstract: We propose a convolutional neural network for vertex classification on 3-dimensional dental meshes, and use it to detect teeth margins. An expanding layer is constructed to collect statistical values of neighbor vertex features and compute new features for each vertex with convolutional neural networks. An end-to-end neural network is proposed that takes vertex features, including coordinates, curvatures and distance, as input and outputs a classification label for each vertex. Several network structures with different expanding-layer parameters, and a baseline network without expanding layers, were designed and trained on 1,156 dental meshes. Accuracy, recall and precision were validated on 145 dental meshes to rate the best network structure, which was finally tested on another 144 dental meshes. All networks with our expanding layers performed better than the baseline, and the best one achieved an accuracy of 0.877 on both the validation and test datasets.
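
The expanding layer is straightforward to sketch. The version below is illustrative: the particular statistics collected (mean/max/min), the neighbor count K, and the precomputed `neighbor_idx` adjacency are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def expand(features, neighbor_idx):
    """Collect statistics of neighbor vertex features (illustrative).

    features: (V, C) per-vertex features; neighbor_idx: (V, K) indices of
    K neighbors per vertex. Returns (V, 4C): self + mean/max/min of neighbors.
    """
    neigh = features[neighbor_idx]  # gather neighbor features: (V, K, C)
    stats = [features, neigh.mean(1), neigh.max(1).values, neigh.min(1).values]
    return torch.cat(stats, dim=-1)

# Per-vertex MLP applied after the expanding step (margin vs. non-margin).
V, K, C = 1000, 8, 16
feats = torch.randn(V, C)
idx = torch.randint(0, V, (V, K))   # assumed precomputed from mesh adjacency
mlp = nn.Sequential(nn.Linear(4 * C, 32), nn.ReLU(), nn.Linear(32, 2))
logits = mlp(expand(feats, idx))
print(logits.shape)                 # torch.Size([1000, 2])
```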

【3】 VIN: Voxel-based Implicit Network for Joint 3D Object Detection and Segmentation for Lidars

Authors: Yuanxin Zhong, Minghan Zhu, Huei Peng
Affiliations: Mechanical Engineering, University of Michigan, Ann Arbor, USA
Link: https://arxiv.org/abs/2107.02980
Abstract: A unified neural network structure is presented for joint 3D object detection and point cloud segmentation in this paper. We leverage rich supervision from both detection and segmentation labels rather than using just one of them. In addition, an extension of single-stage object detectors is proposed based on the implicit function widely used in 3D scene and object understanding. The extension branch takes the final feature map from the object detection module as input, and produces an implicit function that generates a semantic distribution for each point from its corresponding voxel center. We demonstrate the performance of our structure on nuScenes-lidarseg, a large-scale outdoor dataset. Our solution achieves competitive results against state-of-the-art methods in both 3D object detection and point cloud segmentation, with little additional computation load compared with object detection solutions. The capability of the proposed method for efficient weakly supervised semantic segmentation is also validated by experiments.

【4】 Disentangle Your Dense Object Detector

Authors: Zehui Chen, Chenhongyi Yang, Qiaofei Li, Feng Zhao, Zhengjun Zha, Feng Wu
Affiliations: University of Science and Technology of China, Hefei, China; University of Edinburgh, Edinburgh, United Kingdom; SenseTime, Shanghai, China
Link: https://arxiv.org/abs/2107.02963
Abstract: Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding. However, the current training pipeline for dense detectors is compromised by many conjunctions that may not hold. In this paper, we investigate three such important conjunctions: 1) only samples assigned as positive in the classification head are used to train the regression head; 2) classification and regression share the same input feature and computational fields defined by the parallel head architecture; and 3) samples distributed in different feature pyramid layers are treated equally when computing the loss. We first carry out a series of pilot experiments to show that disentangling such conjunctions can lead to persistent performance improvement. Then, based on these findings, we propose the Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into current state-of-the-art dense object detectors. Extensive experiments on the MS COCO benchmark show that our approach can lead to 2.0 mAP, 2.4 mAP and 2.2 mAP absolute improvements on RetinaNet, FCOS, and ATSS baselines with negligible extra overhead. Notably, our best model reaches 55.0 mAP on the COCO test-dev set and 93.5 AP on the hard subset of WIDER FACE, achieving new state-of-the-art performance on these two competitive benchmarks. Code is available at https://github.com/zehuichen123/DDOD.

Classification & Recognition (9 papers)

【1】 IntraLoss: Further Margin via Gradient-Enhancing Term for Deep Face Recognition

Authors: Chengzhi Jiang, Yanzhou Su, Wen Wang, Haiwei Bai, Haijun Liu, Jian Cheng
Link: https://arxiv.org/abs/2107.03352
Abstract: Existing classification-based face recognition methods have achieved remarkable progress by introducing large margins into the hypersphere manifold to learn discriminative facial representations. However, the feature distribution is ignored, and a poor feature distribution can wipe out the performance improvement brought by the margin scheme. Recent studies focus on the unbalanced inter-class distribution and form equidistributed feature representations by penalizing the angle between an identity and its nearest neighbor. But the problem is more than that: we also found anisotropy in the intra-class distribution. In this paper, we propose a 'gradient-enhancing term' that concentrates on the distribution characteristics within the class. This method, named IntraLoss, explicitly performs gradient enhancement in the anisotropic region so that the intra-class distribution continues to shrink, resulting in an isotropic and more compact intra-class distribution and further margin between identities. Experimental results on LFW, YTF and CFP-FP show that ours outperforms state-of-the-art methods by gradient enhancement, demonstrating the superiority of our method. In addition, our method has an intuitive geometric interpretation and can be easily combined with existing methods to solve previously ignored problems.
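
The abstract does not give the exact form of the gradient-enhancing term, so the following is only a loose illustration of the stated goal (shrinking the intra-class distribution on the hypersphere): a generic angular compactness penalty that pulls each normalized feature toward its class center.

```python
import torch
import torch.nn.functional as F

def intra_compactness(features, labels, centers):
    """Illustrative intra-class angular penalty (not the paper's exact term).

    Pulls each normalized feature toward its class center on the hypersphere,
    shrinking the intra-class distribution as IntraLoss aims to do.
    """
    f = F.normalize(features, dim=1)
    c = F.normalize(centers, dim=1)[labels]   # (B, D) matched class centers
    cos = (f * c).sum(dim=1).clamp(-1 + 1e-7, 1 - 1e-7)
    return (1.0 - cos).mean()                 # 0 when features hit centers

feats = torch.randn(32, 512, requires_grad=True)
labels = torch.randint(0, 10, (32,))
centers = torch.randn(10, 512)
loss = intra_compactness(feats, labels, centers)
loss.backward()                               # gradients shrink intra-class spread
print(float(loss))
```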

【2】 Categorical Relation-Preserving Contrastive Knowledge Distillation for Medical Image Classification

Authors: Xiaohan Xing, Yuenan Hou, Hang Li, Yixuan Yuan, Hongsheng Li, Max Q.-H. Meng
Affiliations: Department of Electronic Engineering, The Chinese University of Hong Kong; Department of Information Engineering, The Chinese University of Hong Kong; School of Informatics, Xiamen University
Link: https://arxiv.org/abs/2107.03225
Abstract: The amount of medical images available for training deep classification models is typically very scarce, making these deep models prone to overfitting the training data. Studies have shown that knowledge distillation (KD), especially the mean-teacher framework, which is more robust to perturbations, can help mitigate the overfitting effect. However, directly transferring KD from computer vision to medical image classification yields inferior performance, as medical images suffer from higher intra-class variance and class imbalance. To address these issues, we propose a novel Categorical Relation-preserving Contrastive Knowledge Distillation (CRCKD) algorithm, which takes the commonly used mean-teacher model as the supervisor. Specifically, we propose a novel Class-guided Contrastive Distillation (CCD) module to pull closer positive image pairs from the same class in the teacher and student models, while pushing apart negative image pairs from different classes. With this regularization, the feature distribution of the student model shows higher intra-class similarity and inter-class variance. Besides, we propose a Categorical Relation Preserving (CRP) loss to distill the teacher's relational knowledge in a robust and class-balanced manner. With the contributions of the CCD and CRP, our CRCKD algorithm can distill the relational knowledge more comprehensively. Extensive experiments on the HAM10000 and APTOS datasets demonstrate the superiority of the proposed CRCKD method.
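
A plausible sketch of class-guided contrastive distillation, in an assumed InfoNCE style that may differ from the paper's exact CCD formulation: student features are attracted to teacher features of the same class and repelled from those of other classes.

```python
import torch
import torch.nn.functional as F

def class_guided_contrastive_distill(student, teacher, labels, tau=0.1):
    """Sketch of a class-guided contrastive distillation loss (assumed form)."""
    s = F.normalize(student, dim=1)              # (B, D) student features
    t = F.normalize(teacher, dim=1).detach()     # teacher acts as fixed supervisor
    logits = s @ t.t() / tau                     # (B, B) pairwise similarities
    pos = labels.unsqueeze(1) == labels.unsqueeze(0)  # same-class mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-likelihood over all same-class teacher anchors per sample.
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

s = torch.randn(16, 128, requires_grad=True)
t = torch.randn(16, 128)
y = torch.randint(0, 7, (16,))
class_guided_contrastive_distill(s, t, y).backward()
```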

【3】 Urban Tree Species Classification Using Aerial Imagery

Authors: Emily Waters, Mahdi Maktabdar Oghaz, Lakshmi Babu Saheer
Affiliations: Anglia Ruskin University
Note: International Conference on Machine Learning (ICML 2021), Workshop on Tackling Climate Change with Machine Learning
Link: https://arxiv.org/abs/2107.03182
Abstract: Urban trees help regulate temperature, reduce energy consumption, improve urban air quality, reduce wind speeds, and mitigate the urban heat island effect. Urban trees also play a key role in climate change mitigation and global warming by capturing and storing atmospheric carbon dioxide, the largest contributor to greenhouse gases. Automated tree detection and species classification using aerial imagery can be a powerful tool for sustainable forest and urban tree management. Hence, this study first offers a pipeline for generating a labelled dataset of urban trees using Google Maps aerial images, and then investigates how state-of-the-art deep Convolutional Neural Network models such as VGG and ResNet handle the classification of urban tree aerial images under different parameters. Experimental results show our best model achieves an average accuracy of 60% over 6 tree species.

【4】 Action Units Recognition Using Improved Pairwise Deep Architecture

Authors: Junya Saito, Xiaoyu Mi, Akiyoshi Uchida, Sachihiro Youoku, Takahisa Yamamoto, Kentaro Murase
Affiliations: Advanced Converging Technologies Laboratories, Fujitsu Research, Fujitsu Limited, Kanagawa, Japan
Link: https://arxiv.org/abs/2107.03143
Abstract: Facial Action Units (AUs) represent a set of facial muscular activities, and various combinations of AUs can represent a wide range of emotions. AU recognition is used in many applications, including marketing, healthcare and education. Although many studies have developed various methods to improve recognition accuracy, AU recognition still remains a major challenge. In the Affective Behavior Analysis in-the-wild (ABAW) 2020 competition, we proposed a new automatic Action Units (AUs) recognition method using a pairwise deep architecture to derive the pseudo-intensities of each AU and then convert them into predicted intensities. This year, we introduced a new technique into last year's framework to further reduce AU recognition errors caused by temporary face occlusion, such as face hiding or large face orientation. We obtained a score of 0.65 on the validation data set for this year's competition.

【5】 Rotation Transformation Network: Learning View-Invariant Point Cloud for Classification and Segmentation

Authors: Shuang Deng, Bo Liu, Qiulei Dong, Zhanyi Hu
Affiliations: National Laboratory of Pattern Recognition, Institute of Automation; School of Artificial Intelligence, University of Chinese Academy of Sciences; Center for Excellence in Brain Science and Intelligence Technology
Link: https://arxiv.org/abs/2107.03105
Abstract: Many recent works show that a spatial manipulation module could boost the performance of deep neural networks (DNNs) for 3D point cloud analysis. In this paper, we aim to provide an insight into spatial manipulation modules. Firstly, we find that the smaller the rotational degree of freedom (RDF) of objects is, the more easily these objects are handled by these DNNs. Then, we investigate the effect of the popular T-Net module and find that it cannot reduce the RDF of objects. Motivated by these two issues, we propose a rotation transformation network for point cloud analysis, called RTN, which can reduce the RDF of input 3D objects to 0. The RTN can be seamlessly inserted into many existing DNNs for point cloud analysis. Extensive experimental results on 3D point cloud classification and segmentation tasks demonstrate that the proposed RTN can significantly improve the performance of several state-of-the-art methods.
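
A minimal module in the spirit of RTN (not the official implementation; the tiny PointNet regressor and the 6D rotation parameterization are my assumptions): regress a rotation from the cloud's global feature and apply it, canonicalizing the pose before downstream analysis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RTNSketch(nn.Module):
    """Illustrative rotation-canonicalizing module (not the official RTN)."""
    def __init__(self):
        super().__init__()
        self.feat = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))
        self.head = nn.Linear(128, 6)            # 6D rotation representation

    def forward(self, pts):                      # pts: (B, N, 3)
        g = self.feat(pts).max(dim=1).values     # permutation-invariant global feature
        a, b = self.head(g).chunk(2, dim=-1)     # two raw 3-vectors
        x = F.normalize(a, dim=-1)               # Gram-Schmidt orthonormalization
        y = F.normalize(b - (x * b).sum(-1, keepdim=True) * x, dim=-1)
        z = torch.cross(x, y, dim=-1)
        R = torch.stack([x, y, z], dim=-2)       # (B, 3, 3), det(R) = +1
        return pts @ R.transpose(1, 2)           # rotated (canonicalized) points

print(RTNSketch()(torch.randn(2, 1024, 3)).shape)  # torch.Size([2, 1024, 3])
```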

【6】 Group Sampling for Unsupervised Person Re-identification

Authors: Xumeng Han, Xuehui Yu, Nan Jiang, Guorong Li, Jian Zhao, Qixiang Ye, Zhenjun Han
Affiliations: University of Chinese Academy of Sciences; Institute of North Electronic Equipment
Link: https://arxiv.org/abs/2107.03024
Abstract: Unsupervised person re-identification (re-ID) remains a challenging task, where the classifier and feature representation can easily be misled by noisy pseudo labels towards deteriorated overfitting. In this paper, we propose a simple yet effective approach, termed Group Sampling, to alleviate the negative impact of noisy pseudo labels within unsupervised person re-ID models. The idea behind Group Sampling is to gather a group of samples from the same class into the same mini-batch, such that the model is trained upon group-normalized samples while alleviating the effect of a single sample. Group Sampling updates the pipeline of pseudo-label generation by guaranteeing that samples are better divided into the correct classes. It regularizes classifier training and representation learning, leading to statistical stability of the feature representation in a progressive fashion. Qualitative and quantitative experiments on Market-1501, DukeMTMC-reID, and MSMT17 show that Group Sampling improves the state of the art by 2.2%-6.1%. Code is available at https://github.com/wavinflaghxm/GroupSampling.
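
The sampling idea itself can be sketched as a PyTorch batch sampler. The version below is illustrative; the group size, batch size, and sampling with replacement are assumptions, not the paper's exact settings.

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class GroupSampler(Sampler):
    """Yield mini-batches built from groups of same-(pseudo-)class samples.

    Each batch holds batch_size // group_size groups, each group drawn from
    one pseudo-class, so a noisy single sample is averaged against its group.
    """
    def __init__(self, pseudo_labels, batch_size=64, group_size=4):
        self.by_class = defaultdict(list)
        for idx, lab in enumerate(pseudo_labels):
            self.by_class[lab].append(idx)
        self.batch_size, self.group_size = batch_size, group_size
        self.n_batches = len(pseudo_labels) // batch_size

    def __iter__(self):
        classes = list(self.by_class)
        for _ in range(self.n_batches):
            batch = []
            for c in random.sample(classes, self.batch_size // self.group_size):
                # with-replacement draw keeps small clusters usable
                batch += random.choices(self.by_class[c], k=self.group_size)
            yield batch

    def __len__(self):
        return self.n_batches

# Usage: DataLoader(dataset, batch_sampler=GroupSampler(pseudo_labels))
labels = [random.randrange(50) for _ in range(5000)]
print(len(next(iter(GroupSampler(labels)))))  # 64 indices per mini-batch
```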

【7】 E-PixelHop: An Enhanced PixelHop Method for Object Classification

Authors: Yijing Yang, Vasileios Magoulianitis, C.-C. Jay Kuo
Affiliations: University of Southern California, Los Angeles, California, USA
Note: 12 pages, 7 figures
Link: https://arxiv.org/abs/2107.02966
Abstract: Based on PixelHop and PixelHop++, which were recently developed using the successive subspace learning (SSL) framework, we propose an enhanced solution for object classification, called E-PixelHop, in this work. E-PixelHop consists of the following steps. First, to decouple the color channels of a color image, we apply principal component analysis to project the RGB color channels onto two principal subspaces, which are processed separately for classification. Second, to address the importance of multi-scale features, we conduct pixel-level classification at each hop with various receptive fields. Third, to further improve pixel-level classification accuracy, we develop a supervised label smoothing (SLS) scheme to ensure prediction consistency. Fourth, pixel-level decisions from each hop and from each color subspace are fused together for an image-level decision. Fifth, to resolve confusing classes for further performance boosting, we formulate E-PixelHop as a two-stage pipeline. In the first stage, multi-class classification is performed to get a soft decision for each class, where the top 2 classes with the highest probabilities are called confusing classes. Then, we conduct a binary classification in the second stage. The main contributions lie in Steps 1, 3 and 5. We use the classification of the CIFAR-10 dataset as an example to demonstrate the effectiveness of the above-mentioned key components of E-PixelHop.

【8】 Deep Learning based Micro-expression Recognition: A Survey

Authors: Yante Li, Jinsheng Wei, Seyednavid Mohammadifoumani, Yang Liu, Guoying Zhao
Affiliations: University of Oulu; School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications; School of Computer Science and Engineering, South China University of Technology
Note: 23 pages, 12 figures
Link: https://arxiv.org/abs/2107.02823
Abstract: Micro-expressions (MEs) are involuntary facial movements revealing people's hidden feelings in high-stake situations, and have practical importance in medical treatment, national security, interrogations and many human-computer interaction systems. Early methods for micro-expression recognition (MER) were mainly based on traditional appearance and geometry features. Recently, with the success of deep learning (DL) in various fields, neural networks have received increasing interest in MER. Different from macro-expressions, MEs are spontaneous, subtle, and rapid facial movements, which make data collection difficult and datasets small-scale. DL-based MER thus becomes challenging due to these ME characteristics. To date, various DL approaches have been proposed to solve these issues and improve MER performance. In this survey, we provide a comprehensive review of deep MER, including datasets, the deep MER pipeline, and benchmarking of the most influential methods. This survey defines a new taxonomy for the field, encompassing all aspects of MER based on DL. For each aspect, the basic approaches and advanced developments are summarized and discussed. In addition, we conclude with the remaining challenges and potential directions for the design of robust deep MER systems. To the best of our knowledge, this is the first survey of deep MER methods, and it can serve as a reference point for future MER research.

【9】 GAN-based Data Augmentation for Chest X-ray Classification

Authors: Shobhita Sundaram, Neha Hulkund
Affiliations: Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Note: Spotlight Talk at KDD 2021 - Applied Data Science for Healthcare Workshop
Link: https://arxiv.org/abs/2107.02970
Abstract: A common problem in computer vision -- particularly in medical applications -- is a lack of sufficiently diverse, large sets of training data. These datasets often suffer from severe class imbalance. As a result, networks often overfit and are unable to generalize to novel examples. Generative Adversarial Networks (GANs) offer a novel method of synthetic data augmentation. In this work, we evaluate the use of GAN-based data augmentation to artificially expand the CheXpert dataset of chest radiographs. We compare performance to traditional augmentation and find that GAN-based augmentation leads to higher downstream performance for underrepresented classes. Furthermore, we see that this result is pronounced in low-data regimes. This suggests that GAN-based augmentation is a promising area of research for improving network performance when data collection is prohibitively expensive.

Segmentation & Semantics (8 papers)

【1】 Hierarchical Semantic Segmentation using Psychometric Learning

Authors: Lu Yin, Vlado Menkovski, Shiwei Liu, Mykola Pechenizkiy
Affiliations: Eindhoven University of Technology, Eindhoven, Netherlands
Note: 17 pages, 12 figures
Link: https://arxiv.org/abs/2107.03212
Abstract: Assigning meaning to parts of image data is the goal of semantic image segmentation. Machine learning methods, specifically supervised learning, are commonly used in a variety of tasks formulated as semantic segmentation. One of the major challenges in supervised learning approaches is expressing and collecting the rich knowledge that experts have about the meaning present in the image data. Towards this, typically a fixed set of labels is specified and experts are tasked with annotating the pixels, patches or segments in the images with the given labels. In general, however, the set of classes does not fully capture the rich semantic information present in the images. For example, in medical imaging such as histology images, the different parts of cells could be grouped and sub-grouped based on the expertise of the pathologist. To achieve such a precise semantic representation of the concepts in the image, we need access to the full depth of knowledge of the annotator. In this work, we develop a novel approach to collect segmentation annotations from experts based on psychometric testing. Our method consists of a psychometric testing procedure, active query selection, query enhancement, and a deep metric learning model, to achieve a patch-level image embedding that allows for semantic segmentation of images. We show the merits of our method with evaluation on synthetically generated images, aerial images and histology images.

【2】 HIDA: Towards Holistic Indoor Understanding for the Visually Impaired via Semantic Instance Segmentation with a Wearable Solid-State LiDAR Sensor

Authors: Huayao Liu, Ruiping Liu, Kailun Yang, Jiaming Zhang, Kunyu Peng, Rainer Stiefelhagen
Affiliations: Karlsruhe Institute of Technology
Note: 10 figures, 5 tables
Link: https://arxiv.org/abs/2107.03180
Abstract: Independently exploring unknown spaces or finding objects in an indoor environment is a daily but challenging task for visually impaired people. However, common 2D assistive systems lack depth relationships between various objects, resulting in difficulty obtaining an accurate spatial layout and the relative positions of objects. To tackle these issues, we propose HIDA, a lightweight assistive system based on 3D point cloud instance segmentation with a solid-state LiDAR sensor, for holistic indoor detection and avoidance. The entire system consists of three hardware components, two interactive functions (obstacle avoidance and object finding) and a voice user interface. Based on voice guidance, the point cloud from the most recent state of the changing indoor environment is captured through an on-site scan performed by the user. In addition, we design a point cloud segmentation model with dual lightweight decoders for semantic and offset predictions, which satisfies the efficiency requirements of the whole system. After 3D instance segmentation, we post-process the segmented point cloud by removing outliers and projecting all points onto a top-view 2D map representation. The system integrates the information above and interacts with users intuitively by acoustic feedback. The proposed 3D instance segmentation model achieves state-of-the-art performance on the ScanNet v2 dataset. Comprehensive field tests with various tasks in a user study verify the usability and effectiveness of our system for assisting visually impaired people in holistic indoor understanding, obstacle avoidance and object search.

【3】 GA-NET: Global Attention Network for Point Cloud Semantic Segmentation

Authors: Shuang Deng, Qiulei Dong
Affiliations: Institute of Automation; School of Artificial Intelligence, University of Chinese Academy of Sciences
Link: https://arxiv.org/abs/2107.03101
Abstract: How to learn long-range dependencies from 3D point clouds is a challenging problem in 3D point cloud analysis. Addressing this problem, we propose a global attention network for point cloud semantic segmentation, named GA-Net, consisting of a point-independent global attention module and a point-dependent global attention module for obtaining contextual information of 3D point clouds. The point-independent global attention module simply shares a global attention map among all 3D points. In the point-dependent global attention module, for each point, a novel random cross attention block using only two randomly sampled subsets is exploited to learn the contextual information of all the points. Additionally, we design a novel point-adaptive aggregation block to replace the linear skip connection, aggregating more discriminative features. Extensive experimental results on three 3D public datasets demonstrate that our method outperforms state-of-the-art methods in most cases.
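
A hedged sketch of the random cross attention idea (the subset size and the residual update are assumptions, not the paper's exact block): queries from one random subset attend to keys/values from another random subset, approximating global context well below O(N^2) cost.

```python
import torch
import torch.nn as nn

class RandomCrossAttention(nn.Module):
    """Sketch of cross attention over two random point subsets (assumed form)."""
    def __init__(self, dim=64, n_heads=4, subset=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.subset = subset

    def forward(self, feats):                       # feats: (B, N, dim)
        B, N, _ = feats.shape
        idx_q = torch.randperm(N)[:self.subset]     # random query subset
        idx_kv = torch.randperm(N)[:self.subset]    # random key/value subset
        q, kv = feats[:, idx_q], feats[:, idx_kv]
        ctx, _ = self.attn(q, kv, kv)               # (B, subset, dim)
        out = feats.clone()
        out[:, idx_q] = out[:, idx_q] + ctx         # residual update on subset
        return out

print(RandomCrossAttention()(torch.randn(2, 4096, 64)).shape)
```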

【4】 WeClick: Weakly-Supervised Video Semantic Segmentation with Click Annotations

Authors: Peidong Liu, Zibin He, Xiyu Yan, Yong Jiang, Shutao Xia, Feng Zheng, Maowei Hu
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; PCL Research Center of Networks and Communications, Peng Cheng Laboratory; Department of Computer Science and Engineering, Southern University of Science and Technology
Note: Accepted by ACM MM 2021
Link: https://arxiv.org/abs/2107.03088
Abstract: Compared with tedious per-pixel mask annotation, it is much easier to annotate data by clicks, which costs only several seconds per image. However, applying clicks to learn video semantic segmentation models has not been explored before. In this work, we propose an effective weakly-supervised video semantic segmentation pipeline with click annotations, called WeClick, which saves laborious annotation effort by segmenting an instance of a semantic class with only a single click. Since detailed semantic information is not captured by clicks, directly training with click labels leads to poor segmentation predictions. To mitigate this problem, we design a novel memory flow knowledge distillation strategy to exploit temporal information (named memory flow) in abundant unlabeled video frames, by distilling the neighboring predictions to the target frame via estimated motion. Moreover, we adopt vanilla knowledge distillation for model compression. In this case, WeClick learns compact video semantic segmentation models with low-cost click annotations during the training phase, yet achieves real-time and accurate models during inference. Experimental results on Cityscapes and CamVid show that WeClick outperforms state-of-the-art methods, increases performance by 10.24% mIoU over the baseline, and achieves real-time execution.
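
A sketch of the memory-flow distillation step under simplifying assumptions (flow given in pixels and a plain temperature-scaled KL loss; the paper's flow estimator and loss may differ): the neighboring frame's prediction is warped to the target frame with `grid_sample` and used as a soft teacher.

```python
import torch
import torch.nn.functional as F

def warp(logits, flow):
    """Warp per-pixel logits from a neighboring frame to the target frame.

    logits: (B, C, H, W); flow: (B, 2, H, W) in pixels (target -> neighbor).
    """
    B, _, H, W = logits.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(logits.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    coords[:, 0] = 2 * coords[:, 0] / (W - 1) - 1
    coords[:, 1] = 2 * coords[:, 1] / (H - 1) - 1
    return F.grid_sample(logits, coords.permute(0, 2, 3, 1), align_corners=True)

def memory_flow_distill(student_logits, neighbor_logits, flow, T=2.0):
    """Distill a flow-warped neighboring prediction into the target frame."""
    teacher = warp(neighbor_logits, flow).detach()
    p = F.log_softmax(student_logits / T, dim=1)
    q = F.softmax(teacher / T, dim=1)
    return F.kl_div(p, q, reduction="batchmean") * T * T

s = torch.randn(1, 19, 64, 128, requires_grad=True)
n = torch.randn(1, 19, 64, 128)
fl = torch.zeros(1, 2, 64, 128)   # zero flow -> identity warp, for the demo
memory_flow_distill(s, n, fl).backward()
```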

【5】 Learning Stixel-based Instance Segmentation

Authors: Monty Santarossa, Lukas Schneider, Claudius Zelenka, Lars Schmarje, Reinhard Koch, Uwe Franke
Affiliations: Kiel University
Note: Accepted for publication in the IEEE Intelligent Vehicles Symposium
Link: https://arxiv.org/abs/2107.03070
Abstract: Stixels have been successfully applied to a wide range of vision tasks in autonomous driving, recently including instance segmentation. However, due to their sparse occurrence in the image, Stixels have until now seldom served as input for deep learning algorithms, restricting their utility for such approaches. In this work we present StixelPointNet, a novel method to perform fast instance segmentation directly on Stixels. By regarding the Stixel representation as unstructured data similar to point clouds, architectures like PointNet are able to learn features from Stixels. We use a bounding box detector to propose candidate instances, for which the relevant Stixels are extracted from the input image. On these Stixels, a PointNet model learns binary segmentations, which we then unify throughout the whole image in a final selection step. StixelPointNet achieves state-of-the-art performance at the Stixel level, is considerably faster than pixel-based segmentation methods, and shows that with our approach the Stixel domain can be introduced to many new 3D deep learning tasks.

【6】 Learn to Learn Metric Space for Few-Shot Segmentation of 3D Shapes

Authors: Xiang Li, Lingjing Wang, Yi Fang
Affiliations: New York University Abu Dhabi
Note: 14 pages
Link: https://arxiv.org/abs/2107.02972
Abstract: Recent research has seen numerous supervised learning-based methods for 3D shape segmentation, and remarkable performance has been achieved on various benchmark datasets. These supervised methods require a large amount of annotated data to train deep neural networks and ensure generalization to the unseen test set. In this paper, we introduce a meta-learning-based method for few-shot 3D shape segmentation where only a few labeled samples are provided for the unseen classes. To achieve this, we treat shape segmentation as a point labeling problem in metric space. Specifically, we first design a meta-metric learner to transform input shapes into an embedding space, and our model learns to learn a proper metric space for each object class based on point embeddings. Then, for each class, we design a metric learner to extract part-specific prototype representations from a few support shapes, and our model performs per-point segmentation over the query shapes by matching each point to its nearest prototype in the learned metric space. A metric-based loss function is used to dynamically modify distances between point embeddings, maximizing in-part similarity while minimizing inter-part similarity. A dual segmentation branch is adopted to make full use of the support information and implicitly encourage consistency between the support and query prototypes. We demonstrate the superior performance of our proposed method on the ShapeNet part dataset under the few-shot scenario, compared with well-established baselines and state-of-the-art semi-supervised methods.
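
The prototype-matching core can be sketched compactly (an assumed ProtoNet-style formulation over per-point embeddings; the paper's full model, with its meta-metric learner and dual branch, is richer):

```python
import torch

def prototype_segment(support_feats, support_labels, query_feats, n_parts):
    """Metric-based per-point part labeling (assumed ProtoNet-style form).

    support_feats: (S, D) point embeddings from a few support shapes;
    support_labels: (S,) part ids; query_feats: (Q, D). Each query point is
    scored against the nearest part prototype in the embedding space.
    """
    protos = torch.stack([support_feats[support_labels == p].mean(0)
                          for p in range(n_parts)])    # (P, D) part prototypes
    dists = torch.cdist(query_feats, protos)           # (Q, P) distances
    return (-dists).softmax(dim=1)                     # soft part scores

sup = torch.randn(300, 64)
sup_y = torch.randint(0, 4, (300,))
qry = torch.randn(500, 64)
scores = prototype_segment(sup, sup_y, qry, n_parts=4)
print(scores.argmax(1).shape)   # torch.Size([500]): predicted part per point
```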

【7】 AGD-Autoencoder: Attention Gated Deep Convolutional Autoencoder for Brain Tumor Segmentation

Authors: Tim Cvetko
Note: 8 pages, 2 figures
Link: https://arxiv.org/abs/2107.03323
Abstract: Brain tumor segmentation is a challenging problem in medical image analysis. The goal is to generate salient masks that accurately identify brain tumor regions in an fMRI screening. In this paper, we propose a novel attention gate (AG) model for brain tumor segmentation that utilizes both an edge detecting unit and an attention gated network to highlight and segment the salient regions from fMRI images. This enables us to eliminate the necessity of explicitly pointing to the damaged area (external tissue localization) and classifying it, as in classical computer vision techniques. AGs can easily be integrated within deep convolutional neural networks (CNNs). Minimal computational overhead is required, while the AGs increase sensitivity scores significantly. We show that the edge detector along with an attention gated mechanism provides a sufficient method for brain segmentation, reaching an IoU of 0.78.
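
A minimal additive attention gate in the style of Attention U-Net (a common AG formulation that may differ from this paper's details; for simplicity the skip and gating features are assumed to share one resolution):

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate: a gating signal weights the skip features."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.wx = nn.Conv2d(x_ch, inter_ch, 1)   # project skip features
        self.wg = nn.Conv2d(g_ch, inter_ch, 1)   # project gating signal
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, x, g):
        # x: (B, x_ch, H, W) skip connection; g: (B, g_ch, H, W) gating signal
        alpha = self.psi(torch.relu(self.wx(x) + self.wg(g)))  # (B, 1, H, W)
        return x * alpha   # only salient regions pass to the decoder

x = torch.randn(2, 64, 32, 32)
g = torch.randn(2, 128, 32, 32)
print(AttentionGate(64, 128, 32)(x, g).shape)  # torch.Size([2, 64, 32, 32])
```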

【8】 Image Complexity Guided Network Compression for Biomedical Image Segmentation

Authors: Suraj Mishra, Danny Z. Chen, X. Sharon Hu
Affiliations: Department of Computer Science and Engineering, University of Notre Dame
Note: ACM JETC
Link: https://arxiv.org/abs/2107.02927
Abstract: Compression is a standard procedure for making convolutional neural networks (CNNs) adhere to specific computing resource constraints. However, searching for a compressed architecture typically involves a series of time-consuming training/validation experiments to determine a good compromise between network size and performance accuracy. To address this, we propose an image complexity-guided network compression technique for biomedical image segmentation. Given any resource constraints, our framework utilizes data complexity and network architecture to quickly estimate a compressed model without requiring network training. Specifically, we map the dataset complexity to the target network accuracy degradation caused by compression. Such mapping enables us to predict the final accuracy for different network sizes, based on the computed dataset complexity. Thus, one may choose a solution that meets both the network size and segmentation accuracy requirements. Finally, the mapping is used to determine the convolutional layer-wise multiplicative factor for generating a compressed network. We conduct experiments using 5 datasets, employing 3 commonly-used CNN architectures for biomedical image segmentation as representative networks. Our proposed framework is shown to be effective for generating compressed segmentation networks, retaining up to ~95% of the full-sized network segmentation accuracy while utilizing ~32x fewer trainable weights (average reduction) than the full-sized networks.

Zero/Few-Shot, Transfer & Domain Adaptation (4 papers)

【1】 Differentiable Architecture Pruning for Transfer Learning

Authors: Nicolo Colombo, Yang Gao
Affiliations: Department of Computer Science, Royal Holloway, University of London, Egham, UK
Note: 19 pages (main + appendix), 7 figures and 1 table; Workshop @ ICML 2021, 24th July 2021
Link: https://arxiv.org/abs/2107.03375
Abstract: We propose a new gradient-based approach for extracting sub-architectures from a given large model. Contrary to existing pruning methods, which are unable to disentangle the network architecture and the corresponding weights, our architecture-pruning scheme produces transferable new structures that can be successfully retrained to solve different tasks. We focus on a transfer-learning setup where architectures can be trained on a large data set but very few data points are available for fine-tuning them on new tasks. We define a new gradient-based algorithm that trains architectures of arbitrarily low complexity independently from the attached weights. Given a search space defined by an existing large neural model, we reformulate the architecture search task as a complexity-penalized subset-selection problem and solve it through a two-temperature relaxation scheme. We provide theoretical convergence guarantees and validate the proposed transfer-learning strategy on real data.

【2】 Mitigating Generation Shifts for Generalized Zero-Shot Learning

Authors: Zhi Chen, Yadan Luo, Sen Wang, Ruihong Qiu, Jingjing Li, Zi Huang
Affiliations: The University of Queensland; University of Electronic Science and Technology of China
Note: ACM Multimedia 2021
Link: https://arxiv.org/abs/2107.03163
Abstract: Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize seen and unseen samples, where unseen classes are not observable during training. It is natural to derive generative models and hallucinate training samples for unseen classes based on the knowledge learned from the seen samples. However, most of these models suffer from 'generation shifts', where the synthesized samples may drift from the real distribution of unseen data. In this paper, we conduct an in-depth analysis of this issue and propose a novel Generation Shifts Mitigating Flow (GSMFlow) framework, which is comprised of multiple conditional affine coupling layers for learning unseen data synthesis efficiently and effectively. In particular, we identify three potential problems that trigger generation shifts, i.e., semantic inconsistency, variance decay, and structural permutation, and address them respectively. First, to reinforce the correlations between the generated samples and the respective attributes, we explicitly embed the semantic information into the transformations in each of the coupling layers. Second, to recover the intrinsic variance of the synthesized unseen features, we introduce a visual perturbation strategy to diversify the intra-class variance of generated data and thereby help adjust the decision boundary of the classifier. Third, to avoid structural permutation in the semantic space, we propose a relative positioning strategy to manipulate the attribute embeddings, guiding them to fully preserve the inter-class geometric structure. Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings. Our code is available at: https://github.com/uqzhichen/GSMFlow
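
One conditional affine coupling layer, with the attribute vector fed into the scale/shift network as the abstract describes, is easy to sketch. Feature and attribute dimensions below follow common zero-shot-learning conventions and are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """One conditional affine coupling layer (illustrative, RealNVP-style).

    Half of the feature vector is transformed with a scale/shift predicted
    from the other half AND the semantic attribute vector, so attributes are
    explicitly embedded into every transformation.
    """
    def __init__(self, dim=2048, attr_dim=312, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2 + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))   # outputs [log_scale, shift]

    def forward(self, x, attr):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([x1, attr], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)     # bounded scales keep the layer stable
        y2 = x2 * log_s.exp() + t
        logdet = log_s.sum(dim=1)     # contribution to the flow log-likelihood
        return torch.cat([x1, y2], dim=1), logdet

    def inverse(self, y, attr):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([y1, attr], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        return torch.cat([y1, (y2 - t) * (-log_s).exp()], dim=1)

layer = ConditionalAffineCoupling()
x, a = torch.randn(4, 2048), torch.randn(4, 312)
y, _ = layer(x, a)
print(torch.allclose(layer.inverse(y, a), x, atol=1e-5))  # True: invertible
```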

【3】 Learning Invariant Representation with Consistency and Diversity for Semi-supervised Source Hypothesis Transfer 标题:面向半监督源假设迁移的一致性与多样性不变表示学习

作者:Xiaodong Wang,Junbao Zhuo,Shuhao Cui,Shuhui Wang 机构:School of Software and Microelectronics, Peking, University, Institute of Computing Technology, Chinese Academy of, Science 备注:10 pages, 4 figures 链接:https://arxiv.org/abs/2107.03008 摘要:半监督域自适应(SSDA)旨在利用从源域学习到的可迁移信息和少量有标记的目标数据来解决目标域中的任务。然而,源数据在实际场景中并不总是可访问的,这限制了SSDA在现实环境中的应用。本文提出了一个名为半监督源假设迁移(SSHT)的新任务,它基于源域训练得到的模型进行域自适应,只需少量监督即可在目标域上获得良好的泛化。在SSHT中,我们面临两个挑战:(1)有标记的目标数据不足,可能导致目标特征靠近决策边界,增加误分类的风险;(2)源域中的数据通常是不平衡的,用这些数据训练出的模型是有偏的。有偏的模型容易把少数类别的样本划入多数类别,导致预测多样性较低。为了解决上述问题,我们提出了一致性和多样性学习(CDL),这是一个简单而有效的SSHT框架,通过促进两次随机增广的未标记数据之间的预测一致性,并在模型适应目标域时保持预测的多样性。一致性正则化使模型难以死记少量有标记的目标数据,从而提高了所学模型的泛化能力。我们进一步将批量核范数最大化融入到方法中,以增强可辨别性和多样性。实验结果表明,在DomainNet、Office-Home和Office-31数据集上,该方法优于现有的SSDA方法和无监督模型自适应方法。代码可在 https://github.com/Wang-xd1899/SSHT 获取。 摘要:Semi-supervised domain adaptation (SSDA) aims to solve tasks in target domain by utilizing transferable information learned from the available source domain and a few labeled target data. However, source data is not always accessible in practical scenarios, which restricts the application of SSDA in real world circumstances. In this paper, we propose a novel task named Semi-supervised Source Hypothesis Transfer (SSHT), which performs domain adaptation based on source trained model, to generalize well in target domain with a few supervisions. In SSHT, we are facing two challenges: (1) The insufficient labeled target data may result in target features near the decision boundary, with the increased risk of mis-classification; (2) The data are usually imbalanced in source domain, so the model trained with these data is biased. The biased model is prone to categorize samples of minority categories into majority ones, resulting in low prediction diversity. To tackle the above issues, we propose Consistency and Diversity Learning (CDL), a simple but effective framework for SSHT by facilitating prediction consistency between two randomly augmented unlabeled data and maintaining the prediction diversity when adapting model to target domain. Encouraging consistency regularization brings difficulty to memorize the few labeled target data and thus enhances the generalization ability of the learned model. We further integrate Batch Nuclear-norm Maximization into our method to enhance the discriminability and diversity. Experimental results show that our method outperforms existing SSDA methods and unsupervised model adaptation methods on DomainNet, Office-Home and Office-31 datasets. The code is available at https://github.com/Wang-xd1899/SSHT.
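以下是对CDL无标记损失的一个概念性示意(PyTorch):两次随机增广之间的预测一致性,加上批量核范数最大化(BNM)。具体的一致性度量与权重系数为基于摘要的假设。

```python
import torch
import torch.nn.functional as F

def cdl_unlabeled_loss(logits_a, logits_b, lam_bnm=1.0):
    """两次随机增广间的一致性(对称 KL)+ 批量核范数最大化(BNM)。"""
    p_a, p_b = F.softmax(logits_a, dim=1), F.softmax(logits_b, dim=1)
    consistency = 0.5 * (F.kl_div(p_a.log(), p_b, reduction='batchmean')
                         + F.kl_div(p_b.log(), p_a, reduction='batchmean'))
    # BNM:最大化批预测矩阵的核范数以同时提升可辨别性与多样性,取负作为损失
    bnm = -torch.norm(p_a, p='nuc') / p_a.shape[0]
    return consistency + lam_bnm * bnm

# 同一批未标注图像经两种随机增广后得到的分类 logits(此处随机模拟)
logits_a, logits_b = torch.randn(32, 65), torch.randn(32, 65)
print(cdl_unlabeled_loss(logits_a, logits_b))
```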

【4】 A Deep Residual Star Generative Adversarial Network for multi-domain Image Super-Resolution 标题:一种用于多域图像超分辨率的深度残差星生成对抗网络

作者:Rao Muhammad Umer,Asad Munir,Christian Micheloni 机构:Dept. of Computer Science, University of Udine, Udine, Italy 备注:5 pages, 6th International Conference on Smart and Sustainable Technologies 2021. arXiv admin note: text overlap with arXiv:2009.03693, arXiv:2005.00953 链接:https://arxiv.org/abs/2107.03145 摘要:近年来,利用深度卷积神经网络(DCNNs)实现的最新单图像超分辨率(SISR)方法取得了令人瞩目的性能。现有的SR方法由于固定的退化设置(即通常是低分辨率(LR)图像的双三次缩小)而性能有限。然而,在现实环境中,LR退化过程是未知的,可以是双三次LR、双线性LR、最近邻LR或真实LR。因此,大多数SR方法在处理单个网络中的多个降级设置时是无效和低效的。为了解决多重退化问题,即多域图像的超分辨率问题,我们提出了一种深度超分辨率残差StarGAN(SR2*GAN)算法,该算法只需一个模型就可以对多个LR域的LR图像进行超分辨率处理。该方案是在一个类似StarGAN的网络拓扑结构中训练的,该拓扑结构由一个生成器和一个鉴别器网络组成。与其他先进的方法相比,我们在定量和定性实验中证明了我们提出的方法的有效性。 摘要:Recently, most of state-of-the-art single image super-resolution (SISR) methods have attained impressive performance by using deep convolutional neural networks (DCNNs). The existing SR methods have limited performance due to a fixed degradation settings, i.e. usually a bicubic downscaling of low-resolution (LR) image. However, in real-world settings, the LR degradation process is unknown which can be bicubic LR, bilinear LR, nearest-neighbor LR, or real LR. Therefore, most SR methods are ineffective and inefficient in handling more than one degradation settings within a single network. To handle the multiple degradation, i.e. refers to multi-domain image super-resolution, we propose a deep Super-Resolution Residual StarGAN (SR2*GAN), a novel and scalable approach that super-resolves the LR images for the multiple LR domains using only a single model. The proposed scheme is trained in a StarGAN like network topology with a single generator and discriminator networks. We demonstrate the effectiveness of our proposed approach in quantitative and qualitative experiments compared to other state-of-the-art methods.

半弱无监督|主动学习|不确定性(2篇)

【1】 Self-supervised Outdoor Scene Relighting 标题:自监督的室外场景重光照

作者:Ye Yu,Abhimitra Meka,Mohamed Elgharib,Hans-Peter Seidel,Christian Theobalt,William A. P. Smith 机构: University of York, United Kingdom, Max Planck Institute for Informatics, Saarland Informatics Campus, Germany 备注:Published in ECCV '20, this http URL 链接:https://arxiv.org/abs/2107.03106 摘要:室外场景重光照是一个具有挑战性的问题,需要很好地理解场景的几何、光照和反照率。当前的技术是完全监督的,需要高质量的合成渲染来训练解决方案;这种渲染是利用从有限数据中学到的先验合成的。相比之下,我们提出了一种自监督的重光照方法。我们的方法只在从互联网上收集的图片语料上训练,不需要任何用户监督。这种几乎无穷无尽的训练数据源允许训练一个通用的重光照方案。我们的方法首先将图像分解为反照率、几何结构和光照,然后通过修改光照参数产生新的重光照结果。我们的方案使用专门的阴影预测图来捕捉阴影,且不依赖精确的几何估计。我们使用一个带有重光照真值的新数据集,对该技术进行了主观和客观评估。结果表明,我们的技术能够产生照片级真实且物理上合理的结果,并能推广到未见过的场景。 摘要:Outdoor scene relighting is a challenging problem that requires good understanding of the scene geometry, illumination and albedo. Current techniques are completely supervised, requiring high quality synthetic renderings to train a solution. Such renderings are synthesized using priors learned from limited data. In contrast, we propose a self-supervised approach for relighting. Our approach is trained only on corpora of images collected from the internet without any user-supervision. This virtually endless source of training data allows training a general relighting solution. Our approach first decomposes an image into its albedo, geometry and illumination. A novel relighting is then produced by modifying the illumination parameters. Our solution capture shadow using a dedicated shadow prediction map, and does not rely on accurate geometry estimation. We evaluate our technique subjectively and objectively using a new dataset with ground-truth relighting. Results show the ability of our technique to produce photo-realistic and physically plausible results, that generalizes to unseen scenes.

【2】 Deep Mesh Prior: Unsupervised Mesh Restoration using Graph Convolutional Networks 标题:深度网格先验:基于图卷积网络的无监督网格恢复

作者:Shota Hattori,Tatsuya Yatagawa,Yutaka Ohtake,Hiromasa Suzuki 机构:School of Engineering, The University of Tokyo 备注:10 pages, 9 figures and 2 tables 链接:https://arxiv.org/abs/2107.02909 摘要:本文通过以无监督方式学习自相似性,解决了网格恢复问题,即去噪和补全。为此,我们提出的方法(称为深度网格先验)在网格上使用图卷积网络来学习自相似性。该网络以单个不完整网格作为输入数据,直接输出重构后的网格,无需使用大规模数据集进行训练。我们的方法不使用任何中间表示(例如隐式场),因为整个过程都在网格上进行。我们证明了这种无监督方法的性能与使用大规模数据集的最新方法相当,甚至更好。 摘要:This paper addresses mesh restoration problems, i.e., denoising and completion, by learning self-similarity in an unsupervised manner. For this purpose, the proposed method, which we refer to as Deep Mesh Prior, uses a graph convolutional network on meshes to learn the self-similarity. The network takes a single incomplete mesh as input data and directly outputs the reconstructed mesh without being trained using large-scale datasets. Our method does not use any intermediate representations such as an implicit field because the whole process works on a mesh. We demonstrate that our unsupervised method performs equally well or even better than the state-of-the-art methods using large-scale datasets.
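下面用一个环形图上的玩具实验示意"深度先验"式的无监督网格去噪思路:图卷积网络以固定随机码为输入,拟合单个含噪网格的顶点,网络结构本身充当正则。层定义与超参数均为假设,并非论文实现。

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """最简单的图卷积:行归一化邻接聚合 + 线性变换。"""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x, adj):
        return self.lin(adj @ x)

# 玩具"网格":环形图 + 含噪顶点坐标(实际应用中来自单个待修复网格)
n = 100
idx = torch.arange(n)
adj = torch.eye(n)
adj[idx, (idx + 1) % n] = 1.0
adj[idx, (idx - 1) % n] = 1.0
adj = adj / adj.sum(1, keepdim=True)             # 行归一化

theta = torch.linspace(0, 6.2832, n).unsqueeze(1)
clean = torch.cat([theta.cos(), theta.sin(), torch.zeros_like(theta)], 1)
noisy = clean + 0.05 * torch.randn_like(clean)

code = torch.randn(n, 16)                        # 固定的随机输入码
g1, g2 = GCNLayer(16, 64), GCNLayer(64, 3)
opt = torch.optim.Adam(list(g1.parameters()) + list(g2.parameters()), lr=1e-2)
for _ in range(300):                             # 早停起到隐式正则的作用
    out = g2(torch.relu(g1(code, adj)), adj)
    loss = ((out - noisy) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print('相对干净网格的MSE:', ((out.detach() - clean) ** 2).mean().item())
```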

时序|行为识别|姿态|视频|运动估计(6篇)

【1】 Is 2D Heatmap Representation Even Necessary for Human Pose Estimation? 标题:人体姿势估计需要2D热图表示吗?

作者:Yanjie Li,Sen Yang,Shoukui Zhang,Zhicheng Wang,Wankou Yang,Shu-Tao Xia,Erjin Zhou 机构:Tsinghua University, MEGVII Technology, Southeast University, PCL Research Center of Networks and Communications, Peng Cheng Laboratory 备注:Code will be made publicly available at this https URL 链接:https://arxiv.org/abs/2107.03332 摘要:二维热图表示由于其高性能,多年来一直主导着人体姿态估计。然而,基于热图的方法有一些缺点:1)在低分辨率图像中性能急剧下降,而这种图像在现实场景中经常遇到;2)为了提高定位精度,可能需要多个上采样层将特征图分辨率由低恢复到高,计算开销很大;3)为了减小缩小后热图的量化误差,通常还需要额外的坐标细化。为了解决这些问题,我们提出了一种简单而有前景的解耦关键点坐标表示(SimDR),将人体关键点定位重新表述为分类任务。具体来说,我们提出解耦关键点位置的水平和垂直坐标表示,从而得到一个无需额外上采样和细化的更高效方案。在COCO数据集上进行的综合实验表明,所提出的无热图方法在所有测试输入分辨率上都优于基于热图的方法,尤其在较低分辨率下优势明显。代码将公开于 https://github.com/leeyegy/SimDR 。 摘要:The 2D heatmap representation has dominated human pose estimation for years due to its high performance. However, heatmap-based approaches have some drawbacks: 1) The performance drops dramatically in the low-resolution images, which are frequently encountered in real-world scenarios. 2) To improve the localization precision, multiple upsample layers may be needed to recover the feature map resolution from low to high, which are computationally expensive. 3) Extra coordinate refinement is usually necessary to reduce the quantization error of downscaled heatmaps. To address these issues, we propose a \textbf{Sim}ple yet promising \textbf{D}isentangled \textbf{R}epresentation for keypoint coordinate (\emph{SimDR}), reformulating human keypoint localization as a task of classification. In detail, we propose to disentangle the representation of horizontal and vertical coordinates for keypoint location, leading to a more efficient scheme without extra upsampling and refinement. Comprehensive experiments conducted over COCO dataset show that the proposed \emph{heatmap-free} methods outperform \emph{heatmap-based} counterparts in all tested input resolutions, especially in lower resolutions by a large margin. Code will be made publicly available at \url{https://github.com/leeyegy/SimDR}.
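下面给出SimDR式解耦坐标分类头的一个最小PyTorch示意:对每个关键点分别输出x、y两个一维分类分布,以交叉熵训练、以argmax解码。分箱系数与各维度均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimDRHead(nn.Module):
    """对每个关键点分别预测 x、y 两个一维分类分布,免去高分辨率热图。"""
    def __init__(self, feat_dim, num_kpts, img_w, img_h, split=2.0):
        super().__init__()
        self.k, self.split = num_kpts, split
        self.wx, self.wy = int(img_w * split), int(img_h * split)   # 亚像素分箱
        self.fc_x = nn.Linear(feat_dim, num_kpts * self.wx)
        self.fc_y = nn.Linear(feat_dim, num_kpts * self.wy)

    def forward(self, feat):
        b = feat.shape[0]
        return (self.fc_x(feat).view(b, self.k, self.wx),
                self.fc_y(feat).view(b, self.k, self.wy))

head = SimDRHead(feat_dim=2048, num_kpts=17, img_w=192, img_h=256)
feat = torch.randn(8, 2048)                          # 主干网络的全局特征(模拟)
logits_x, logits_y = head(feat)
gt_x = torch.randint(0, head.wx, (8, 17))            # 量化后的真值 x 坐标
loss = F.cross_entropy(logits_x.flatten(0, 1), gt_x.flatten())
pred_x = logits_x.argmax(dim=2).float() / head.split # 解码回像素坐标
```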

【2】 FasterPose: A Faster Simple Baseline for Human Pose Estimation 标题:FasterPose:一种更快、更简单的人体姿态估计基线

作者:Hanbin Dai,Hailin Shi,Wu Liu,Linfang Wang,Yinglu Liu,Tao Mei 备注:14 pages 链接:https://arxiv.org/abs/2107.03215 摘要:人体姿态估计的性能取决于关键点定位的空间精度。现有的方法大多通过学习输入图像的高分辨率(HR)表示来追求空间精度。通过实验分析,我们发现与低分辨率(LR)表示相比,HR表示带来计算量的急剧增加,而精度提升仍然有限。本文提出了一种基于LR表示、经济高效的姿态估计网络设计范式,称为FasterPose。虽然LR设计大大降低了模型复杂度,但如何针对空间精度有效地训练网络成为随之而来的挑战。我们研究了FasterPose的训练行为,并提出了一种新的回归交叉熵(RCE)损失函数,以加速收敛并提升精度。RCE损失将普通的交叉熵损失从二值监督推广到连续范围,从而使姿态估计网络的训练能够受益于sigmoid函数。这样,在不损失空间精度的情况下,可以从LR特征推断出输出热图,同时大幅降低计算成本和模型尺寸。与此前占主导地位的姿态估计网络相比,该方法减少了58%的浮点运算量(FLOPs),同时精度提升1.3%。大量实验表明,FasterPose在COCO和MPII等常用基准上取得了令人满意的结果,一致地验证了其在实际应用中的有效性和效率,特别是非GPU场景下的低延迟、低能耗应用。 摘要:The performance of human pose estimation depends on the spatial accuracy of keypoint localization. Most existing methods pursue the spatial accuracy through learning the high-resolution (HR) representation from input images. By the experimental analysis, we find that the HR representation leads to a sharp increase of computational cost, while the accuracy improvement remains marginal compared with the low-resolution (LR) representation. In this paper, we propose a design paradigm for cost-effective network with LR representation for efficient pose estimation, named FasterPose. Whereas the LR design largely shrinks the model complexity, yet how to effectively train the network with respect to the spatial accuracy is a concomitant challenge. We study the training behavior of FasterPose, and formulate a novel regressive cross-entropy (RCE) loss function for accelerating the convergence and promoting the accuracy. The RCE loss generalizes the ordinary cross-entropy loss from the binary supervision to a continuous range, thus the training of pose estimation network is able to benefit from the sigmoid function. By doing so, the output heatmap can be inferred from the LR features without loss of spatial accuracy, while the computational cost and model size has been significantly reduced. Compared with the previously dominant network of pose estimation, our method reduces 58% of the FLOPs and simultaneously gains 1.3% improvement of accuracy. Extensive experiments show that FasterPose yields promising results on the common benchmarks, i.e., COCO and MPII, consistently validating the effectiveness and efficiency for practical utilization, especially the low-latency and low-energy-budget applications in the non-GPU scenarios.
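按摘要的描述,RCE可以理解为把二值交叉熵的0/1监督推广到[0,1]连续目标并配合sigmoid使用;下面是据此推测的一个简化实现,具体形式以论文为准。

```python
import torch

def rce_loss(pred_logits, target, eps=1e-6):
    """连续目标 t∈[0,1] 的交叉熵:t 取 0/1 时退化为普通二值交叉熵。"""
    p = torch.sigmoid(pred_logits).clamp(eps, 1 - eps)
    return -(target * p.log() + (1 - target) * (1 - p).log()).mean()

pred = torch.randn(2, 17, 48, 64)                   # 低分辨率网络输出
target = torch.rand(2, 17, 48, 64)                  # 连续取值的高斯热图(模拟)
print(rce_loss(pred, target))
```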

【3】 Cross-View Exocentric to Egocentric Video Synthesis 标题:从外中心到自我中心的跨视角视频合成

作者:Gaowen Liu,Hao Tang,Hugo Latapie,Jason Corso,Yan Yan 机构:Cisco Systems, San Jose, CA, USA, DISI, University of Trento, Trento, Italy, Stevens Institute of Technology, Hoboken, NJ, USA, Illinois Institute of Technology, Chicago, IL, USA 备注:ACM MM 2021 链接:https://arxiv.org/abs/2107.03120 摘要:跨视角视频合成任务试图从另一个截然不同的视角生成某一视角的视频序列。本文研究从外中心(第三人称)视角到自我中心(第一人称)视角的视频生成任务。这一任务颇具挑战性,因为自我中心视角有时与外中心视角差异巨大,因而在两个截然不同的视角之间转换外观并非易事。特别地,我们提出了一种新的双向时空注意力融合生成对抗网络(STA-GAN),同时学习空间和时间信息,从外中心视角生成自我中心视频序列。所提出的STA-GAN由三部分组成:时间分支、空间分支和注意力融合。首先,时间和空间分支生成一系列假帧及其相应的特征;两个分支都在下游和上游两个方向上生成假帧。接着,生成的四组不同的假帧及其对应特征(两个方向上的空间和时间分支)被送入一个新颖的多生成注意力融合模块,以产生最终的视频序列。同时,我们还提出了一种新的时间和空间双重鉴别器,使网络优化更加鲁棒。在Side2Ego和Top2Ego数据集上的大量实验表明,所提出的STA-GAN明显优于现有方法。 摘要:Cross-view video synthesis task seeks to generate video sequences of one view from another dramatically different view. In this paper, we investigate the exocentric (third-person) view to egocentric (first-person) view video generation task. This is challenging because egocentric view sometimes is remarkably different from the exocentric view. Thus, transforming the appearances across the two different views is a non-trivial task. Particularly, we propose a novel Bi-directional Spatial Temporal Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and temporal information to generate egocentric video sequences from the exocentric view. The proposed STA-GAN consists of three parts: temporal branch, spatial branch, and attention fusion. First, the temporal and spatial branches generate a sequence of fake frames and their corresponding features. The fake frames are generated in both downstream and upstream directions for both temporal and spatial branches. Next, the generated four different fake frames and their corresponding features (spatial and temporal branches in two directions) are fed into a novel multi-generation attention fusion module to produce the final video sequence. Meanwhile, we also propose a novel temporal and spatial dual-discriminator for more robust network optimization. Extensive experiments on the Side2Ego and Top2Ego datasets show that the proposed STA-GAN significantly outperforms the existing methods.

【4】 Greedy Offset-Guided Keypoint Grouping for Human Pose Estimation 标题:用于人体姿态估计的贪婪偏移量引导的关键点分组

作者:Jia Li,Linhua Xiang,Jiwei Chen,Zengfu Wang 机构:† Department of Automation, University of Science and Technology of China, Hefei, China, ‡ Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, China 备注:Accepted by PRCV2021, the official PRCV2021 version is different 链接:https://arxiv.org/abs/2107.03098 摘要:针对多人姿态估计问题,我们提出了一种简单可靠的自底向上方法,在精度和效率之间取得了良好的折衷。给定一幅图像,我们使用沙漏网络不加区分地推断所有人的全部关键点,以及连接属于同一个人的相邻关键点的引导偏移。然后,我们利用预测的引导偏移,贪婪地将候选关键点分组为多个人体姿态(如果存在)。我们将此过程称为贪婪偏移引导关键点分组(GOG)。此外,我们重新审视了多人关键点坐标的编码-解码方法,并揭示了影响精度的一些重要因素。实验证明了所引入组件带来的明显性能提升。在公平条件下,我们的方法在具有挑战性的COCO数据集上可与最新技术相媲美。源代码和我们预训练的模型已在网上公开。 摘要:We propose a simple yet reliable bottom-up approach with a good trade-off between accuracy and efficiency for the problem of multi-person pose estimation. Given an image, we employ an Hourglass Network to infer all the keypoints from different persons indiscriminately as well as the guiding offsets connecting the adjacent keypoints belonging to the same persons. Then, we greedily group the candidate keypoints into multiple human poses (if any), utilizing the predicted guiding offsets. And we refer to this process as greedy offset-guided keypoint grouping (GOG). Moreover, we revisit the encoding-decoding method for the multi-person keypoint coordinates and reveal some important facts affecting accuracy. Experiments have demonstrated the obvious performance improvements brought by the introduced components. Our approach is comparable to the state of the art on the challenging COCO dataset under fair conditions. The source code and our pre-trained model are publicly available online.
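下面用NumPy给出引导偏移贪婪分组这一规则的玩具示意:从已分组关键点出发,按"当前点+偏移≈相邻关节位置"就近匹配候选点。数据结构与距离阈值均为假设。

```python
import numpy as np

def greedy_group(done_kpts, offsets, cand_next, max_dist=10.0):
    """对每个已分组关键点,用引导偏移预测相邻关节位置,再就近匹配候选点。"""
    used = np.zeros(len(cand_next), dtype=bool)
    pairs = []
    for kp, off in zip(done_kpts, offsets):
        guess = kp + off                            # 相邻关节的期望位置
        d = np.linalg.norm(cand_next - guess, axis=1)
        d[used] = np.inf                            # 候选点不可复用
        j = int(d.argmin())
        if d[j] < max_dist:
            used[j] = True
            pairs.append((tuple(kp), tuple(cand_next[j])))
        else:
            pairs.append((tuple(kp), None))         # 相邻关节缺失
    return pairs

noses = np.array([[10., 10.], [60., 40.]])          # 两个人的鼻子(已分组)
necks = np.array([[61., 49.], [12., 18.]])          # 颈部候选(无序)
offs = np.array([[2., 8.], [1., 9.]])               # 每个鼻子处预测的引导偏移
print(greedy_group(noses, offs, necks))
```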

【5】 Maintaining a Reliable World Model using Action-aware Perceptual Anchoring 标题:使用动作感知的知觉锚定维护可靠的世界模型

作者:Ying Siu Liang,Dongkyu Choi,Kenneth Kwok 机构: Institute of High Performance Computing 备注:None 链接:https://arxiv.org/abs/2107.03038 摘要:可靠的感知对于与世界交互的机器人至关重要。但传感器本身往往不足以提供这种能力,而且在环境的各种条件下容易出错。此外,即使物体移出视野、不再可见,机器人也需要维护其周围环境的模型。这需要将感知信息锚定到表示环境中物体的符号上。在本文中,我们提出了一个动作感知的知觉锚定模型,使机器人能够以持久的方式跟踪物体。这种基于规则的方法考虑归纳偏置,对低层目标检测的结果进行高层推理,提高了机器人在复杂任务中的感知能力。我们在物体恒存性方面将该模型与现有基线模型进行比较,并在一个包含1371个视频的数据集上的snitch定位任务中取得了优于这些基线的结果。我们还将动作感知的知觉锚定集成到一个认知架构中,并在Universal Robot机械臂上的真实变速箱装配任务中展示了它的优势。 摘要:Reliable perception is essential for robots that interact with the world. But sensors alone are often insufficient to provide this capability, and they are prone to errors due to various conditions in the environment. Furthermore, there is a need for robots to maintain a model of its surroundings even when objects go out of view and are no longer visible. This requires anchoring perceptual information onto symbols that represent the objects in the environment. In this paper, we present a model for action-aware perceptual anchoring that enables robots to track objects in a persistent manner. Our rule-based approach considers inductive biases to perform high-level reasoning over the results from low-level object detection, and it improves the robot's perceptual capability for complex tasks. We evaluate our model against existing baseline models for object permanence and show that it outperforms these on a snitch localisation task using a dataset of 1,371 videos. We also integrate our action-aware perceptual anchoring in the context of a cognitive architecture and demonstrate its benefits in a realistic gearbox assembly task on a Universal Robot.

【6】 PoseRN: A 2D pose refinement network for bias-free multi-view 3D human pose estimation 标题:PoseRN:一种用于无偏多视角三维人体姿态估计的二维姿态精化网络

作者:Akihiko Sayo,Diego Thomas,Hiroshi Kawasaki,Yuta Nakashima,Katsushi Ikeuchi 机构:⋆ Kyushu University, Japan, † Osaka University, Japan, ‡ Microsoft Corp, USA 链接:https://arxiv.org/abs/2107.03000 摘要:我们提出了一个新的二维姿态细化网络,学习预测估计二维姿态中的人为偏差。二维姿态估计中之所以存在偏差,是因为基于标注者感知的二维关节位置标注与运动捕捉(MoCap)系统所定义的关节位置之间存在差异。这些偏差被固化在公开可用的二维姿态数据集中,无法用现有的误差消减方法去除。我们提出的姿态细化网络能够有效地消除估计二维姿态中的人为偏差,实现高精度的多视角三维人体姿态估计。 摘要:We propose a new 2D pose refinement network that learns to predict the human bias in the estimated 2D pose. There are biases in 2D pose estimations that are due to differences between annotations of 2D joint locations based on annotators' perception and those defined by motion capture (MoCap) systems. These biases are crafted into publicly available 2D pose datasets and cannot be removed with existing error reduction approaches. Our proposed pose refinement network allows us to efficiently remove the human bias in the estimated 2D poses and achieve highly accurate multi-view 3D human pose estimation.

医学相关(3篇)

【1】 MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering 标题:MuVAM:一种基于多视角注意力的医学视觉问答模型

作者:Haiwei Pan,Shuning He,Kejia Zhang,Bo Qu,Chunling Chen,Kun Shi 链接:https://arxiv.org/abs/2107.03216 摘要:医学视觉问答(VQA)是一项多模式的具有挑战性的任务,被计算机视觉和自然语言处理研究界广泛关注。针对目前大多数医学VQA模型只关注视觉内容,忽略了文本的重要性,提出了一种基于多视点注意的医学视觉问答模型(MuVAM),该模型在文本描述的基础上集成了医学图像的高层语义。首先,针对视觉和文本两种模式,采用不同的方法提取图像特征和问题。其次,本文提出了一种多视角的注意机制,包括图像到问题(I2Q)注意和词到文本(W2T)注意。多视角注意可以将问题与图像和文字联系起来,以便更好地分析问题,得到准确的答案。第三,提出了一种复合损失模型,对多模态特征融合后的答案进行准确预测,提高了视觉和文本跨模态特征的相似度。它包括分类损失和图像问题互补损失。最后,针对VQA-RAD数据集中的数据错误和标签缺失,我们与医学专家合作对该数据集进行修正和完善,然后构建一个增强的数据集VQA-RADPh。在这两个数据集上的实验表明,MuVAM的有效性优于现有的方法。 摘要:Medical Visual Question Answering (VQA) is a multi-modal challenging task widely considered by research communities of the computer vision and natural language processing. Since most current medical VQA models focus on visual content, ignoring the importance of text, this paper proposes a multi-view attention-based model(MuVAM) for medical visual question answering which integrates the high-level semantics of medical images on the basis of text description. Firstly, different methods are utilized to extract the features of the image and the question for the two modalities of vision and text. Secondly, this paper proposes a multi-view attention mechanism that include Image-to-Question (I2Q) attention and Word-to-Text (W2T) attention. Multi-view attention can correlate the question with image and word in order to better analyze the question and get an accurate answer. Thirdly, a composite loss is presented to predict the answer accurately after multi-modal feature fusion and improve the similarity between visual and textual cross-modal features. It consists of classification loss and image-question complementary (IQC) loss. Finally, for data errors and missing labels in the VQA-RAD dataset, we collaborate with medical experts to correct and complete this dataset and then construct an enhanced dataset, VQA-RADPh. The experiments on these two datasets show that the effectiveness of MuVAM surpasses the state-of-the-art method.
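摘要中的图像到问题(I2Q)注意本质上是一种跨模态注意;下面给出一个缩放点积形式的示意实现(PyTorch),区域数、词数与维度均为假设。

```python
import torch
import torch.nn.functional as F

def i2q_attention(img_feats, q_feats):
    """以问题词为 query、图像区域为 key/value 的缩放点积注意。"""
    d = img_feats.shape[-1]
    attn = F.softmax(q_feats @ img_feats.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ img_feats                        # 每个问题词的视觉上下文

img = torch.randn(2, 36, 512)                      # 36 个图像区域特征
qst = torch.randn(2, 12, 512)                      # 12 个问题词特征
print(i2q_attention(img, qst).shape)               # torch.Size([2, 12, 512])
```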

【2】 Bone Surface Reconstruction and Clinical Features Estimation from Sparse Landmarks and Statistical Shape Models: A feasibility study on the femur 标题:基于稀疏标志和统计形状模型的骨表面重建及临床特征评估:股骨的可行性研究

作者:Alireza Asvadi,Guillaume Dardenne,Jocelyne Troccaz,Valerie Burdin 机构:University of Western Brittany, UBO, Brest France;, University Hospital of Brest, Brest, France;, Univ. Grenoble Alpes, CNRS, Grenoble INP, TIMC-IMAG, F-, Grenoble, France;, IMT Atlantique, Mines Telecom Institute, Brest, France; 链接:https://arxiv.org/abs/2107.03292 摘要:在这项研究中,我们探讨了一种从若干易于识别的骨性标志点确定股骨表面及其机械轴的方法。为此,我们使用统计形状模型(SSM)从这些标志点重建整个股骨。本研究的目的是评估标志点的数量、位置和精度对股骨重建及其机械轴确定的影响;机械轴是下肢分析中需要考虑的重要临床参数。我们分别从内部数据集和公开数据集创建了两个股骨统计模型,并以平均点对点表面距离误差以及所得到的股骨机械轴对二者进行评估。此外,还研究了用皮肤上的标志物替代骨性标志物的临床影响。与皮肤标志点相比,由骨性标志点预测的股骨近端更为准确,但两者的机械轴角度偏差均小于3.5度。关于无创确定机械轴的结果非常令人鼓舞,可以为骨科或功能康复中的下肢分析打开非常有趣的临床前景。 摘要:In this study, we investigated a method allowing the determination of the femur bone surface as well as its mechanical axis from some easy-to-identify bony landmarks. The reconstruction of the whole femur is therefore performed from these landmarks using a Statistical Shape Model (SSM). The aim of this research is therefore to assess the impact of the number, the position, and the accuracy of the landmarks for the reconstruction of the femur and the determination of its related mechanical axis, an important clinical parameter to consider for the lower limb analysis. Two statistical femur models were created from our in-house dataset and a publicly available dataset. Both were evaluated in terms of average point-to-point surface distance error and through the mechanical axis of the femur. Furthermore, the clinical impact of using landmarks on the skin in replacement of bony landmarks is investigated. The predicted proximal femurs from bony landmarks were more accurate compared to on-skin landmarks while both had less than 3.5 degrees mechanical axis angle deviation error. The results regarding the non-invasive determination of the mechanical axis are very encouraging and could open very interesting clinical perspectives for the analysis of the lower limb either for orthopedics or functional rehabilitation.
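从稀疏标志点重建统计形状模型(SSM)系数,通常可写成一个(带正则的)线性最小二乘问题;下面用随机数据给出这一思路的NumPy示意,所有矩阵与规模均为假设。

```python
import numpy as np

rng = np.random.default_rng(0)
n_pts, n_modes, n_lmk = 5000, 20, 8                 # 顶点数、形状模式数、标志点数
mean = rng.standard_normal(3 * n_pts)               # 平均形状(展平的 xyz)
Phi = rng.standard_normal((3 * n_pts, n_modes)) * 0.01   # 主成分形状模式
lmk = rng.choice(n_pts, n_lmk, replace=False)
rows = np.concatenate([3 * lmk, 3 * lmk + 1, 3 * lmk + 2])  # 标志点对应的坐标行

b_true = rng.standard_normal(n_modes)
observed = (mean + Phi @ b_true)[rows]              # 观测到的标志点坐标

A, y = Phi[rows], observed - mean[rows]
lam = 1e-3                                          # Tikhonov 正则,抑制过拟合
b = np.linalg.solve(A.T @ A + lam * np.eye(n_modes), A.T @ y)
recon = mean + Phi @ b                              # 由系数 b 重建整块骨表面
print('形状系数误差:', np.linalg.norm(b - b_true))
```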

【3】 End-to-End Simultaneous Learning of Single-particle Orientation and 3D Map Reconstruction from Cryo-electron Microscopy Data 标题:从冷冻电镜数据端到端地同步学习单粒子取向与三维密度图重建

作者:Youssef S. G. Nashed,Frederic Poitevin,Harshit Gupta,Geoffrey Woollard,Michael Kagan,Chuck Yoon,Daniel Ratner 机构: Department of LCLS Data Analytics, Department of Mathematics, University of British Columbia 备注:13 pages, 4 figures 链接:https://arxiv.org/abs/2107.02958 摘要:低温电子显微镜(cryo-EM)可以从同一生物分子的不同拷贝上获得任意方向的图像。在这里,我们提出了一种端到端的无监督方法,它从低温电磁数据中学习单个粒子的取向,同时从随机初始化开始重建生物分子的平均三维图。该方法依赖于自动编码器架构,其中潜在空间被显式地解释为解码器根据线性投影模型形成图像所使用的方向。我们在模拟数据上对我们的方法进行了评估,结果表明该方法能够从噪声和CTF污染的未知粒子方向的二维投影图像重建三维粒子图。 摘要:Cryogenic electron microscopy (cryo-EM) provides images from different copies of the same biomolecule in arbitrary orientations. Here, we present an end-to-end unsupervised approach that learns individual particle orientations from cryo-EM data while reconstructing the average 3D map of the biomolecule, starting from a random initialization. The approach relies on an auto-encoder architecture where the latent space is explicitly interpreted as orientations used by the decoder to form an image according to the linear projection model. We evaluate our method on simulated data and show that it is able to reconstruct 3D particle maps from noisy- and CTF-corrupted 2D projection images of unknown particle orientations.

GAN|对抗|攻击|生成相关(2篇)

【1】 FBC-GAN: Diverse and Flexible Image Synthesis via Foreground-Background Composition 标题:FBC-GAN:基于前景-背景合成的多样化灵活图像合成

作者:Kaiwen Cui,Gongjie Zhang,Fangneng Zhan,Jiaxing Huang,Shijian Lu 机构:Nanyang Technological University, Singapore 链接:https://arxiv.org/abs/2107.03166 摘要:生成性对抗网络已经成为图像合成的事实标准。然而,现有的GANs算法在不考虑前景-背景分解的情况下,往往会捕捉到前景和背景之间过多的内容相关性,从而限制了图像生成的多样性。提出了一种新的前景背景合成GAN(FBC-GAN),它通过同时独立地生成前景对象和背景场景,然后以风格和几何一致性进行合成,从而实现图像的生成。通过这种显式设计,FBC-GAN可以生成具有在内容上相互独立的前景和背景的图像,从而解除不希望学习的内容相关性约束,实现更好的多样性。它还提供了极好的灵活性,允许相同的前景对象具有不同的背景场景,相同的背景场景具有不同的前景对象,或者相同的前景对象和背景场景具有不同的对象位置、大小和姿势。它还可以合成从不同数据集中采集的前景对象和背景场景。在多个数据集上的大量实验表明,与现有的方法相比,FBC-GAN具有很强的视觉真实感和优越的多样性。 摘要:Generative Adversarial Networks (GANs) have become the de-facto standard in image synthesis. However, without considering the foreground-background decomposition, existing GANs tend to capture excessive content correlation between foreground and background, thus constraining the diversity in image generation. This paper presents a novel Foreground-Background Composition GAN (FBC-GAN) that performs image generation by generating foreground objects and background scenes concurrently and independently, followed by composing them with style and geometrical consistency. With this explicit design, FBC-GAN can generate images with foregrounds and backgrounds that are mutually independent in contents, thus lifting the undesirably learned content correlation constraint and achieving superior diversity. It also provides excellent flexibility by allowing the same foreground object with different background scenes, the same background scene with varying foreground objects, or the same foreground object and background scene with different object positions, sizes and poses. It can compose foreground objects and background scenes sampled from different datasets as well. Extensive experiments over multiple datasets show that FBC-GAN achieves competitive visual realism and superior diversity as compared with state-of-the-art methods.

【2】 Controlled Caption Generation for Images Through Adversarial Attacks 标题:基于对抗性攻击的受控图像字幕生成

作者:Nayyer Aafaq,Naveed Akhtar,Wei Liu,Mubarak Shah,Ajmal Mian 机构:University of Western Australia, Stirling Highway, WA, University of Central Florida, Central Florida Blvd, Orlando, Florida, USA 链接:https://arxiv.org/abs/2107.03050 摘要:深度学习容易受到对抗样本的影响,然而其在图像字幕生成中的对抗脆弱性尚未得到充分研究。我们研究视觉-语言模型的对抗样本;这类模型通常采用编码器-解码器框架,由两个主要组件组成:用于图像特征提取的卷积神经网络(CNN)和用于字幕生成的递归神经网络(RNN)。特别地,我们研究针对视觉编码器隐藏层(其输出被馈入后续递归网络)的攻击。现有方法要么攻击视觉编码器的分类层,要么从语言模型反向传播梯度。相比之下,我们提出了一种基于GAN的算法来构造针对神经图像字幕的对抗样本,它模仿CNN的内部表示,使输入图像的深度特征能够经由递归网络诱导受控的错误字幕生成。我们的贡献为理解针对带有语言组件的视觉系统的对抗攻击提供了新的见解。所提方法采用两种策略进行综合评估:第一种检验神经图像字幕系统能否被误导而输出指定的目标字幕;第二种分析将关键词植入预测字幕的可能性。实验表明,我们的算法能够基于CNN隐藏层构造有效的对抗图像,从而欺骗字幕框架。此外,我们发现所提出的攻击具有很强的可迁移性。我们的工作为神经图像字幕带来了新的鲁棒性启示。 摘要:Deep learning is found to be vulnerable to adversarial examples. However, its adversarial susceptibility in image caption generation is under-explored. We study adversarial examples for vision and language models, which typically adopt an encoder-decoder framework consisting of two major components: a Convolutional Neural Network (i.e., CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation. In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network. The existing methods either attack the classification layer of the visual encoder or they back-propagate the gradients from the language model. In contrast, we propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN such that the resulting deep features of the input image enable a controlled incorrect caption generation through the recurrent network. Our contribution provides new insights for understanding adversarial attacks on vision systems with language component. The proposed method employs two strategies for a comprehensive evaluation. The first examines if a neural image captioning system can be misled to output targeted image captions. The second analyzes the possibility of keywords into the predicted captions. Experiments show that our algorithm can craft effective adversarial images based on the CNN hidden layers to fool captioning framework. Moreover, we discover the proposed attack to be highly transferable. Our work leads to new robustness implications for neural image captioning.
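下面以PGD代替论文中基于GAN的生成器,给出"特征模仿"这一核心思想的简化示意:优化扰动使编码器隐藏特征逼近目标特征。编码器与全部超参数均为假设。

```python
import torch

def feature_mimic_attack(encoder, x, feat_target, steps=40, eps=8/255, alpha=1/255):
    """优化扰动 delta,使编码器隐藏特征逼近目标特征(L∞ 约束内的 PGD)。"""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        feat = encoder((x + delta).clamp(0, 1))
        loss = ((feat - feat_target) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()       # 朝目标特征方向下降
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach().clamp(0, 1)

# 玩具编码器,代表字幕模型中 CNN 的某个隐藏层(真实系统为预训练 CNN)
encoder = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, 2), torch.nn.ReLU(),
                              torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())
x = torch.rand(1, 3, 224, 224)
feat_target = encoder(torch.rand(1, 3, 224, 224)).detach()
x_adv = feature_mimic_attack(encoder, x, feat_target)
print('扰动范围:', (x_adv - x).abs().max().item())
```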

跟踪(1篇)

【1】 Deep Convolutional Correlation Iterative Particle Filter for Visual Tracking 标题:用于视觉跟踪的深度卷积相关迭代粒子滤波

作者:Reza Jalil Mozhdehi,Henry Medeiros 机构:Electrical and Computer Engineering Department, Marquette University, WI, USA 备注:28 pages, 9 figures, 1 table 链接:https://arxiv.org/abs/2107.02984 摘要:本文提出了一种视觉跟踪新框架,它将迭代粒子滤波、深度卷积神经网络和相关滤波器相结合。迭代粒子滤波使粒子能够自我校正并收敛到正确的目标位置。我们采用一种新策略,在迭代之后通过K均值聚类来评估粒子的似然。我们的方法确保了对后验分布的一致支持,因此无需在每个视频帧上执行重采样,提高了先验分布信息的利用率。在两个不同基准数据集上的实验结果表明,我们的跟踪器相对于最新方法表现出色。 摘要:This work proposes a novel framework for visual tracking based on the integration of an iterative particle filter, a deep convolutional neural network, and a correlation filter. The iterative particle filter enables the particles to correct themselves and converge to the correct target position. We employ a novel strategy to assess the likelihood of the particles after the iterations by applying K-means clustering. Our approach ensures a consistent support for the posterior distribution. Thus, we do not need to perform resampling at every video frame, improving the utilization of prior distribution information. Experimental results on two different benchmark datasets show that our tracker performs favorably against state-of-the-art methods.

图像视频检索|Re-id相关(1篇)

【1】 Partial 3D Object Retrieval using Local Binary QUICCI Descriptors and Dissimilarity Tree Indexing 标题:基于QUICCI局部二值描述子和相异树索引的部分三维对象检索

作者:Bart Iver van Blokland,Theoharis Theoharis 备注:19 pages, 17 figures, to be published in Computers & Graphics 链接:https://arxiv.org/abs/2107.03368 摘要:本文提出了一条完整的流水线,基于快速交集计数变化图像(QUICCI)二值局部描述子和一种新型索引树,实现精确且高效的部分三维对象检索。文中展示了对QUICCI查询描述子的一处修改如何使其成为部分检索的理想选择。我们提出了一种称为相异树的索引结构,可以显著加速在庞大的局部描述子空间中的搜索;它适用于QUICCI和其他二值描述子。该索引利用描述子内比特的分布来实现高效检索。我们在SHREC'16数据集的人工(artificial)子集上测试了该检索流水线,得到了接近理想的检索结果。 摘要:A complete pipeline is presented for accurate and efficient partial 3D object retrieval based on Quick Intersection Count Change Image (QUICCI) binary local descriptors and a novel indexing tree. It is shown how a modification to the QUICCI query descriptor makes it ideal for partial retrieval. An indexing structure called Dissimilarity Tree is proposed which can significantly accelerate searching the large space of local descriptors; this is applicable to QUICCI and other binary descriptors. The index exploits the distribution of bits within descriptors for efficient retrieval. The retrieval pipeline is tested on the artificial part of SHREC'16 dataset with near-ideal retrieval results.
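相异树所加速的基本操作是二值描述子间的汉明距离搜索;下面给出该搜索的NumPy暴力基线,便于理解索引要优化的目标。描述子长度与库规模均为假设。

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(100000, 64), dtype=np.uint8)   # 10 万个 512 位描述子
query = rng.integers(0, 256, size=(64,), dtype=np.uint8)

# 汉明距离 = 按位异或后数 1;相异树的作用是避免对全库做这一次线性扫描
dists = np.unpackbits(db ^ query, axis=1).sum(axis=1)
nearest = int(dists.argmin())
print('最近邻下标:', nearest, '汉明距离:', int(dists[nearest]))
```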

蒸馏|知识提取(2篇)

【1】 Novel Visual Category Discovery with Dual Ranking Statistics and Mutual Knowledge Distillation 标题:基于双重排序统计和互知识蒸馏的新视觉类别发现

作者:Bingchen Zhao,Kai Han 机构:Tongji University, University of Bristol 链接:https://arxiv.org/abs/2107.03358 摘要:在本文中,我们解决新视觉类别发现问题,即利用一个包含其他不同但相关类别图像的有标记数据集,将新类别中未标记的图像分组到不同的语义分区中。这是一个比传统半监督学习更现实、更具挑战性的设定。针对这个问题,我们提出了一个两分支的学习框架,一个分支关注局部的部件级信息,另一个分支关注整体特征。为了将知识从有标记数据迁移到无标记数据,我们提出在两个分支上使用双重排序统计量来生成伪标签,用于在无标记数据上训练。我们进一步引入了一种互知识蒸馏方法,允许信息交换,并鼓励两个分支在发现新类别时达成一致,使我们的模型能够同时享有全局和局部特征的好处。我们在通用物体分类的公共基准以及更具挑战性的细粒度视觉识别数据集上全面评估了该方法,取得了最先进的性能。 摘要:In this paper, we tackle the problem of novel visual category discovery, i.e., grouping unlabelled images from new classes into different semantic partitions by leveraging a labelled dataset that contains images from other different but relevant categories. This is a more realistic and challenging setting than conventional semi-supervised learning. We propose a two-branch learning framework for this problem, with one branch focusing on local part-level information and the other branch focusing on overall characteristics. To transfer knowledge from the labelled data to the unlabelled, we propose using dual ranking statistics on both branches to generate pseudo labels for training on the unlabelled data. We further introduce a mutual knowledge distillation method to allow information exchange and encourage agreement between the two branches for discovering new categories, allowing our model to enjoy the benefits of global and local features. We comprehensively evaluate our method on public benchmarks for generic object classification, as well as the more challenging datasets for fine-grained visual recognition, achieving state-of-the-art performance.
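排序统计的基本想法是:若两个样本激活最强的前k个特征维度一致,则视为同一(新)类。下面是这一配对伪标签规则的玩具示意,k与判定准则均为假设。

```python
import torch

def rank_stat_pairs(feats, k=5):
    """若两个样本响应最强的前 k 个特征维度完全一致,则打上"同类"伪标签。"""
    topk = feats.topk(k, dim=1).indices
    n = feats.shape[0]
    same = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            shared = len(set(topk[i].tolist()) & set(topk[j].tolist()))
            same[i, j] = float(shared == k)        # 也可放宽为 shared >= 某阈值
    return same

feats = torch.randn(8, 512).abs()                  # 未标注数据的特征(模拟)
print(rank_stat_pairs(feats))
```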

【2】 Plot2Spectra: an Automatic Spectra Extraction Tool 标题:Plot2Spectra:一种自动光谱提取工具

作者:Weixin Jiang,Eric Schwenker,Trevor Spreadbury,Kai Li,Maria K. Y. Chan,Oliver Cossairt 机构:Maria K.Y. Chan, Department of Computer Science, Northwestern University, Illinois, USA, Center for Nanoscale Materials, Argonne National Laboratory, Department of Materials Science and, Engineering, Facebook Reality Lab, Facebook Inc., California, USA 链接:https://arxiv.org/abs/2107.02827 摘要:不同类型的光谱,如X射线吸收近边结构(XANES)和拉曼光谱,对分析不同材料的特性起着非常重要的作用。在科学文献中,XANES/Raman数据通常绘制成折线图,当最终用户是人类读者时,这是一种视觉上合适的表示信息的方法。然而,由于缺乏自动化工具,这种图形不利于直接的程序分析。本文研制了一种绘图数字化仪Plot2Spectra,用于从光谱图像中自动提取数据点,为大规模的数据采集和分析提供了可能。具体来说,绘图数字化仪是一个两阶段的框架。在第一个轴对齐阶段,我们采用一个无锚检测器来检测绘图区域,然后用一个基于边缘的约束来细化检测到的边界框来定位两个轴的位置。我们还应用场景文本检测器来提取和解释x轴以下的所有记号信息。在第二个绘图数据提取阶段,我们首先使用语义分割将属于绘图线的像素从背景中分离出来,然后将光流约束合并到绘图线像素中,将它们分配给它们编码的适当的线(数据实例)。通过大量的实验验证了所提出的绘图数字化仪的有效性,表明该工具可以加速材料特性的发现和机器学习。 摘要:Different types of spectroscopies, such as X-ray absorption near edge structure (XANES) and Raman spectroscopy, play a very important role in analyzing the characteristics of different materials. In scientific literature, XANES/Raman data are usually plotted in line graphs which is a visually appropriate way to represent the information when the end-user is a human reader. However, such graphs are not conducive to direct programmatic analysis due to the lack of automatic tools. In this paper, we develop a plot digitizer, named Plot2Spectra, to extract data points from spectroscopy graph images in an automatic fashion, which makes it possible for large scale data acquisition and analysis. Specifically, the plot digitizer is a two-stage framework. In the first axis alignment stage, we adopt an anchor-free detector to detect the plot region and then refine the detected bounding boxes with an edge-based constraint to locate the position of two axes. We also apply scene text detector to extract and interpret all tick information below the x-axis. In the second plot data extraction stage, we first employ semantic segmentation to separate pixels belonging to plot lines from the background, and from there, incorporate optical flow constraints to the plot line pixels to assign them to the appropriate line (data instance) they encode. Extensive experiments are conducted to validate the effectiveness of the proposed plot digitizer, which shows that such a tool could help accelerate the discovery and machine learning of materials properties.

超分辨率|去噪|去模糊|去雾(2篇)

【1】 Blind Image Super-Resolution: A Survey and Beyond 标题:盲图像超分辨率:综述与展望

作者:Anran Liu,Yihao Liu,Jinjin Gu,Yu Qiao,Chao Dong 机构:The University of Sydney 链接:https://arxiv.org/abs/2107.03055 摘要:盲图像超分辨率(SR)旨在对退化方式未知的低分辨率图像进行超分辨,因其对推动实际应用的重要意义而日益受到关注。近年来,特别是借助强大的深度学习技术,人们提出了许多新颖有效的解决方案。尽管经过多年努力,它仍然是一个具有挑战性的研究问题。本文系统地回顾了近年来盲图像SR的研究进展,并提出了一种分类方法,根据退化建模方式和求解SR模型所用的数据,将现有方法分为三类。这种分类法有助于总结和区分现有方法。我们希望为当前的研究状况提供一些见解,同时揭示值得探索的新研究方向。此外,本文还总结了常用的数据集和以往与盲图像SR相关的竞赛。最后,对各种方法进行了比较,并使用合成和真实测试图像详细分析了它们的优缺点。 摘要:Blind image super-resolution (SR), aiming to super-resolve low-resolution images with unknown degradation, has attracted increasing attention due to its significance in promoting real-world applications. Many novel and effective solutions have been proposed recently, especially with the powerful deep learning techniques. Despite years of efforts, it still remains as a challenging research problem. This paper serves as a systematic review on recent progress in blind image SR, and proposes a taxonomy to categorize existing methods into three different classes according to their ways of degradation modelling and the data used for solving the SR model. This taxonomy helps summarize and distinguish among existing methods. We hope to provide insights into current research states, as well as to reveal novel research directions worth exploring. In addition, we make a summary on commonly used datasets and previous competitions related to blind image SR. Last but not least, a comparison among different methods is provided with detailed analysis on their merits and demerits using both synthetic and real testing images.

【2】 Structured Denoising Diffusion Models in Discrete State-Spaces 标题:离散状态空间中的结构化去噪扩散模型

作者:Jacob Austin,Daniel Johnson,Jonathan Ho,Danny Tarlow,Rianne van den Berg 机构:Google Research, Brain Team 备注:10 pages plus references and appendices. First two authors contributed equally 链接:https://arxiv.org/abs/2107.03006 摘要:去噪扩散概率模型(DDPM)(Ho et al. 2020)在连续状态空间中的图像和波形生成方面取得了令人印象深刻的结果。本文提出离散去噪扩散概率模型(D3PM),这是一类针对离散数据的类扩散生成模型,它超越了具有均匀转移概率的破坏过程,从而推广了Hoogeboom等人2021年的多项式扩散模型。这包括:采用在连续空间中模仿高斯核的转移矩阵的破坏过程、基于嵌入空间中最近邻的转移矩阵,以及引入吸收态的转移矩阵。第三种选择使我们能够在扩散模型与自回归及基于掩码的生成模型之间建立联系。我们证明了转移矩阵的选择是一个重要的设计决策,它能改善图像和文本领域的结果。我们还引入了一个新的损失函数,将变分下界与辅助交叉熵损失相结合。对于文本,该类模型在字符级文本生成上取得了很好的效果,同时可以扩展到LM1B上的大词表。在图像数据集CIFAR-10上,我们的模型在样本质量上接近、在对数似然上超过了连续空间DDPM模型。 摘要:Denoising diffusion probabilistic models (DDPMs) (Ho et al. 2020) have shown impressive results on image and waveform generation in continuous state spaces. Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-like generative models for discrete data that generalize the multinomial diffusion model of Hoogeboom et al. 2021, by going beyond corruption processes with uniform transition probabilities. This includes corruption with transition matrices that mimic Gaussian kernels in continuous space, matrices based on nearest neighbors in embedding space, and matrices that introduce absorbing states. The third allows us to draw a connection between diffusion models and autoregressive and mask-based generative models. We show that the choice of transition matrix is an important design decision that leads to improved results in image and text domains. We also introduce a new loss function that combines the variational lower bound with an auxiliary cross entropy loss. For text, this model class achieves strong results on character-level text generation while scaling to large vocabularies on LM1B. On the image dataset CIFAR-10, our models approach the sample quality and exceed the log-likelihood of the continuous-space DDPM model.
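下面用NumPy构造D3PM中带吸收态([MASK])的单步转移矩阵,并演示前向破坏分布 q(x_t|x_0) 的计算;类别数与破坏率均为假设。

```python
import numpy as np

def absorbing_Q(K, beta):
    """单步转移矩阵:以概率 beta 跳到吸收态 [MASK](下标 K),否则保持不变。"""
    Q = (1 - beta) * np.eye(K + 1)
    Q[:, K] += beta                     # 吸收态一行由此自动满足 Q[K, K] = 1
    return Q

K, T, beta = 4, 10, 0.15                # 4 个普通类别 + 1 个 [MASK],10 步破坏
Q = absorbing_Q(K, beta)
Q_bar = np.linalg.matrix_power(Q, T)    # 累积转移矩阵
x0 = np.eye(K + 1)[2]                   # one-hot 的原始离散符号 x_0 = 2
print('q(x_T | x_0) =', x0 @ Q_bar)     # 前向破坏分布,概率质量逐渐集中到 [MASK]
```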

点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】 Edge-aware Bidirectional Diffusion for Dense Depth Estimation from Light Fields 标题:边缘感知双向扩散在光场密集深度估计中的应用

作者:Numair Khan,Min H. Kim,James Tompkin 机构: Brown University, KAIST 备注:Project webpage: this http URL 链接:https://arxiv.org/abs/2107.02967 摘要:提出了一种基于稀疏深度边缘和梯度的光场深度图快速精确估计算法。我们提出的方法基于真实深度边缘比纹理边缘对局部约束更敏感的思想,因此可以通过双向扩散过程可靠地消除歧义。首先,我们使用外极平面图像来估计稀疏像素集上的亚像素视差。为了有效地寻找稀疏点,我们提出了一种基于熵的细化方法,从有限的一组定向滤波器组中提取线估计。接下来,为了估计远离稀疏点的扩散方向,我们通过我们的双向扩散方法来优化这些点上的约束。这解决了边缘所属曲面的模糊性,并可靠地将深度与纹理边缘分离,使我们能够以深度边缘和遮挡感知的方式扩散稀疏集,以获得精确的密集深度贴图。 摘要:We present an algorithm to estimate fast and accurate depth maps from light fields via a sparse set of depth edges and gradients. Our proposed approach is based around the idea that true depth edges are more sensitive than texture edges to local constraints, and so they can be reliably disambiguated through a bidirectional diffusion process. First, we use epipolar-plane images to estimate sub-pixel disparity at a sparse set of pixels. To find sparse points efficiently, we propose an entropy-based refinement approach to a line estimate from a limited set of oriented filter banks. Next, to estimate the diffusion direction away from sparse points, we optimize constraints at these points via our bidirectional diffusion method. This resolves the ambiguity of which surface the edge belongs to and reliably separates depth from texture edges, allowing us to diffuse the sparse set in a depth-edge and occlusion-aware manner to obtain accurate dense depth maps.

多模态(1篇)

【1】 Multi-modal Affect Analysis using standardized data within subjects in the Wild 标题:利用野外受试者标准化数据进行多模态情感分析

作者:Sachihiro Youoku,Takahisa Yamamoto,Junya Saito,Akiyoshi Uchida,Xiaoyu Mi,Ziqiang Shi,Liu Liu,Zhongling Liu 机构:Advanced Converging Technologies Laboratories, Fujitsu Ltd., Fujitsu R&D Center Co.,LTD 备注:6 pages, 5 figures 链接:https://arxiv.org/abs/2107.03009 摘要:情感识别是人机交互中的一个重要因素。然而,利用野外(in-the-wild)数据开发的方法在实际应用中还不够精确。本文介绍了我们提交给2021年野外情感行为分析(ABAW)竞赛的情感识别方法,其聚焦于面部表情(EXP)和效价-唤醒度(valence-arousal)的计算。我们认为,在为视频标注面部表情时,不仅应依据所有人共有的特征来判断,还应依据个体时间序列的相对变化来判断。因此,在学习每一帧的共有特征之后,我们将各视频的共有特征与按视频标准化的特征相结合,利用时间序列数据构建了面部表情估计模型和效价-唤醒度模型。此外,上述特征还使用图像特征、AU、头部姿态和视线等多模态数据进行学习。在验证集上,我们的模型取得了0.546的面部表情得分。这些验证结果表明,我们提出的框架能够有效提高估计精度和鲁棒性。 摘要:Human affective recognition is an important factor in human-computer interaction. However, the method development with in-the-wild data is not yet accurate enough for practical usage. In this paper, we introduce the affective recognition method focusing on facial expression (EXP) and valence-arousal calculation that was submitted to the Affective Behavior Analysis in-the-wild (ABAW) 2021 Contest. When annotating facial expressions from a video, we thought that it would be judged not only from the features common to all people, but also from the relative changes in the time series of individuals. Therefore, after learning the common features for each frame, we constructed a facial expression estimation model and valence-arousal model using time-series data after combining the common features and the standardized features for each video. Furthermore, the above features were learned using multi-modal data such as image features, AU, Head pose, and Gaze. In the validation set, our model achieved a facial expression score of 0.546. These verification results reveal that our proposed framework can improve estimation accuracy and robustness effectively.

其他神经网络|深度学习|模型|建模(1篇)

【1】 Introducing the structural bases of typicality effects in deep learning 标题:浅谈深度学习中典型性效应的结构基础

作者:Omar Vidal Pino,Erickson Rangel Nascimento,Mario Fernando Montenegro Campos 机构:Computer Vision and Robotics Laboratory, Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte,-, Brazil 备注:14 pages (12 + 2 reference); 13 Figures and 2 Tables. arXiv admin note: text overlap with arXiv:1906.03365 链接:https://arxiv.org/abs/2107.03279 摘要:在本文中,我们假设:自然语义范畴中典型性程度的效应,可以基于深度学习模型所学到的人工范畴的结构来产生。受人类表示自然语义范畴方式的启发,并以原型理论为基础,我们提出了一种新的计算原型模型(CPM)来表示语义范畴的内部结构。与其他原型学习方法不同,我们的数学框架首次尝试赋予深度神经网络建模抽象语义概念的能力,例如范畴的中心语义、物体图像的典型性程度以及家族相似性关系。我们基于典型性概念提出了若干方法,在图像分类、全局语义描述和迁移学习等图像语义处理任务中评估我们的CPM模型。在ImageNet和Coco等不同图像数据集上的实验表明,在赋予机器更强的抽象能力以表示物体范畴语义这一努力中,我们的方法可能是一个可取的方案。 摘要:In this paper, we hypothesize that the effects of the degree of typicality in natural semantic categories can be generated based on the structure of artificial categories learned with deep learning models. Motivated by the human approach to representing natural semantic categories and based on the Prototype Theory foundations, we propose a novel Computational Prototype Model (CPM) to represent the internal structure of semantic categories. Unlike other prototype learning approaches, our mathematical framework proposes a first approach to provide deep neural networks with the ability to model abstract semantic concepts such as category central semantic meaning, typicality degree of an object's image, and family resemblance relationship. We proposed several methodologies based on the typicality's concept to evaluate our CPM-model in image semantic processing tasks such as image classification, a global semantic description, and transfer learning. Our experiments on different image datasets, such as ImageNet and Coco, showed that our approach might be an admissible proposition in the effort to endow machines with greater power of abstraction for the semantic representation of objects' categories.

其他(9篇)

【1】 Samplets: A new paradigm for data compression 标题:样本集:一种新的数据压缩范式

作者:Helmut Harbrecht,Michael Multerer 链接:https://arxiv.org/abs/2107.03337 摘要:在本文中,我们通过将Tausch-White小波的构造迁移到数据领域,引入了samplet(样本集)这一新概念。通过这种方式,我们得到了离散数据的多层表示,可直接实现数据压缩、奇异性检测和自适应。将样本集用于表示核矩阵(例如出现在基于核的学习或高斯过程回归中的核矩阵),我们最终得到准稀疏矩阵。通过对小条目设置阈值,这些矩阵可压缩到 O(N log N) 个相关条目,其中N是数据点的数目。这一特性允许采用减少填充(fill-in)的重排序,从而获得压缩矩阵的稀疏分解。除了对样本集及其性质的全面介绍之外,我们还进行了大量数值研究来对该方法进行基准测试。我们的结果表明,样本集朝着使大规模数据集可用于分析的方向迈出了相当大的一步。 摘要:In this article, we introduce the novel concept of samplets by transferring the construction of Tausch-White wavelets to the realm of data. This way we obtain a multilevel representation of discrete data which directly enables data compression, detection of singularities and adaptivity. Applying samplets to represent kernel matrices, as they arise in kernel based learning or Gaussian process regression, we end up with quasi-sparse matrices. By thresholding small entries, these matrices are compressible to O(N log N) relevant entries, where N is the number of data points. This feature allows for the use of fill-in reducing reorderings to obtain a sparse factorization of the compressed matrices. Besides the comprehensive introduction to samplets and their properties, we present extensive numerical studies to benchmark the approach. Our results demonstrate that samplets mark a considerable step in the direction of making large data sets accessible for analysis.

【2】 KaFiStO: A Kalman Filtering Framework for Stochastic Optimization 标题:KaFiStO:一种随机优化的卡尔曼滤波框架

作者:Aram Davtyan,Sepehr Sameni,Llukman Cerkezi,Givi Meishvilli,Adam Bielski,Paolo Favaro 机构:Computer Vision Group, University of Bern 链接:https://arxiv.org/abs/2107.03331 摘要:优化问题通常被视为确定性问题,通过梯度下降等迭代过程求解。然而,在训练神经网络时,由于样本子集的随机选择,损失函数随(迭代)时间而变化。这种随机化将优化问题转化为随机问题。我们建议将损失视为相对于某个参考最优值的含噪观测。对损失的这种解释使我们可以采用卡尔曼滤波作为优化器,因为其递推公式正是为从含噪测量中估计未知参数而设计的。此外,我们还证明了描述未知参数演化的卡尔曼滤波动力学模型可以用来刻画动量法和Adam等先进方法的梯度动力学。我们将这种随机优化方法称为KaFiStO。KaFiStO是一种易于实现、可扩展且高效的神经网络训练方法。我们表明,在计算机视觉和语言建模等多种神经网络架构和机器学习任务上,它所产生的参数估计与现有优化算法相当甚至更好。 摘要:Optimization is often cast as a deterministic problem, where the solution is found through some iterative procedure such as gradient descent. However, when training neural networks the loss function changes over (iteration) time due to the randomized selection of a subset of the samples. This randomization turns the optimization problem into a stochastic one. We propose to consider the loss as a noisy observation with respect to some reference optimum. This interpretation of the loss allows us to adopt Kalman filtering as an optimizer, as its recursive formulation is designed to estimate unknown parameters from noisy measurements. Moreover, we show that the Kalman Filter dynamical model for the evolution of the unknown parameters can be used to capture the gradient dynamics of advanced methods such as Momentum and Adam. We call this stochastic optimization method KaFiStO. KaFiStO is an easy to implement, scalable, and efficient method to train neural networks. We show that it also yields parameter estimates that are on par with or better than existing optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling.

【3】 Predicting with Confidence on Unseen Distributions 标题:不可见分布的置信度预测

作者:Devin Guillory,Vaishaal Shankar,Sayna Ebrahimi,Trevor Darrell,Ludwig Schmidt 机构:UC Berkeley, Amazon, Toyota Research Institute 链接:https://arxiv.org/abs/2107.03315 摘要:最近的研究表明,当在来自与训练分布接近但不同的分布的数据上评估时,机器学习模型的性能会有很大差异。因此,预测模型在未见分布上的性能是一个重要挑战。我们的工作结合了领域自适应和预测不确定性文献中的技术,使我们能够在不访问标记数据的情况下,预测模型在具有挑战性的未见分布上的精度。在分布偏移的背景下,分布距离常被用来调整模型并改善其在新领域的性能,然而这些研究常常忽略精度估计或其他形式的预测不确定性。通过考察一系列成熟的分布距离(如Frechet距离或最大均值差异),我们确定它们无法对分布偏移下的性能给出可靠估计。另一方面,我们发现分类器预测的置信度之差(DoC)能够成功估计分类器在各种偏移下的性能变化。我们特别研究了合成分布偏移与自然分布偏移之间的区别,并观察到尽管DoC很简单,它却始终优于其他分布差异量化方法。在若干现实且具有挑战性的分布偏移上(例如ImageNet-Vid-Robust和ImageNet-Rendition数据集),DoC可将预测误差减少近一半(46%)。 摘要:Recent work has shown that the performance of machine learning models can vary substantially when models are evaluated on data drawn from a distribution that is close to but different from the training distribution. As a result, predicting model performance on unseen distributions is an important challenge. Our work connects techniques from domain adaptation and predictive uncertainty literature, and allows us to predict model accuracy on challenging unseen distributions without access to labeled data. In the context of distribution shift, distributional distances are often used to adapt models and improve their performance on new domains, however accuracy estimation, or other forms of predictive uncertainty, are often neglected in these investigations. Through investigating a wide range of established distributional distances, such as Frechet distance or Maximum Mean Discrepancy, we determine that they fail to induce reliable estimates of performance under distribution shift. On the other hand, we find that the difference of confidences (DoC) of a classifier's predictions successfully estimates the classifier's performance change over a variety of shifts. We specifically investigate the distinction between synthetic and natural distribution shifts and observe that despite its simplicity DoC consistently outperforms other quantifications of distributional difference. $DoC$ reduces predictive error by almost half ($46\%$) on several realistic and challenging distribution shifts, e.g., on the ImageNet-Vid-Robust and ImageNet-Rendition datasets.
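DoC的计算本身非常简单:源分布与目标分布上平均最大softmax置信度之差,用来校正源精度。下面给出一个随机模拟的NumPy示意,具体校准细节以论文为准。

```python
import numpy as np

def predict_target_accuracy(conf_src, acc_src, conf_tgt):
    """acc_tgt ≈ acc_src - DoC,其中 DoC 为两分布上平均置信度之差。"""
    doc = conf_src.mean() - conf_tgt.mean()
    return acc_src - doc

rng = np.random.default_rng(0)
probs_src = rng.dirichlet(np.ones(10) * 0.3, size=1000)   # 源验证集 softmax 输出(模拟)
probs_tgt = rng.dirichlet(np.ones(10) * 0.8, size=1000)   # 目标分布 softmax 输出(模拟)
print('预测的目标精度:', predict_target_accuracy(probs_src.max(1), 0.76, probs_tgt.max(1)))
```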

【4】 Scalable Data Balancing for Unlabeled Satellite Imagery 标题:无标签卫星影像的可伸缩数据平衡

作者:Deep Patel,Erin Gao,Anirudh Koul,Siddha Ganju,Meher Anand Kasam 备注:Accepted to COSPAR 2021 Workshop on Machine Learning for Space Sciences. 5 pages, 9 figures 链接:https://arxiv.org/abs/2107.03227 摘要:数据不平衡是机器学习中普遍存在的问题。在大规模收集和标注的数据集中,数据不平衡既可以通过对高频类欠采样、对稀有类过采样来手工缓解,也可以通过插补和数据增强技术加以应对。在这两种情况下,平衡数据都需要标签;换句话说,只有带标注的数据才能被平衡。收集完全标注的数据集是一项挑战,对于大型卫星系统尤其如此,例如未标注的NASA 35 PB地球影像数据集。尽管NASA地球影像数据集没有标注,但我们可以依赖数据源的隐含属性来对其不平衡性作出假设,例如地球影像中陆地和水域的分布。我们提出了一种新的迭代方法来平衡未标注数据。该方法利用图像嵌入作为图像标签的代理,用来平衡数据,最终在训练时提高整体精度。 摘要:Data imbalance is a ubiquitous problem in machine learning. In large scale collected and annotated datasets, data imbalance is either mitigated manually by undersampling frequent classes and oversampling rare classes, or planned for with imputation and augmentation techniques. In both cases balancing data requires labels. In other words, only annotated data can be balanced. Collecting fully annotated datasets is challenging, especially for large scale satellite systems such as the unlabeled NASA's 35 PB Earth Imagery dataset. Although the NASA Earth Imagery dataset is unlabeled, there are implicit properties of the data source that we can rely on to hypothesize about its imbalance, such as distribution of land and water in the case of the Earth's imagery. We present a new iterative method to balance unlabeled data. Our method utilizes image embeddings as a proxy for image labels that can be used to balance data, and ultimately when trained increases overall accuracy.
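以嵌入为标签代理来平衡无标注数据,一种常见做法是对嵌入做聚类后按簇均匀采样;下面给出这一思路的示意(sklearn KMeans),簇数与采样规模均为假设,并非论文的迭代算法本身。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.standard_normal((20000, 128))            # 预训练模型得到的图像嵌入(模拟)
labels = KMeans(n_clusters=8, n_init=4, random_state=0).fit_predict(emb)

per_cluster = 500                                  # 每簇采样数;稀有簇允许有放回过采样
balanced = np.concatenate([
    rng.choice(np.where(labels == c)[0], per_cluster,
               replace=bool((labels == c).sum() < per_cluster))
    for c in range(8)
])
print('平衡后的样本索引数:', balanced.size)
```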

【5】 Egocentric Videoconferencing 标题:以自我为中心的视频会议

作者:Mohamed Elgharib,Mohit Mendiratta,Justus Thies,Matthias Nießner,Hans-Peter Seidel,Ayush Tewari,Vladislav Golyanik,Christian Theobalt 机构:Max Planck Institute for Informatics, Technical University of Munich 备注:None 链接:https://arxiv.org/abs/2107.03109 摘要:我们介绍了一种以自我为中心的视频会议方法,它可以为佩戴智能眼镜或其他混合现实设备的人实现免提视频通话。视频会议传递有价值的非语言交流和面部表情线索,但通常需要前置摄像头。当一个人在移动时,在免提环境中使用正面摄像头是不切实际的;即使是久坐时把手机摄像头举在脸前也不方便。为了克服这些问题,我们提出了一种低成本、可穿戴的自我中心相机装置,可以集成到智能眼镜中。我们的目标是模仿经典的视频通话,因此将该相机的自我中心视角转换为正面视频。为此,我们采用一个条件生成对抗神经网络,学习从高度扭曲的自我中心视图到视频会议中常见的正面视图的转换。我们的方法直接从自我中心视图学习迁移表情细节,而不像相关的人脸重演方法那样依赖复杂的中间参数化表情模型。我们成功地处理了基于参数化blendshape的方案难以捕捉的微妙表情,例如舌头运动、眼球运动、眨眼、强烈表情和深度变化的运动。为了控制目标视图中刚性头部的运动,我们以一个运动的中性人脸的合成渲染作为生成器的条件输入,这使我们能够在不同头部姿态下合成结果。我们的技术将视频到视频转换网络与时间鉴别器相结合,实时生成时间上平滑、视频级真实感的渲染结果。通过与相关最新方法的比较,我们展示了该技术的改进能力。 摘要:We introduce a method for egocentric videoconferencing that enables hands-free video calls, for instance by people wearing smart glasses or other mixed-reality devices. Videoconferencing portrays valuable non-verbal communication and face expression cues, but usually requires a front-facing camera. Using a frontal camera in a hands-free setting when a person is on the move is impractical. Even holding a mobile phone camera in the front of the face while sitting for a long duration is not convenient. To overcome these issues, we propose a low-cost wearable egocentric camera setup that can be integrated into smart glasses. Our goal is to mimic a classical video call, and therefore, we transform the egocentric perspective of this camera into a front facing video. To this end, we employ a conditional generative adversarial neural network that learns a transition from the highly distorted egocentric views to frontal views common in videoconferencing. Our approach learns to transfer expression details directly from the egocentric view without using a complex intermediate parametric expressions model, as it is used by related face reenactment methods. We successfully handle subtle expressions, not easily captured by parametric blendshape-based solutions, e.g., tongue movement, eye movements, eye blinking, strong expressions and depth varying movements. To get control over the rigid head movements in the target view, we condition the generator on synthetic renderings of a moving neutral face. This allows us to synthesis results at different head poses. Our technique produces temporally smooth video-realistic renderings in real-time using a video-to-video translation network in conjunction with a temporal discriminator. We demonstrate the improved capabilities of our technique by comparing against related state-of-the art approaches.

【6】 Bi-level Feature Alignment for Versatile Image Translation and Manipulation 标题:用于通用图像转换与编辑的双层特征对齐

作者:Fangneng Zhan,Yingchen Yu,Rongliang Wu,Kaiwen Cui,Aoran Xiao,Shijian Lu,Ling Shao 机构:Lu are with the School of Computer Science and Engineering, Nanyang Technological University, Shao is with the Inception Institute of Artificial Intelligence 备注:Submitted to TPAMI 链接:https://arxiv.org/abs/2107.03021 摘要:生成对抗网络(GANs)在图像转换和编辑方面取得了巨大成功。然而,具有忠实风格控制的高保真图像生成在计算机视觉中仍是一大挑战。本文提出了一个通用的图像转换与编辑框架,通过显式建立对应关系,在图像生成中实现准确的语义和风格引导。为了处理建立密集对应所产生的二次复杂度,我们引入了一种双层特征对齐策略:先用top-k操作对块级特征排序,再在块特征之间进行密集注意,从而大幅降低内存开销。由于top-k操作涉及索引交换而无法进行梯度传播,我们建议用正则化的推土机(earth mover)问题来近似这一不可微的top-k操作,使其梯度可以有效地反向传播。此外,我们设计了一种新的语义位置编码机制,为每个语义区域建立坐标,在建立对应关系的同时保留纹理结构。我们还设计了一个新颖的置信特征注入模块,根据所建对应关系的可靠性自适应地融合特征,以缓解失配问题。大量实验表明,与现有最优方法相比,该方法在定性和定量上均取得了更优的性能。代码可在 https://github.com/fnzhan/RABIT 获得。 摘要:Generative adversarial networks (GANs) have achieved great success in image translation and manipulation. However, high-fidelity image generation with faithful style control remains a grand challenge in computer vision. This paper presents a versatile image translation and manipulation framework that achieves accurate semantic and style guidance in image generation by explicitly building a correspondence. To handle the quadratic complexity incurred by building the dense correspondences, we introduce a bi-level feature alignment strategy that adopts a top-$k$ operation to rank block-wise features followed by dense attention between block features which reduces memory cost substantially. As the top-$k$ operation involves index swapping which precludes the gradient propagation, we propose to approximate the non-differentiable top-$k$ operation with a regularized earth mover's problem so that its gradient can be effectively back-propagated. In addition, we design a novel semantic position encoding mechanism that builds up coordinate for each individual semantic region to preserve texture structures while building correspondences. Further, we design a novel confidence feature injection module which mitigates mismatch problem by fusing features adaptively according to the reliability of built correspondences. Extensive experiments show that our method achieves superior performance qualitatively and quantitatively as compared with the state-of-the-art. The code is available at \href{https://github.com/fnzhan/RABIT}{https://github.com/fnzhan/RABIT}.

【7】 Visual Odometry with an Event Camera Using Continuous Ray Warping and Volumetric Contrast Maximization

Authors: Yifu Wang, Jiaqi Yang, Xin Peng, Peng Wu, Ling Gao, Kun Huang, Jiaben Chen, Laurent Kneip Affiliations: Mobile Perception Lab, School of Information Science and Technology, ShanghaiTech University Link: https://arxiv.org/abs/2107.03011 Abstract: We present a new solution to tracking and mapping with an event camera. The motion of the camera contains both rotation and translation, and the displacements happen in an arbitrarily structured environment. As a result, the image matching may no longer be represented by a low-dimensional homographic warping, thus complicating an application of the commonly used Image of Warped Events (IWE). We introduce a new solution to this problem by performing contrast maximization in 3D. The 3D location of the rays cast for each event is smoothly varied as a function of a continuous-time motion parametrization, and the optimal parameters are found by maximizing the contrast in a volumetric ray density field. Our method thus performs joint optimization over motion and structure. The practical validity of our approach is supported by an application to AGV motion estimation and 3D reconstruction with a single vehicle-mounted event camera. The method approaches the performance obtained with regular cameras, and eventually outperforms them in challenging visual conditions.
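A minimal sketch of the volumetric objective may help: warped event rays are accumulated into a voxel grid, and the contrast (variance) of the resulting density field is the score to be maximized over the motion parameters. The gridding scheme and names below are simplifying assumptions, not the paper's implementation.

```python
# A simplified illustration of volumetric contrast maximization: events are
# back-projected into a voxel grid after being warped by a candidate
# continuous-time motion, and the objective is the contrast (variance) of the
# resulting ray-density field.
import numpy as np

def ray_density_contrast(points_3d, grid_size=32):
    # points_3d: (N, 3) warped 3D locations of event rays, assumed in [0, 1)^3.
    idx = np.clip((points_3d * grid_size).astype(int), 0, grid_size - 1)
    density = np.zeros((grid_size,) * 3)
    np.add.at(density, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)  # accumulate events
    return density.var()  # higher variance = sharper, better-aligned density field

# The motion parameters would be chosen to maximize this contrast, e.g. with a
# generic optimizer over the continuous-time trajectory parametrization.
events = np.random.rand(1000, 3)
print(ray_density_contrast(events))
```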

【8】 RAM-VO: Less is more in Visual Odometry

Authors: Iury Cleveston, Esther L. Colombini Affiliations: Laboratory of Robotics and Cognitive Systems (LaRoCS), Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil Link: https://arxiv.org/abs/2107.02974 Abstract: Building vehicles capable of operating without human supervision requires determining the agent's pose. Visual Odometry (VO) algorithms estimate the egomotion using only visual changes from the input images. The most recent VO methods implement deep-learning techniques using convolutional neural networks (CNNs) extensively, which adds a substantial cost when dealing with high-resolution images. Furthermore, in VO tasks, more input data does not mean a better prediction; on the contrary, the architecture may filter out useless information. Therefore, the implementation of computationally efficient and lightweight architectures is essential. In this work, we propose RAM-VO, an extension of the Recurrent Attention Model (RAM) for visual odometry tasks. RAM-VO improves the visual and temporal representation of information and implements the Proximal Policy Optimization (PPO) algorithm to learn robust policies. The results indicate that RAM-VO can perform regressions with six degrees of freedom from monocular input images using approximately 3 million parameters. In addition, experiments on the KITTI dataset demonstrate that RAM-VO achieves competitive results using only 5.7% of the available visual information.
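The 5.7% figure follows from the glimpse mechanism of recurrent attention models: at each step the learned policy selects a location and only a small patch is observed. A minimal sketch of such a glimpse extraction, with illustrative sizes, is given below; RAM-VO's actual glimpse sensor, recurrent core, and PPO training loop are omitted.

```python
# A sketch of the glimpse mechanism behind recurrent attention models such as RAM:
# at each step the policy picks a location and only a small patch is observed,
# which is why only a fraction of the image ever reaches the network.
# Names and sizes are illustrative.
import torch

def extract_glimpse(image, loc, size=16):
    # image: (C, H, W); loc: (row, col) center of the glimpse, in pixels.
    _, h, w = image.shape
    r = max(0, min(int(loc[0]) - size // 2, h - size))
    c = max(0, min(int(loc[1]) - size // 2, w - size))
    return image[:, r:r + size, c:c + size]  # (C, size, size) observed patch

frame = torch.randn(1, 128, 384)           # e.g. a grayscale KITTI-like frame
patch = extract_glimpse(frame, (64, 200))  # the recurrent core sees only this
print(patch.shape)                         # torch.Size([1, 16, 16])
```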

【9】 Poly-NL: Linear Complexity Non-local Layers with Polynomials

Authors: Francesca Babiloni, Ioannis Marras, Filippos Kokkinos, Jiankang Deng, Grigorios Chrysos, Stefanos Zafeiriou Affiliations: Huawei Noah's Ark Lab; Imperial College London; University College London; École Polytechnique Fédérale de Lausanne Note: 11 pages, 4 figures Link: https://arxiv.org/abs/2107.02859 Abstract: Spatial self-attention layers, in the form of Non-Local blocks, introduce long-range dependencies in Convolutional Neural Networks by computing pairwise similarities among all possible positions. Such pairwise functions underpin the effectiveness of non-local layers, but also determine a complexity that scales quadratically with respect to the input size in both space and time. This is a severely limiting factor that practically hinders the applicability of non-local blocks to even moderately sized inputs. Previous works focused on reducing the complexity by modifying the underlying matrix operations; in this work, however, we aim to retain the full expressiveness of non-local layers while keeping complexity linear. We overcome the efficiency limitation of non-local blocks by framing them as special cases of 3rd-order polynomial functions. This fact enables us to formulate novel fast Non-Local blocks capable of reducing the complexity from quadratic to linear with no loss in performance, by replacing any direct computation of pairwise similarities with element-wise multiplications. The proposed method, which we dub "Poly-NL", is competitive with state-of-the-art performance across image recognition, instance segmentation, and face detection tasks, while having considerably less computational overhead.
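The complexity gap can be illustrated directly. The sketch below contrasts a standard non-local block, which materializes an $N \times N$ similarity matrix, with a linear-time third-order interaction built from element-wise multiplications in the spirit of Poly-NL; the linear branch is a simplified stand-in for, not a reproduction of, the exact Poly-NL layer.

```python
# A sketch contrasting the quadratic non-local block with a linear-time
# third-order interaction built from element-wise multiplications.
# This illustrates the complexity argument only, not the exact Poly-NL layer.
import torch

N, C = 1024, 64
x = torch.randn(N, C)  # N spatial positions, C channels

# Standard non-local: materializes an (N, N) pairwise similarity -> O(N^2 * C).
pairwise = torch.softmax(x @ x.t() / C ** 0.5, dim=-1)
y_quadratic = pairwise @ x

# Third-order polynomial alternative: a global context vector is computed once,
# then mixed into each position with element-wise products -> O(N * C),
# with no (N, N) matrix ever formed.
context = x.mean(dim=0, keepdim=True)                   # (1, C), aggregates all positions
y_linear = x * (x * context).mean(dim=1, keepdim=True)  # (N, C), third-order in x

print(y_quadratic.shape, y_linear.shape)
```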
