
Computer Vision arXiv Academic Digest [6.23]

Author: arXiv每日学术速递 (WeChat official account)
Published 2021-07-02 18:23:12

Visit www.arxivdaily.com for digests with abstracts, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and posting features.

cs.CV: 73 papers today

Transformer (5 papers)

【1】 A Latent Transformer for Disentangled and Identity-Preserving Face Editing

Authors: Xu Yao, Alasdair Newson, Yann Gousseau, Pierre Hellier
Affiliations: LTCI, Télécom Paris, Institut Polytechnique de Paris, France; InterDigital R&I, Cesson-Sévigné, France
Link: https://arxiv.org/abs/2106.11895
Abstract: High quality facial image editing is a challenging problem in the movie post-production industry, requiring a high degree of control and identity preservation. Previous works that attempt to tackle this problem may suffer from the entanglement of facial attributes and the loss of the person's identity. Furthermore, many algorithms are limited to a certain task. To tackle these limitations, we propose to edit facial attributes via the latent space of a StyleGAN generator, by training a dedicated latent transformation network and incorporating explicit disentanglement and identity preservation terms in the loss function. We further introduce a pipeline to generalize our face editing to videos. Our model achieves a disentangled, controllable, and identity-preserving facial attribute editing, even in the challenging case of real (i.e., non-synthetic) images and videos. We conduct extensive experiments on image and video datasets and show that our model outperforms other state-of-the-art methods in visual quality and quantitative evaluation.
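
The core recipe above — shifting a StyleGAN latent code along a learned attribute direction while penalizing drift from the original code — can be sketched in a few lines. The fixed direction vector, 3-D latent, and squared-distance identity proxy below are illustrative assumptions, not the paper's actual latent transformer:

```python
# Toy sketch of latent-space attribute editing with an identity-preservation
# penalty. A real system would learn `direction` with a dedicated latent
# transformer network; here it is a fixed, hypothetical vector.

def edit_latent(w, direction, alpha):
    """Move latent code w along an attribute direction with strength alpha."""
    return [wi + alpha * di for wi, di in zip(w, direction)]

def identity_loss(w_edit, w_orig):
    """Penalize deviation from the original latent (a crude proxy for identity)."""
    return sum((a - b) ** 2 for a, b in zip(w_edit, w_orig))

w = [0.5, -1.0, 2.0]      # original latent code (toy, 3-D)
smile = [1.0, 0.0, 0.0]   # hypothetical "+ Smile" direction

w_plus = edit_latent(w, smile, alpha=0.8)
w_zero = edit_latent(w, smile, alpha=0.0)
```

With alpha = 0 the edit is a no-op, which is the behavior an identity-preservation term encourages for unrelated latent dimensions.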

【2】 A Comparison for Patch-level Classification of Deep Learning Methods on Transparent Images: from Convolutional Neural Networks to Visual Transformers

Authors: Hechen Yang, Chen Li, Peng Zhao, Ao Chen, Xin Zhao, Marcin Grzegorzek
Affiliations: Microscopic Image and Medical Image Analysis Group, MBIE College, Northeastern University, Shenyang, PR China; Institute of Medical Informatics, University of Luebeck, Luebeck, Germany
Link: https://arxiv.org/abs/2106.11582
Abstract: Nowadays, analysis of transparent images in the field of computer vision has gradually become a hot spot. In this paper, we compare the classification performance of different deep learning methods on transparent images, which are difficult to analyze. We crop the transparent images into 8 * 8 and 224 * 224 pixel patches in the same proportion, and then divide the patches of both sizes into foreground and background according to the ground truth. We use 4 types of convolutional neural networks and a novel ViT network model to compare foreground and background classification performance. We conclude that ViT performs the worst in classifying 8 * 8 pixel patches, but it outperforms most convolutional neural networks in classifying 224 * 224 pixel patches.
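
The patch-extraction step the authors describe — cutting an image into fixed-size patches and labeling each patch foreground or background from a ground-truth mask — can be sketched as follows. The majority-coverage labeling rule is an assumption; the paper does not spell out its exact criterion:

```python
# Sketch: split an image (as a 2-D list) into size x size patches and label
# each patch foreground if the ground-truth mask covers more than half of it.

def crop_patches(grid, size):
    h, w = len(grid), len(grid[0])
    patches = []
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            patches.append([row[left:left + size] for row in grid[top:top + size]])
    return patches

def label_patch(mask_patch):
    flat = [v for row in mask_patch for v in row]
    return "foreground" if sum(flat) > len(flat) / 2 else "background"

mask = [                      # toy 4x4 ground-truth mask
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
labels = [label_patch(p) for p in crop_patches(mask, 2)]
```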

【3】 DocFormer: End-to-End Transformer for Document Understanding

Authors: Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha
Affiliation: AWS AI
Link: https://arxiv.org/abs/2106.11539
Abstract: We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).

【4】 MODETR: Moving Object Detection with Transformers

Authors: Eslam Mohamed, Ahmad El-Sallab
Affiliation: Deep Learning Research, Valeo R&D Cairo, Egypt
Link: https://arxiv.org/abs/2106.11422
Abstract: Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline. MOD is usually handled via 2-stream convolutional architectures that incorporate both appearance and motion cues, without considering the inter-relations between the spatial and motion features. In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams. We propose MODETR; a Moving Object DEtection TRansformer network, comprised of multi-stream transformer encoders for both spatial and motion modalities, and an object transformer decoder that produces the moving objects' bounding boxes using set predictions. The whole architecture is trained end-to-end using a bi-partite loss. Several methods of incorporating motion cues with the Transformer model are explored, including two-stream RGB and Optical Flow (OF) methods, and multi-stream architectures that take advantage of sequence information. To incorporate the temporal information, we propose a new Temporal Positional Encoding (TPE) approach to extend the Spatial Positional Encoding (SPE) in DETR. We explore two architectural choices for that, balancing between speed and time. To evaluate our network, we perform the MOD task on the KITTI MOD [6] data set. Results show a significant 5% mAP improvement of the Transformer network for MOD over state-of-the-art methods. Moreover, the proposed TPE encoding provides a 10% mAP improvement over the SPE baseline.
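
DETR's sinusoidal Spatial Positional Encoding, and the paper's idea of extending it with a time index (TPE), can be sketched as follows. Combining the spatial and temporal codes by simple addition is our assumption about the merge, not the paper's specification:

```python
import math

def sinusoidal_pe(pos, dim):
    """Standard sinusoidal positional encoding, as used by DETR's SPE."""
    pe = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def temporal_pe(x_pos, t_pos, dim):
    """Hypothetical TPE: add a time-index encoding to the spatial one."""
    spe = sinusoidal_pe(x_pos, dim)
    tpe = sinusoidal_pe(t_pos, dim)
    return [s + t for s, t in zip(spe, tpe)]

code = temporal_pe(x_pos=3, t_pos=1, dim=8)
```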

【5】 Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation

Authors: Eslam Mohamed, Ahmed El-Sallab
Affiliation: Valeo R&D Cairo, Egypt
Link: https://arxiv.org/abs/2106.11401
Abstract: Moving objects have special importance for Autonomous Driving tasks. Detecting moving objects can be posed as Moving Object Segmentation, by segmenting the object pixels, or Moving Object Detection, by generating a bounding box for the moving targets. In this paper, we present a Multi-Task Learning architecture, based on Transformers, to jointly perform both tasks through one network. Due to the importance of the motion features to the task, the whole setup is based on a Spatio-Temporal aggregation. We evaluate the performance of the individual task architectures versus the MTL setup, both with early shared encoders and late shared encoder-decoder transformers. For the latter, we present a novel joint tasks query decoder transformer, which enables us to have task-dedicated heads out of the shared model. To evaluate our approach, we use the KITTI MOD [29] data set. Results show a 1.5% mAP improvement for Moving Object Detection, and a 2% IoU improvement for Moving Object Segmentation, over the individual task networks.

Detection (7 papers)

【1】 Towards Reducing Labeling Cost in Deep Object Detection

Authors: Ismail Elezi, Zhiding Yu, Anima Anandkumar, Laura Leal-Taixé, Jose M. Alvarez
Affiliations: TUM; NVIDIA; Caltech
Notes: Includes supplementary material
Link: https://arxiv.org/abs/2106.11921
Abstract: Deep neural networks have reached very high accuracy on object detection but their success hinges on large amounts of labeled data. To reduce the dependency on labels, various active-learning strategies have been proposed, typically based on the confidence of the detector. However, these methods are biased towards best-performing classes and can lead to acquired datasets that are not good representatives of the data in the testing set. In this work, we propose a unified framework for active learning that considers both the uncertainty and the robustness of the detector, ensuring that the network performs accurately in all classes. Furthermore, our method is able to pseudo-label the very confident predictions, suppressing a potential distribution drift while further boosting the performance of the model. Experiments show that our method comprehensively outperforms a wide range of active-learning methods on PASCAL VOC07+12 and MS-COCO, having up to a 7.7% relative improvement, or up to 82% reduction in labeling cost.
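
The two acquisition mechanisms described — querying labels for uncertain predictions and pseudo-labeling the very confident ones — can be sketched with entropy as the uncertainty measure. The entropy criterion and the 0.9 confidence threshold are illustrative choices, not the paper's exact scoring function:

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def split_pool(predictions, threshold=0.9):
    """Route each unlabeled sample: query a human label for uncertain ones,
    pseudo-label the very confident ones, leave the rest unlabeled."""
    query, pseudo = [], []
    for idx, probs in enumerate(predictions):
        if max(probs) >= threshold:
            pseudo.append((idx, probs.index(max(probs))))
        elif entropy(probs) > 0.5 * math.log(len(probs)):  # high uncertainty
            query.append(idx)
    return query, pseudo

preds = [
    [0.95, 0.03, 0.02],  # confident -> pseudo-label class 0
    [0.34, 0.33, 0.33],  # near-uniform -> query annotator
    [0.88, 0.10, 0.02],  # middling -> neither
]
query, pseudo = split_pool(preds)
```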

【2】 Data Augmentation for Opcode Sequence Based Malware Detection

Authors: Niall McLaughlin, Jesus Martinez del Rincon
Affiliation: The Centre for Secure Information Technologies (CSIT), Queen's University Belfast
Notes: 11 pages, 3 figures
Link: https://arxiv.org/abs/2106.11821
Abstract: Data augmentation has been successfully used in many areas of deep-learning to significantly improve model performance. Typically data augmentation simulates realistic variations in data in order to increase the apparent diversity of the training-set. However, for opcode-based malware analysis, where deep learning methods are already achieving state-of-the-art performance, it is not immediately clear how to apply data augmentation. In this paper we study different methods of data augmentation, starting with basic methods using fixed transformations and moving to methods that adapt to the data. We propose a novel data augmentation method based on using an opcode embedding layer within the network and its corresponding opcode embedding matrix to perform adaptive data augmentation during training. To the best of our knowledge this is the first paper to carry out a systematic study of different augmentation methods applied to opcode sequence based malware classification.
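
A plausible reading of the proposed embedding-based augmentation is that an opcode may be swapped for a near neighbour in the learned embedding space. The tiny hand-made embedding table below is purely illustrative, and a real augmenter would replace only a random fraction of opcodes rather than all of them:

```python
# Sketch: embedding-space augmentation for opcode sequences. Each opcode is
# replaced by its nearest neighbour in a (hypothetical) learned embedding
# table, simulating realistic variation in the training set.

EMB = {                  # toy 2-D "embedding matrix"
    "mov": (0.0, 0.0),
    "lea": (0.1, 0.1),   # close to mov
    "add": (1.0, 0.0),
    "sub": (1.1, 0.1),   # close to add
    "jmp": (5.0, 5.0),
}

def nearest(op):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    others = [o for o in EMB if o != op]
    return min(others, key=lambda o: dist2(EMB[o], EMB[op]))

def augment(seq):
    """Replace every opcode with its embedding-space nearest neighbour."""
    return [nearest(op) for op in seq]

augmented = augment(["mov", "add", "jmp"])
```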

【3】 Proposal Relation Network for Temporal Action Detection

Authors: Xiang Wang, Zhiwu Qing, Ziyuan Huang, Yutong Feng, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Changxin Gao, Nong Sang
Affiliations: Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Alibaba Group
Link: https://arxiv.org/abs/2106.11812
Abstract: This technical report presents our solution for the temporal action detection task in the ActivityNet Challenge 2021. The purpose of this task is to locate and identify actions of interest in long untrimmed videos. The crucial challenge of the task comes from the fact that the temporal duration of actions varies dramatically, and the target actions are typically embedded in a background of irrelevant activities. Our solution builds on BMN and mainly contains three steps: 1) action classification and feature encoding by SlowFast, CSN and ViViT; 2) proposal generation, where we improve BMN by embedding the proposed Proposal Relation Network (PRN), with which we can generate proposals of high quality; 3) action detection, where we calculate the detection results by assigning the proposals the corresponding classification results. Finally, we ensemble the results under different settings and achieve 44.7% on the test set, which improves the champion result of ActivityNet 2020 by 1.9% in terms of average mAP.

【4】 Confidence-Aware Learning for Camouflaged Object Detection

Authors: Jiawei Liu, Jing Zhang, Nick Barnes
Affiliation: ANU
Link: https://arxiv.org/abs/2106.11641
Abstract: Confidence-aware learning is proven as an effective solution to prevent networks from becoming overconfident. We present a confidence-aware camouflaged object detection framework using dynamic supervision to produce both an accurate camouflage map and a meaningful "confidence" representing the model's awareness about the current prediction. A camouflaged object detection network is designed to produce our camouflage prediction. Then, we concatenate it with the input image and feed it to the confidence estimation network to produce a one-channel confidence map. We generate dynamic supervision for the confidence estimation network, representing the agreement of the camouflage prediction with the ground truth camouflage map. With the produced confidence map, we introduce confidence-aware learning with the confidence map as guidance to pay more attention to the hard/low-confidence pixels in the loss function. We claim that, once trained, our confidence estimation network can evaluate the pixel-wise accuracy of the prediction without relying on the ground truth camouflage map. Extensive results on four camouflaged object detection testing datasets illustrate the superior performance of the proposed model in explaining the camouflage prediction.
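
The confidence-aware loss — paying more attention to hard/low-confidence pixels — can be sketched as a confidence-weighted binary cross-entropy. The specific weight 1 + (1 - c) is our assumption, not the paper's formula:

```python
import math

def weighted_bce(pred, target, confidence):
    """Per-pixel BCE, upweighted where the confidence map is low (hard pixels).
    The weight 1 + (1 - c) is an illustrative choice, not the paper's loss."""
    total = 0.0
    for p, t, c in zip(pred, target, confidence):
        bce = -(t * math.log(p) + (1 - t) * math.log(1 - p))
        total += (1.0 + (1.0 - c)) * bce
    return total / len(pred)

pred = [0.7, 0.4, 0.9]    # predicted camouflage probabilities
target = [1, 1, 0]        # ground-truth camouflage mask
hard = weighted_bce(pred, target, confidence=[0.1, 0.1, 0.1])
easy = weighted_bce(pred, target, confidence=[0.9, 0.9, 0.9])
```

The same prediction error costs more where the model's confidence is low, steering training towards the ambiguous pixels.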

【5】 Creating A New Color Space utilizing PSO and FCM to Perform Skin Detection by using Neural Network and ANFIS

Authors: Kobra Nazaria, Samaneh Mazaheri, Bahram Sadeghi Bigham
Link: https://arxiv.org/abs/2106.11563
Abstract: Skin color detection is an essential step in various applications related to computer vision, including face detection, finding pornographic images in movies and photos, determining ethnicity and age, diagnosis, and so on. Therefore, proposing a proper skin detection method can provide a solution to several problems. In this study, first a new color space is created using the FCM and PSO algorithms. Then, skin classification is performed in the new color space using linear and nonlinear models. Additionally, it is performed in the RGB and LAB color spaces by using ANFIS and a neural network. Skin detection in the RGB color space is performed using the Mahalanobis distance and Euclidean distance algorithms. In comparison, this method has 18.38% higher accuracy than the most accurate method on the same database. Additionally, this method achieves 90.05% in equal error rate (1-EER) in testing on the COMPAQ dataset and 92.93% accuracy in testing on the Pratheepan dataset; compared to the previous method on the COMPAQ database, 1-EER has increased by 0.87%.
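
The Mahalanobis-distance skin classifier mentioned above can be sketched directly. The 2-D chromaticity mean and covariance are made-up stand-ins for statistics that would be estimated from labelled skin pixels:

```python
# Sketch: classify a pixel as skin if its Mahalanobis distance to the skin
# colour model is below a threshold. Mean/covariance here are made up; in
# practice they are estimated from labelled skin pixels.

def inv_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis2(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^-1 (x - mean)."""
    inv = inv_2x2(cov)
    d = [x[0] - mean[0], x[1] - mean[1]]
    y = [inv[0][0] * d[0] + inv[0][1] * d[1],
         inv[1][0] * d[0] + inv[1][1] * d[1]]
    return d[0] * y[0] + d[1] * y[1]

SKIN_MEAN = (0.45, 0.30)               # toy (r, g) chromaticity mean
SKIN_COV = [[0.02, 0.0], [0.0, 0.01]]  # toy covariance

def is_skin(pixel, threshold=4.0):
    return mahalanobis2(pixel, SKIN_MEAN, SKIN_COV) < threshold
```

Unlike the Euclidean distance, the Mahalanobis distance accounts for the spread of the skin colour distribution along each chromaticity axis.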

【6】 Hand-Drawn Electrical Circuit Recognition using Object Detection and Node Recognition

Authors: Rachala Rohith Reddy, Mahesh Raveendranatha Panicker
Notes: 11 pages, 15 figures, under review at Springer
Link: https://arxiv.org/abs/2106.11559
Abstract: With the recent developments in neural networks, there has been a resurgence in algorithms for the automatic generation of simulation-ready electronic circuits from hand-drawn circuits. However, most of the approaches in the literature were confined to classifying different types of electrical components, and only a few of those methods have shown a way to rebuild the circuit schematic from the scanned image, which is extremely important for further automation of netlist generation. This paper proposes a real-time algorithm for the automatic recognition of hand-drawn electrical circuits based on object detection and circuit node recognition. The proposed approach employs You Only Look Once version 5 (YOLOv5) for detection of circuit components and a novel Hough transform based approach for node recognition. Using the YOLOv5 object detection algorithm, a mean average precision (mAP0.5) of 98.2% is achieved in detecting the components. The proposed method is also able to rebuild the circuit schematic with 80% accuracy.

【7】 GAIA: A Transfer Learning System of Object Detection that Fits Your Needs

Authors: Xingyuan Bu, Junran Peng, Junjie Yan, Tieniu Tan, Zhaoxiang Zhang
Affiliations: University of Chinese Academy of Sciences; Beijing Institute of Technology; Center for Research on Intelligent Perception and Computing, CASIA; SenseTime Group Limited
Notes: CVPR 2021. The first two authors contributed equally. Code is released at this https URL
Link: https://arxiv.org/abs/2106.11346
Abstract: Transfer learning with pre-training on large-scale datasets has played an increasingly significant role in computer vision and natural language processing recently. However, as there exist numerous application scenarios that have distinctive demands such as certain latency constraints and specialized data distributions, it is prohibitively expensive to take advantage of large-scale pre-training for per-task requirements. In this paper, we focus on the area of object detection and present a transfer learning system named GAIA, which could automatically and efficiently give birth to customized solutions according to heterogeneous downstream needs. GAIA is capable of providing powerful pre-trained weights, selecting models that conform to downstream demands such as latency constraints and specified data domains, and collecting relevant data for practitioners who have very few datapoints for their tasks. With GAIA, we achieve promising results on COCO, Objects365, Open Images, Caltech, CityPersons, and UODB, which is a collection of datasets including KITTI, VOC, WiderFace, DOTA, Clipart, Comic, and more. Taking COCO as an example, GAIA is able to efficiently produce models covering a wide range of latency from 16ms to 53ms, and yields AP from 38.2 to 46.5 without bells and whistles. To benefit every practitioner in the community of object detection, GAIA is released at https://github.com/GAIA-vision.
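
GAIA's "select a model that conforms to downstream demands" step reduces, for a latency constraint, to filtering a model zoo and maximizing accuracy. The zoo entries below are hypothetical, using round numbers in the range the abstract reports (16-53 ms, AP 38.2-46.5), not actual GAIA checkpoints:

```python
# Sketch: pick the most accurate detector that satisfies a latency budget.

ZOO = [                      # (name, latency_ms, COCO_AP) -- hypothetical
    ("gaia-tiny", 16, 38.2),
    ("gaia-small", 25, 41.0),
    ("gaia-base", 38, 44.3),
    ("gaia-large", 53, 46.5),
]

def select_model(zoo, latency_budget_ms):
    feasible = [m for m in zoo if m[1] <= latency_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda m: m[2])

choice = select_model(ZOO, latency_budget_ms=40)
```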

Classification & Recognition (7 papers)

【1】 PALMAR: Towards Adaptive Multi-inhabitant Activity Recognition in Point-Cloud Technology

Authors: Mohammad Arif Ul Alam, Md Mahmudur Rahman, Jared Q Widberg
Affiliation: Department of Computer Science, University of Massachusetts Lowell, MA, USA
Notes: Accepted at IEEE International Conference on Computer Communications 2021
Link: https://arxiv.org/abs/2106.11902
Abstract: With the advancement of deep neural networks and computer vision-based Human Activity Recognition (HAR), employment of Point-Cloud Data (PCD) technologies (LiDAR, mmWave) has seen a lot of interest due to their privacy-preserving nature. Given the high promise of accurate PCD technologies, we develop PALMAR, a multiple-inhabitant activity recognition system, by employing efficient signal processing and novel machine learning techniques to track individual persons, towards an adaptive multi-inhabitant tracking and HAR system. More specifically, we propose (i) a voxelized feature representation-based real-time PCD fine-tuning method, (ii) efficient clustering (DBSCAN and BIRCH), Adaptive Order Hidden Markov Model based multi-person tracking and crossover ambiguity reduction techniques, and (iii) a novel adaptive deep learning-based domain adaptation technique to improve the accuracy of HAR in the presence of data scarcity and diversity (device, location and population diversity). We experimentally evaluate our framework and systems using (i) real-time PCD collected by three devices (3D LiDAR and 79 GHz mmWave) from 6 participants, (ii) one publicly available 3D LiDAR activity dataset (28 participants) and (iii) an embedded hardware prototype system, which provided promising HAR performance in the multi-inhabitant (96%) scenario, with a 63% improvement of multi-person tracking over state-of-the-art frameworks, without losing significant system performance on the edge computing device.
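
The voxelized feature representation underlying PALMAR's point-cloud pipeline can be sketched as quantizing 3-D points into a sparse grid of per-voxel counts. The voxel size and the count-based feature are our illustrative choices:

```python
# Sketch: voxelize a 3-D point cloud into a sparse grid of point counts,
# a common feature representation for LiDAR/mmWave data.

def voxelize(points, voxel_size):
    grid = {}
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[key] = grid.get(key, 0) + 1
    return grid

cloud = [(0.1, 0.2, 0.0), (0.4, 0.4, 0.3), (1.2, 0.1, 0.0), (1.4, 0.3, 0.2)]
grid = voxelize(cloud, voxel_size=0.5)
```

The sparse dictionary keeps memory proportional to the number of occupied voxels, which matters on the edge devices the paper targets.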

【2】 Zero-Shot Chinese Character Recognition with Stroke-Level Decomposition

Authors: Jingye Chen, Bin Li, Xiangyang Xue
Affiliation: Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University
Notes: 7 pages, 7 figures
Link: https://arxiv.org/abs/2106.11613
Abstract: Chinese character recognition has attracted much research interest due to its wide applications. Although it has been studied for many years, some issues in this field have not been completely resolved yet, e.g. the zero-shot problem. Previous character-based and radical-based methods have not fundamentally addressed the zero-shot problem, since some characters or radicals in test sets may not appear in training sets under a data-hungry condition. Inspired by the fact that humans can generalize to know how to write characters unseen before if they have learned the stroke orders of some characters, we propose a stroke-based method by decomposing each character into a sequence of strokes, which are the most basic units of Chinese characters. However, we observe that there is a one-to-many relationship between stroke sequences and Chinese characters. To tackle this challenge, we employ a matching-based strategy to transform the predicted stroke sequence into a specific character. We evaluate the proposed method on handwritten characters, printed artistic characters, and scene characters. The experimental results validate that the proposed method outperforms existing methods on both character zero-shot and radical zero-shot tasks. Moreover, the proposed method can be easily generalized to other languages whose characters can be decomposed into strokes.
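
The matching-based strategy — resolving the one-to-many stroke-to-character mapping by picking the dictionary character whose canonical stroke sequence is closest to the prediction — can be sketched with edit distance. The stroke codes and three-character dictionary are toy stand-ins, not the paper's actual stroke inventory:

```python
# Sketch: map a predicted stroke sequence to the closest character in a
# stroke dictionary using Levenshtein (edit) distance. Stroke codes are
# abbreviations (h=horizontal, v=vertical, p=left-falling, d=dot).

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

STROKE_DICT = {          # toy canonical stroke sequences
    "十": ["h", "v"],
    "三": ["h", "h", "h"],
    "大": ["h", "p", "d"],
}

def match_character(predicted):
    return min(STROKE_DICT, key=lambda ch: edit_distance(predicted, STROKE_DICT[ch]))

best = match_character(["h", "p", "d"])
```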

【3】 Multi-layered Semantic Representation Network for Multi-label Image Classification

Authors: Xiwen Qu, Hao Che, Jun Huang, Linchuan Xu, Xiao Zheng
Affiliations: School of Computer Science and Technology, Anhui University of Technology, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China; Australian National University, Australia
Link: https://arxiv.org/abs/2106.11596
Abstract: Multi-label image classification (MLIC) is a fundamental and practical task, which aims to assign multiple possible labels to an image. In recent years, many deep convolutional neural network (CNN) based approaches have been proposed which model label correlations to discover semantics of labels and learn semantic representations of images. This paper advances this research direction by improving both the modeling of label correlations and the learning of semantic representations. On the one hand, besides the local semantics of each label, we propose to further explore global semantics shared by multiple labels. On the other hand, existing approaches mainly learn the semantic representations at the last convolutional layer of a CNN. But it has been noted that the image representations of different layers of a CNN capture different levels or scales of features and have different discriminative abilities. We thus propose to learn semantic representations at multiple convolutional layers. To this end, this paper designs a Multi-layered Semantic Representation Network (MSRN) which discovers both local and global semantics of labels through modeling label correlations and utilizes the label semantics to guide the semantic representation learning at multiple layers through an attention mechanism. Extensive experiments on four benchmark datasets including VOC 2007, COCO, NUS-WIDE, and Apparel show a competitive performance of the proposed MSRN against state-of-the-art models.

【4】 Unsupervised Embedding Adaptation via Early-Stage Feature Reconstruction for Few-Shot Classification

Authors: Dong Hoon Lee, Sae-Young Chung
Affiliation: School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)
Notes: Accepted to ICML 2021
Link: https://arxiv.org/abs/2106.11486
Abstract: We propose unsupervised embedding adaptation for the downstream few-shot classification task. Based on findings that deep neural networks learn to generalize before memorizing, we develop Early-Stage Feature Reconstruction (ESFR) -- a novel adaptation scheme with feature reconstruction and dimensionality-driven early stopping that finds generalizable features. Incorporating ESFR consistently improves the performance of baseline methods on all standard settings, including the recently proposed transductive method. ESFR used in conjunction with the transductive method further achieves state-of-the-art performance on mini-ImageNet, tiered-ImageNet, and CUB; especially with 1.2%~2.0% improvements in accuracy over the previous best performing method in the 1-shot setting.

【5】 SeqNetVLAD vs PointNetVLAD: Image Sequence vs 3D Point Clouds for Day-Night Place Recognition

Authors: Sourav Garg, Michael Milford
Affiliation: QUT Centre for Robotics, Queensland University of Technology
Notes: Accepted to CVPR 2021 Workshop on 3D Vision and Robotics (3DVR). this https URL
Link: https://arxiv.org/abs/2106.11481
Abstract: Place Recognition is a crucial capability for mobile robot localization and navigation. Image-based or Visual Place Recognition (VPR) is a challenging problem as scene appearance and camera viewpoint can change significantly when places are revisited. Recent VPR methods based on "sequential representations" have shown promising results as compared to traditional sequence score aggregation or single image based techniques. In parallel to these endeavors, 3D point clouds based place recognition is also being explored following the advances in deep learning based point cloud processing. However, a key question remains: is an explicit 3D structure based place representation always superior to an implicit "spatial" representation based on a sequence of RGB images, which can inherently learn scene structure? In this extended abstract, we attempt to compare these two types of methods by considering a similar "metric span" to represent places. We compare a 3D point cloud based method (PointNetVLAD) with image sequence based methods (SeqNet and others) and showcase that image sequence based techniques approach, and can even surpass, the performance achieved by point cloud based methods for a given metric span. These performance variations can be attributed to differences in the data richness of the input sensors as well as data accumulation strategies for a mobile robot. While a perfect apples-to-apples comparison may not be feasible for these two different modalities, the presented comparison takes a step in the direction of answering deeper questions regarding spatial representations, relevant to several applications like Autonomous Driving and Augmented/Virtual Reality. Source code publicly available at https://github.com/oravus/seqNet.

【6】 An Alternative Auxiliary Task for Enhancing Image Classification

Author: Chen Liu
Notes: Work in progress. Suggestions on additional experiments and analyses that may improve this manuscript would be very much appreciated
Link: https://arxiv.org/abs/2106.11478
Abstract: Image reconstruction is likely the most predominant auxiliary task for image classification. In this paper, we investigate "estimating the Fourier Transform of the input image" as a potential alternative auxiliary task, in the hope that it may further boost the performance on the primary task or introduce novel constraints not well covered by image reconstruction. We experimented with five popular classification architectures on the CIFAR-10 dataset, and the empirical results indicated that our proposed auxiliary task generally improves the classification accuracy. More notably, the results showed that in certain cases our proposed auxiliary task may enhance the classifiers' resistance to adversarial attacks generated using the fast gradient sign method.
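
The proposed auxiliary target, the Fourier transform of the input, can be computed with a plain DFT. A 1-D magnitude spectrum keeps the sketch short (the paper works on 2-D images), and using the magnitude as the regression target is our simplifying assumption:

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a 1-D real signal."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]

def fourier_target(signal):
    """Auxiliary regression target: magnitude spectrum of the input."""
    return [abs(c) for c in dft(signal)]

# A constant signal concentrates all energy in the DC (k = 0) bin.
target = fourier_target([1.0, 1.0, 1.0, 1.0])
```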

【7】 Incremental Deep Neural Network Learning using Classification Confidence Thresholding 标题:基于分类置信度阈值的增量式深度神经网络学习

作者:Justin Leo,Jugal Kalita 机构:Department of Computer Science, University of Colorado at Colorado Springs 备注:Accepted to IEEE TNNLS 链接:https://arxiv.org/abs/2106.11437 摘要:大多数现代神经网络分类模型都没有考虑到未知的概念。训练好的神经网络通常是在一个不现实的场景中进行测试,只需要从一组封闭的已知类中提取例子。为了开发更真实的模型,引入了在开放集环境中工作的概念。这反过来又导致了增量学习的概念,即具有自己的体系结构和初始训练数据集的模型可以在测试阶段识别未知类,并在检测到新类的证据时自动更新自身。增量学习中出现的一些问题是,重复训练分类器时资源的使用效率低下,并且随着时间的推移,多个类的添加会降低分类精度。实例化新类的过程会根据需要重复多次,错误也随之累积。针对这些问题,本文提出了分类置信阈值方法,使神经网络为增量学习做好准备,并通过限制遗忘来保持较高的精度。同时采用一种精简方法来减少神经网络再训练中使用的资源。所提出的方法基于这样一种思想:即使只接触到与新类相关联的有限数量的样本,网络也能够增量地学习新类。这种方法可以应用于大多数现有的神经网络,只需对网络结构进行最小的修改。 摘要:Most modern neural networks for classification fail to take into account the concept of the unknown. Trained neural networks are usually tested in an unrealistic scenario with only examples from a closed set of known classes. In an attempt to develop a more realistic model, the concept of working in an open set environment has been introduced. This in turn leads to the concept of incremental learning where a model with its own architecture and initial trained set of data can identify unknown classes during the testing phase and autonomously update itself if evidence of a new class is detected. Some problems that arise in incremental learning are inefficient use of resources to retrain the classifier repeatedly and the decrease of classification accuracy as multiple classes are added over time. This process of instantiating new classes is repeated as many times as necessary, accruing errors. To address these problems, this paper proposes the Classification Confidence Threshold approach to prime neural networks for incremental learning to keep accuracies high by limiting forgetting. A lean method is also used to reduce resources used in the retraining of the neural network. The proposed method is based on the idea that a network is able to incrementally learn a new class even when exposed to a limited number of samples associated with the new class. 
This method can be applied to most existing neural networks with minimal changes to network architecture.
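上述分类置信阈值的核心判定逻辑,可以用如下 NumPy 草图示意:当最高 softmax 置信度低于阈值时,样本被标记为"未知",作为增量学习中实例化新类的触发条件。阈值 0.9 为假设取值,并非论文给定的参数:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()      # 数值稳定
    e = np.exp(z)
    return e / e.sum()

def classify_with_threshold(logits, threshold=0.9):
    """返回预测类别;若最高置信度低于阈值则返回 -1("未知"),
    即触发增量学习流程中新类实例化的信号。"""
    probs = softmax(logits)
    top = int(np.argmax(probs))
    return top if probs[top] >= threshold else -1

print(classify_with_threshold(np.array([10.0, 0.0, 0.0])))   # 高置信:已知类 0
print(classify_with_threshold(np.array([0.1, 0.0, 0.05])))   # 低置信:未知 (-1)
```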

分割|语义相关(8篇)

【1】 Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation 标题:用于多目标跟踪和分割的典型交叉注意网络

作者:Lei Ke,Xia Li,Martin Danelljan,Yu-Wing Tai,Chi-Keung Tang,Fisher Yu 机构:ETH Zürich, HKUST, Kuaishou Technology 备注:Multiple object tracking and segmentation on large-scale datasets 链接:https://arxiv.org/abs/2106.11958 摘要:多目标跟踪和分割需要检测、跟踪和分割属于一组给定类的对象。大多数方法只利用时间维度来解决关联问题,而依赖于分割掩码本身的单帧预测。提出了一种典型的交叉注意网络(PCAN),能够利用丰富的时空信息进行在线多目标跟踪和分割。PCAN首先将一个时空记忆提取成一组原型,然后利用交叉注意从过去的帧中提取丰富的信息。为了分割每个对象,PCAN采用了一个原型外观模块来学习一组对比的前景和背景原型,然后随着时间的推移进行传播。大量的实验表明,PCAN在Youtube-VIS和BDD100K两个数据集上都优于当前视频实例跟踪和分割竞赛的优胜者,并且对单阶段和两阶段分割框架都显示了有效性。代码将在http://vis.xyz/pub/pcan. 摘要:Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches only exploit the temporal dimension to address the association problem, while relying on single frame predictions for the segmentation mask itself. We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from the past frames. To segment each object, PCAN adopts a prototypical appearance module to learn a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms current video instance tracking and segmentation competition winners on both Youtube-VIS and BDD100K datasets, and shows efficacy to both one-stage and two-stage segmentation frameworks. Code will be available at http://vis.xyz/pub/pcan.

【2】 Give Me Your Trained Model: Domain Adaptive Semantic Segmentation without Source Data 标题:给我你的训练模型:无源数据的域自适应语义分割

作者:Yuxi Wang,Jian Liang,Zhaoxiang Zhang 机构: University of Chinese Academy of Sciences, Center for Research on Intelligent Perception and Computing, CASIA, Center for Excellence in Brain Science and Intelligence Technology, CAS 链接:https://arxiv.org/abs/2106.11653 摘要:由于从特定场景(源)收集了大量的像素级注释,经过训练的语义分割模型表现得相当好,但在新场景(目标)中由于域偏移较大而失败。为了缩小域间的鸿沟,以往的跨域语义分割方法都假设分布对齐时源数据和目标数据共存。然而,在真实场景中访问源数据可能会引起隐私问题并侵犯知识产权。为了解决这一问题,我们重点研究了一个有趣而又富有挑战性的跨域语义分割任务,该任务只向目标域提供经过训练的源模型,并进一步提出了一个统一的无源数据域自适应语义分割框架(简称DAS$^3$)。具体地说,DAS$^3$由三个方案组成,即特征对齐、自训练和信息传播。首先,我们主要在网络输出端建立一个焦点熵损失,通过所提供的源模型,将目标特征与不可见的源特征隐式地对齐。第二,除了在一般的自我训练中使用正的伪标记外,我们首先在领域中引入负的伪标记,并提出了一种双向的自我训练策略来增强目标领域中的表征学习。最后,该信息传播方案通过伪半监督学习进一步减少了目标域内的域内差异。在合成到真实(synthesis-to-real)与跨城市驾驶数据集上的大量实验结果验证了DAS$^3$取得了最先进的性能,甚至与需要访问源数据的方法相当。 摘要:Benefited from considerable pixel-level annotations collected from a specific situation (source), the trained semantic segmentation model performs quite well, but fails in a new situation (target) due to the large domain shift. To mitigate the domain gap, previous cross-domain semantic segmentation methods always assume the co-existence of source data and target data during distribution alignment. However, the access to source data in the real scenario may raise privacy concerns and violate intellectual property. To tackle this problem, we focus on an interesting and challenging cross-domain semantic segmentation task where only the trained source model is provided to the target domain, and further propose a unified framework called Domain Adaptive Semantic Segmentation without Source data (DAS$^3$ for short). Specifically, DAS$^3$ consists of three schemes, i.e., feature alignment, self-training, and information propagation. First, we mainly develop a focal entropic loss on the network outputs to implicitly align the target features with unseen source features via the provided source model. 
Second, besides positive pseudo labels in vanilla self-training, we first introduce negative pseudo labels to the field and develop a bi-directional self-training strategy to enhance the representation learning in the target domain. Finally, the information propagation scheme further reduces the intra-domain discrepancy within the target domain via pseudo semi-supervised learning. Extensive results on synthesis-to-real and cross-city driving datasets validate DAS$^3$ yields state-of-the-art performance, even on par with methods that need access to source data.
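摘要中"在网络输出端建立焦点熵损失"的思想,可以用如下 NumPy 草图示意:对每个像素的 softmax 输出计算熵,并用焦点式权重(gamma 为假设的焦点指数)突出不确定像素。这只是熵最小化思想的一个通用示意,并非论文中焦点熵损失的精确公式:

```python
import numpy as np

def focal_entropy_loss(probs, gamma=2.0):
    """probs: (num_pixels, num_classes) 的逐像素 softmax 输出。
    逐像素熵经归一化后取 gamma 次幂作为权重,使不确定像素
    贡献更大,从而把预测推向单一类别(隐式对齐的代理目标)。"""
    eps = 1e-8
    ent = -(probs * np.log(probs + eps)).sum(axis=1)   # 逐像素熵
    max_ent = np.log(probs.shape[1])                   # 熵的上界 log(C)
    weight = (ent / max_ent) ** gamma                  # 焦点式权重
    return float((weight * ent).mean())
```

直观上:均匀分布(完全不确定)的像素损失最大,接近 one-hot 的像素几乎不产生梯度。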

【3】 SSUL: Semantic Segmentation with Unknown Label for Exemplar-based Class-Incremental Learning 标题:SSUL:面向基于样本的类增量学习的未知标签语义分割

作者:Sungmin Cha,Beomyoung Kim,Youngjoon Yoo,Taesup Moon 机构: Department of Electrical and Computer Engineering, Seoul National University, NAVER AI Lab, NAVER Clova 链接:https://arxiv.org/abs/2106.11562 摘要:我们考虑类增量式语义分割(CISS)问题。虽然最近提出的一些算法利用了知识蒸馏(KD)技术的变体来解决这个问题,但是它们仅仅部分地解决了CISS中导致灾难性遗忘的关键问题,即背景类的语义漂移和多标签预测问题。为了更好地应对这些挑战,我们提出了一种新的方法,称为SSUL-M(Semantic Segmentation with Unknown Label with Memory),通过仔细地结合几种为语义分割量身定做的技术。更具体地说,我们有三个主要贡献:(1) 在背景类中建模未知类,帮助学习未来类(帮助可塑性);(2) 冻结骨干网络和过去分类器,使用二进制交叉熵损失和伪标记克服灾难性遗忘(帮助稳定性);(3) 首次在CISS中利用微型样本记忆,同时提高可塑性和稳定性。结果表明,在标准的基准数据集上,我们的方法比最新的基线具有更好的性能。此外,我们通过深入和广泛的消融分析证明了我们的贡献,并讨论了CISS问题与用于分类的标准类增量学习的不同性质。 摘要:We consider a class-incremental semantic segmentation (CISS) problem. While some recently proposed algorithms utilized variants of knowledge distillation (KD) technique to tackle the problem, they only partially addressed the key additional challenges in CISS that causes the catastrophic forgetting; i.e., the semantic drift of the background class and multi-label prediction issue. To better address these challenges, we propose a new method, dubbed as SSUL-M (Semantic Segmentation with Unknown Label with Memory), by carefully combining several techniques tailored for semantic segmentation. More specifically, we make three main contributions; (1) modeling unknown class within the background class to help learning future classes (help plasticity), (2) freezing backbone network and past classifiers with binary cross-entropy loss and pseudo-labeling to overcome catastrophic forgetting (help stability), and (3) utilizing tiny exemplar memory for the first time in CISS to improve both plasticity and stability. As a result, we show our method achieves significantly better performance than the recent state-of-the-art baselines on the standard benchmark datasets. 
Furthermore, we justify our contributions with thorough and extensive ablation analyses and discuss different natures of the CISS problem compared to the standard class-incremental learning for classification.

【4】 Kernel Clustering with Sigmoid-based Regularization for Efficient Segmentation of Sequential Data 标题:基于Sigmoid正则化的核聚类在序列数据高效分割中的应用

作者:Tung Doan,Atsuhiro Takasu 机构:The Graduate University for Advanced Studies, SOKENDAI, Shonan Village, Hayama, Kanagawa,-, Japan, National Institute of Informatics,-,-, Hitotsubashi, Chiyoda, Tokyo,-, Japan, A R T I C L E I N F O 链接:https://arxiv.org/abs/2106.11541 摘要:核分割的目的是将一个数据序列分割成若干非重叠的段,这些段可能具有非线性和复杂的结构。一般来说,它是一个带有组合约束的离散优化问题。动态规划(DP)是一种常用的优化算法,它具有二次计算和内存需求。考虑到实际中的序列太长,这种算法不是一种实用的方法。虽然许多启发式算法被提出来逼近最优分割,但它们的解的质量没有保证。本文采用可微的方法来解决上述问题。首先,我们引入一种新的基于sigmoid的正则化来平滑地逼近组合约束。结合均衡核聚类的目标,提出了一种基于sigmoid正则化的可微核聚类模型(KCSR),利用基于梯度的算法得到最优分割。其次,我们提出了一个随机变量的模型。第二种模型利用时空复杂度较低的随机梯度下降算法进行优化,可以对超长数据序列进行分割。最后,为了同时分割多个数据序列,我们稍微修改了基于sigmoid的正则化,进一步引入了扩展的模型。通过对不同类型数据序列的大量实验,评估了模型的性能,并与现有方法进行了比较。实验结果验证了所提模型的优越性。我们的Matlab源代码可以在github上获得。 摘要:Kernel segmentation aims at partitioning a data sequence into several non-overlapping segments that may have nonlinear and complex structures. In general, it is formulated as a discrete optimization problem with combinatorial constraints. A popular algorithm for optimally solving this problem is dynamic programming (DP), which has quadratic computation and memory requirements. Given that sequences in practice are too long, this algorithm is not a practical approach. Although many heuristic algorithms have been proposed to approximate the optimal segmentation, they have no guarantee on the quality of their solutions. In this paper, we take a differentiable approach to alleviate the aforementioned issues. First, we introduce a novel sigmoid-based regularization to smoothly approximate the combinatorial constraints. Combining it with objective of the balanced kernel clustering, we formulate a differentiable model termed Kernel clustering with sigmoid-based regularization (KCSR), where the gradient-based algorithm can be exploited to obtain the optimal segmentation. Second, we develop a stochastic variant of the proposed model. 
By using the stochastic gradient descent algorithm, which has much lower time and space complexities, for optimization, the second model can perform segmentation on overlong data sequences. Finally, for simultaneously segmenting multiple data sequences, we slightly modify the sigmoid-based regularization to further introduce an extended variant of the proposed model. Through extensive experiments on various types of data sequences performances of our models are evaluated and compared with those of the existing methods. The experimental results validate advantages of the proposed models. Our Matlab source code is available on github.
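摘要中"基于sigmoid的正则化平滑逼近组合约束"的核心构造,可以用如下 NumPy 草图示意:用一组 sigmoid 阶跃的差构造可微的"段隶属度"矩阵,使实数值的段边界可以通过梯度下降优化。这只是该思想的示意性草图(tau 等参数为假设取值),并非论文 KCSR 模型的实现:

```python
import numpy as np

def soft_segment_membership(T, boundaries, tau=0.5):
    """长度为 T 的序列的可微段隶属度矩阵 (T, K)。
    boundaries 是 K-1 个实数值边界位置;每个边界处放一个
    sigmoid 阶跃,相邻阶跃之差即第 k 段的软隶属度,
    每行之和恒为 1(伸缩求和)。"""
    t = np.arange(T)[:, None]                        # (T, 1) 时间索引
    b = np.asarray(boundaries)[None, :]              # (1, K-1) 边界位置
    steps = 1.0 / (1.0 + np.exp(-(t - b) / tau))     # 每个边界处的 sigmoid 阶跃
    left = np.concatenate([np.ones((T, 1)), steps], axis=1)   # 虚拟左端 = 1
    right = np.concatenate([steps, np.zeros((T, 1))], axis=1) # 虚拟右端 = 0
    return left - right                              # (T, K)
```

tau 越小,隶属度越接近硬性 0/1 划分;优化时通常从较大的 tau 逐步退火。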

【5】 SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure 标题:SA-LOAM:带回环检测的语义辅助LiDAR SLAM

作者:Lin Li,Xin Kong,Xiangrui Zhao,Wanlong Li,Feng Wen,Hongbo Zhang,Yong Liu 机构: Zhejiang University 链接:https://arxiv.org/abs/2106.11516 摘要:基于激光雷达的SLAM系统是公认的比其他系统更精确和稳定,但其回环检测仍然是一个悬而未决的问题。随着点云三维语义分割技术的发展,可以方便、稳定地获取点云的语义信息,这是实现高层次智能化的关键,也有利于SLAM。在本文中,我们提出了一种新的基于LOAM的带回环检测的语义辅助LiDAR SLAM,称为SA-LOAM,它在里程计和回环检测中均利用了语义信息。具体来说,我们提出了一种语义辅助的ICP,包括语义匹配、下采样和平面约束,并在回环检测模块中集成了基于语义图的位置识别方法。借助于语义,我们可以提高定位精度,有效地检测回环,甚至在大规模场景中也可以构造一个全局一致的语义地图。在KITTI和Ford校园数据集上的大量实验表明,该系统显著提高了基线性能,对未知数据具有泛化能力,取得了与现有方法相比较有竞争力的结果。 摘要:LiDAR-based SLAM system is admittedly more accurate and stable than others, while its loop closure detection is still an open issue. With the development of 3D semantic segmentation for point cloud, semantic information can be obtained conveniently and steadily, essential for high-level intelligence and conductive to SLAM. In this paper, we present a novel semantic-aided LiDAR SLAM with loop closure based on LOAM, named SA-LOAM, which leverages semantics in odometry as well as loop closure detection. Specifically, we propose a semantic-assisted ICP, including semantically matching, downsampling and plane constraint, and integrates a semantic graph-based place recognition method in our loop closure detection module. Benefitting from semantics, we can improve the localization accuracy, detect loop closures effectively, and construct a global consistent semantic map even in large-scale scenes. Extensive experiments on KITTI and Ford Campus dataset show that our system significantly improves baseline performance, has generalization ability to unseen data and achieves competitive results compared with state-of-the-art methods.
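"语义匹配"在 ICP 对应点搜索中的作用,可以用如下 NumPy 草图示意:只允许语义标签相同的点之间建立最近邻对应。这只是对"semantically matching"一词的示意性理解(max_dist 等参数为假设),并非 SA-LOAM 流水线的实现:

```python
import numpy as np

def semantic_matches(src_pts, src_lbl, tgt_pts, tgt_lbl, max_dist=1.0):
    """一步带语义约束的 ICP 式对应点搜索:源点只在目标点云中
    同一语义类别的点里找最近邻,且距离需不超过 max_dist。"""
    pairs = []
    for i, (p, l) in enumerate(zip(src_pts, src_lbl)):
        mask = tgt_lbl == l                     # 限制在同语义类别内
        if not mask.any():
            continue
        cand = np.where(mask)[0]
        d = np.linalg.norm(tgt_pts[cand] - p, axis=1)
        j = cand[np.argmin(d)]
        if d.min() <= max_dist:                 # 距离门限剔除离群对应
            pairs.append((i, int(j)))
    return pairs
```

语义过滤剔除了"几何上近但类别不同"的错误对应,这正是语义辅助ICP能提高配准鲁棒性的直观原因。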

【6】 VoxelEmbed: 3D Instance Segmentation and Tracking with Voxel Embedding based Deep Learning 标题:VoxelEmbed:基于体素嵌入的深度学习3D实例分割与跟踪

作者:Mengyang Zhao,Quan Liu,Aadarsh Jha,Ruining Deng,Tianyuan Yao,Anita Mahadevan-Jansen,Matthew J. Tyska,Bryan A. Millis,Yuankai Huo 机构: Dartmouth College, Hanover, NH , Vanderbilt University, Nashville TN , USA 链接:https://arxiv.org/abs/2106.11480 摘要:生物成像技术的最新进展为科学家们提供了一种优越的高时空分辨率的三维立体视频来观察活细胞的动态。不幸的是,三维生物医学视频分析是滞后的,受限于依赖现成三维分析工具、对资源消耗不敏感的人工整理。在这里,生物学家通常需要通过最大强度投影在2D分析上妥协,从而丢弃大量丰富的3D空间信息。最近,基于像素嵌入的细胞实例分割和跟踪为理解细胞动力学提供了一种简洁而通用的计算范式。在这项工作中,我们提出了一种新的基于时空体素嵌入(VoxelEmbed)的学习方法来同时对三维立体视频序列进行细胞实例分割和跟踪。我们的贡献主要体现在四个方面:(1) 提出的体素嵌入利用三维上下文信息对像素嵌入进行了推广;(2) 提出了一种简单的多流学习方法,该方法允许有效的时空嵌入;(3) 完成了一个端到端的单阶段三维细胞实例分割与跟踪框架,无需进行大量参数调整;(4) 所提出的三维量化仅需一块12 GB显存的GPU,内存效率高。我们在ISBI细胞追踪挑战赛的四个3D数据集(不同细胞类型)上评估了我们的VoxelEmbed方法。VoxelEmbed方法在两个注释密集的数据集上取得了一致的优异的整体性能。在两个注释稀少的队列(分别只有20.6%和2%的数据带有分割注释)上,性能也具有竞争力。结果表明,VoxelEmbed方法是一种可推广的、内存效率高的解决方案。 摘要:Recent advances in bioimaging have provided scientists a superior high spatial-temporal resolution to observe dynamics of living cells as 3D volumetric videos. Unfortunately, the 3D biomedical video analysis is lagging, impeded by resource insensitive human curation using off-the-shelf 3D analytic tools. Herein, biologists often need to discard a considerable amount of rich 3D spatial information by compromising on 2D analysis via maximum intensity projection. Recently, pixel embedding-based cell instance segmentation and tracking provided a neat and generalizable computing paradigm for understanding cellular dynamics. In this work, we propose a novel spatial-temporal voxel-embedding (VoxelEmbed) based learning method to perform simultaneous cell instance segmenting and tracking on 3D volumetric video sequences. 
Our contribution is in four-fold: (1) The proposed voxel embedding generalizes the pixel embedding with 3D context information; (2) Present a simple multi-stream learning approach that allows effective spatial-temporal embedding; (3) Accomplished an end-to-end framework for one-stage 3D cell instance segmentation and tracking without heavy parameter tuning; (4) The proposed 3D quantification is memory efficient via a single GPU with 12 GB memory. We evaluate our VoxelEmbed method on four 3D datasets (with different cell types) from the ISBI Cell Tracking Challenge. The proposed VoxelEmbed method achieved consistent superior overall performance (OP) on two densely annotated datasets. The performance is also competitive on two sparsely annotated cohorts with 20.6% and 2% of data-set having segmentation annotations. The results demonstrate that the VoxelEmbed method is a generalizable and memory-efficient solution.

【7】 Encoder-Decoder Architectures for Clinically Relevant Coronary Artery Segmentation 标题:用于临床相关冠状动脉分割的编解码器结构

作者:João Lourenço Silva,Miguel Nobre Menezes,Tiago Rodrigues,Beatriz Silva,Fausto J. Pinto,Arlindo L. Oliveira 机构:INESC-ID Instituto Superior Técnico, University of Lisbon, Cardiology Department, CAML, CCUL, Lisbon School of Medicine, University of Lisbon 链接:https://arxiv.org/abs/2106.11447 摘要:冠状动脉X射线造影是诊断和治疗冠状动脉疾病的重要临床手段,冠状动脉疾病每年约占全球死亡人数的16%。然而,在这些过程中获得的图像分辨率低,对比度差,使病变检测和评估具有挑战性。准确的冠状动脉分割不仅有助于缓解这些问题,而且可以提取相关的解剖特征,以便通过定量方法进行进一步分析。虽然冠状动脉的自动分割已经被提出过,但是以前的方法都使用非最佳分割标准,导致了不太有用的结果。大多数方法要么只分割主要血管,丢弃其余血管的重要信息,要么主要基于对比度信息分割整个冠状动脉树,产生噪声输出,其中包括与诊断无关的血管。我们采用更适合的临床标准,并根据其临床相关性分割血管。此外,我们同时进行导管分割,由于导管的已知直径提供的比例因子,这可能有助于诊断,而且这是一项尚未取得良好结果的任务。为了得到最佳的方法,我们对以焦点损失与一种广义Dice损失变体的组合进行训练的编码器-解码器结构开展了广泛的比较研究。基于EfficientNet和UNet++体系结构,我们提出了一系列高效和高性能的分割模型,使用了一种新的解码器体系结构EfficientUNet++,其最佳性能版本在动脉和导管类上的平均Dice分数分别为0.8904和0.7526,平均广义Dice分数为0.9234。 摘要:Coronary X-ray angiography is a crucial clinical procedure for the diagnosis and treatment of coronary artery disease, which accounts for roughly 16% of global deaths every year. However, the images acquired in these procedures have low resolution and poor contrast, making lesion detection and assessment challenging. Accurate coronary artery segmentation not only helps mitigate these problems, but also allows the extraction of relevant anatomical features for further analysis by quantitative methods. Although automated segmentation of coronary arteries has been proposed before, previous approaches have used non-optimal segmentation criteria, leading to less useful results. Most methods either segment only the major vessel, discarding important information from the remaining ones, or segment the whole coronary tree based mostly on contrast information, producing a noisy output that includes vessels that are not relevant for diagnosis. We adopt a better-suited clinical criterion and segment vessels according to their clinical relevance. 
Additionally, we simultaneously perform catheter segmentation, which may be useful for diagnosis due to the scale factor provided by the catheter's known diameter, and is a task that has not yet been performed with good results. To derive the optimal approach, we conducted an extensive comparative study of encoder-decoder architectures trained on a combination of focal loss and a variant of generalized dice loss. Based on the EfficientNet and the UNet++ architectures, we propose a line of efficient and high-performance segmentation models using a new decoder architecture, the EfficientUNet++, whose best-performing version achieved average dice scores of 0.8904 and 0.7526 for the artery and catheter classes, respectively, and an average generalized dice score of 0.9234.
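摘要中提到的 Dice 分数与广义 Dice(generalized Dice)的计算方式,可以用如下 NumPy 草图示意。其中 1/面积² 的类别权重是广义 Dice 的一种常见取法(用来强调细小结构,如细血管),具体加权方式为假设,未必与论文一致:

```python
import numpy as np

def dice_score(pred, target, eps=1e-6):
    """预测概率图与二值掩码之间的软 Dice 分数。"""
    inter = (pred * target).sum()
    return float((2 * inter + eps) / (pred.sum() + target.sum() + eps))

def generalized_dice(preds, targets, eps=1e-6):
    """C 个类别上的广义 Dice:按 1/面积^2 加权后聚合,
    preds/targets 形状均为 (C, H, W)。"""
    p = preds.reshape(preds.shape[0], -1)
    t = targets.reshape(targets.shape[0], -1)
    w = 1.0 / (t.sum(axis=1) ** 2 + eps)          # 小结构权重更大
    inter = (p * t).sum(axis=1)
    union = (p + t).sum(axis=1)
    return float(2 * (w * inter).sum() / ((w * union).sum() + eps))
```

训练时通常以 1 - Dice 作为损失项,与焦点损失加权组合。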

【8】 Context-aware PolyUNet for Liver and Lesion Segmentation from Abdominal CT Images 标题:基于上下文感知的PolyUNet在腹部CT图像肝脏和病变分割中的应用

作者:Liping Zhang,Simon Chun-Ho Yu 机构:The Chinese University of Hong Kong 备注:7 pages and 3 figures 链接:https://arxiv.org/abs/2106.11330 摘要:从CT图像中准确分割肝脏和病变是临床上辅助诊断和评估肝脏肿瘤疾病的迫切要求。然而,由于对比度、分辨率和图像质量的多样性,从增强CT容积中自动分割肝脏和病变是非常具有挑战性的。以往的基于UNet的二维逐片或三维逐体分割方法要么空间上下文不足,要么GPU计算量大,性能受到限制。为了解决这些问题,我们提出了一种新的上下文感知PolyUNet来精确分割肝脏和病变。它联合探索结构多样性和连续的t-相邻切片,以丰富特征表达能力和空间上下文信息,同时避免GPU内存消耗的过载。此外,我们利用缩小/放大和两阶段细化策略来排除不相关的上下文,并聚焦于特定区域进行细粒度分割。我们的方法以单一模型在MICCAI 2017肝脏肿瘤分割(LiTS)挑战赛的所有任务中取得了非常有竞争力的性能,在肝脏分割、病灶分割、病灶检测和肿瘤负荷估计中分别排名第3、第12、第2和第5位。 摘要:Accurate liver and lesion segmentation from computed tomography (CT) images are highly demanded in clinical practice for assisting the diagnosis and assessment of hepatic tumor disease. However, automatic liver and lesion segmentation from contrast-enhanced CT volumes is extremely challenging due to the diversity in contrast, resolution, and quality of images. Previous methods based on UNet for 2D slice-by-slice or 3D volume-by-volume segmentation either lack sufficient spatial contexts or suffer from high GPU computational cost, which limits the performance. To tackle these issues, we propose a novel context-aware PolyUNet for accurate liver and lesion segmentation. It jointly explores structural diversity and consecutive t-adjacent slices to enrich feature expressive power and spatial contextual information while avoiding the overload of GPU memory consumption. In addition, we utilize zoom out/in and two-stage refinement strategy to exclude the irrelevant contexts and focus on the specific region for the fine-grained segmentation. Our method achieved very competitive performance at the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge among all tasks with a single model and ranked the $3^{rd}$, $12^{th}$, $2^{nd}$, and $5^{th}$ places in the liver segmentation, lesion segmentation, lesion detection, and tumor burden estimation, respectively.

Zero/Few Shot|迁移|域适配|自适应(6篇)

【1】 Enhanced Separable Disentanglement for Unsupervised Domain Adaptation 标题:用于无监督域自适应的增强型可分离解缠模型

作者:Youshan Zhang,Brian D. Davison 机构:Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA 备注:ICIP 2021 链接:https://arxiv.org/abs/2106.11915 摘要:领域自适应的目的是在将知识从一个已有的标记领域转移到一个新的领域时,减少领域间的差距。然而,现有的基于解纠缠的方法并没有充分考虑域不变特征和特定域特征的分离,这意味着域不变特征不具有区分性。重建的特征在训练过程中也没有得到充分的利用。在本文中,我们提出了一个新的增强的可分离解纠缠(ESD)模型。我们首先使用一个解纠缠器来提取域不变和域特定的特征。然后,我们应用特征分离增强过程来最小化域不变和域特定特征之间的污染。最后,我们的模型重建了完整的特征向量,在训练阶段用于进一步的解纠缠。在三个基准数据集上的大量实验表明,我们的方法优于最先进的方法,特别是在具有挑战性的跨域任务上。 摘要:Domain adaptation aims to mitigate the domain gap when transferring knowledge from an existing labeled domain to a new domain. However, existing disentanglement-based methods do not fully consider separation between domain-invariant and domain-specific features, which means the domain-invariant features are not discriminative. The reconstructed features are also not sufficiently used during training. In this paper, we propose a novel enhanced separable disentanglement (ESD) model. We first employ a disentangler to distill domain-invariant and domain-specific features. Then, we apply feature separation enhancement processes to minimize contamination between domain-invariant and domain-specific features. Finally, our model reconstructs complete feature vectors, which are used for further disentanglement during the training phase. Extensive experiments from three benchmark datasets outperform state-of-the-art methods, especially on challenging cross-domain tasks.

【2】 Domain-Smoothing Network for Zero-Shot Sketch-Based Image Retrieval 标题:用于基于Zero-Shot草图的图像检索的域平滑网络

作者:Zhipeng Wang,Hao Wang,Jiexi Yan,Aming Wu,Cheng Deng 机构:Xidian University 备注:Accepted to IJCAI 2021 链接:https://arxiv.org/abs/2106.11841 摘要:基于Zero-Shot草图的图像检索(ZS-SBIR)是一种新型的跨模式检索任务,它以抽象草图作为查询,在Zero-Shot场景下检索自然图像。现有的方法大多将ZS-SBIR作为一个传统的分类问题,采用交叉熵或基于三元组的损失来实现检索,而忽略了草图与自然图像之间的域差距和草图中类内差异大的问题。为此,我们提出了一种新的用于ZS-SBIR的域平滑网络(DSN)。具体地说,本文提出了一种跨模态对比方法来学习广义表示,通过挖掘额外的增广样本来平滑域间的关系。此外,还探讨了一种具有草图特征的类别特定内存库,以减少草图域中的类内多样性。大量实验表明,在Sketchy和TU-Berlin数据集中,我们的方法都明显优于最新的方法。我们的源代码在https://github.com/haowang1992/DSN. 摘要:Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is a novel cross-modal retrieval task, where abstract sketches are used as queries to retrieve natural images under zero-shot scenario. Most existing methods regard ZS-SBIR as a traditional classification problem and employ a cross-entropy or triplet-based loss to achieve retrieval, which neglect the problems of the domain gap between sketches and natural images and the large intra-class diversity in sketches. Toward this end, we propose a novel Domain-Smoothing Network (DSN) for ZS-SBIR. Specifically, a cross-modal contrastive method is proposed to learn generalized representations to smooth the domain gap by mining relations with additional augmented samples. Furthermore, a category-specific memory bank with sketch features is explored to reduce intra-class diversity in the sketch domain. Extensive experiments demonstrate that our approach notably outperforms the state-of-the-art methods in both Sketchy and TU-Berlin datasets. Our source code is publicly available at https://github.com/haowang1992/DSN.
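摘要中的"跨模态对比方法",可以用如下 NumPy 草图示意:对 L2 归一化后的草图/图像嵌入计算 InfoNCE 式对比损失,同一行的草图-图像对为正样本,其余为负样本。这只是一个通用的对比损失示意(温度等参数为假设),并非 DSN 的精确目标函数:

```python
import numpy as np

def info_nce(sketch_emb, image_emb, temperature=0.1):
    """草图与图像嵌入间的 InfoNCE 损失:矩阵第 i 行互为匹配对。
    相似度矩阵的对角线为正样本,其余为批内负样本。"""
    s = sketch_emb / np.linalg.norm(sketch_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = s @ v.T / temperature                         # (N, N) 相似度
    logits -= logits.max(axis=1, keepdims=True)            # 数值稳定
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # 对角线为正样本
```

当匹配对的嵌入对齐时损失接近 0,错位时损失显著增大;摘要中的"额外增广样本"可作为扩展的正/负样本加入同一批次。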

【3】 The Hitchhiker's Guide to Prior-Shift Adaptation 标题:先验偏移自适应漫游指南

作者:Tomas Sipka,Milan Sulc,Jiri Matas 机构:Dept. of Cybernetics, Czech Technical University in Prague 备注:16 pages, 7 figures 链接:https://arxiv.org/abs/2106.11695 摘要:在许多计算机视觉分类任务中,测试时的类先验往往不同于训练集上的先验。在这种先验变换的情况下,分类器必须相应地调整以保持接近最优的性能。本文分析了概率分类器对新先验的自适应方法和在未标记测试集上估计新先验的方法。我们提出了一种新的方法来解决基于混淆矩阵的先验估计方法的一个已知问题,其中决策概率和混淆矩阵的不一致估计导致估计的先验值为负值。在细粒度图像分类数据集上的实验表明,该方法在先验自适应方面取得了很好的效果。将该方法应用于两个先验不平衡的任务,即网络爬虫图像学习和植物物种分类,识别率分别提高了1.1%和3.4%。 摘要:In many computer vision classification tasks, class priors at test time often differ from priors on the training set. In the case of such prior shift, classifiers must be adapted correspondingly to maintain close to optimal performance. This paper analyzes methods for adaptation of probabilistic classifiers to new priors and for estimating new priors on an unlabeled test set. We propose a novel method to address a known issue of prior estimation methods based on confusion matrices, where inconsistent estimates of decision probabilities and confusion matrices lead to negative values in the estimated priors. Experiments on fine-grained image classification datasets provide insight into the best practice of prior shift estimation and classifier adaptation and show that the proposed method achieves state-of-the-art results in prior adaptation. Applying the best practice to two tasks with naturally imbalanced priors, learning from web-crawled images and plant species classification, increased the recognition accuracy by 1.1% and 3.4% respectively.
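摘要中"使概率分类器适应新先验"的标准做法,可以用如下 NumPy 草图示意:用测试先验与训练先验之比重加权后验概率并重新归一化。这是先验偏移校正的通用公式示意,不包含论文针对混淆矩阵先验估计的新方法:

```python
import numpy as np

def adapt_to_new_priors(probs, train_priors, test_priors):
    """把按训练先验学到的后验 p(y|x) 校正到新的测试先验:
    p_new(y|x) ∝ p(y|x) * test_prior(y) / train_prior(y)。"""
    w = np.asarray(test_priors) / np.asarray(train_priors)
    adapted = probs * w
    return adapted / adapted.sum(axis=-1, keepdims=True)

# 训练时类别均衡、测试时类别 0 占 90% 的情形:
print(adapt_to_new_priors(np.array([0.5, 0.5]), [0.5, 0.5], [0.9, 0.1]))
```

实践中 test_priors 通常未知,需要在未标记测试集上估计,这正是论文所分析和改进的部分。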

【4】 Universal Domain Adaptation in Ordinal Regression 标题:有序回归中的泛域自适应

作者:Chidlovskii Boris,Assem Sadek,Christian Wolf 机构:Naver Labs Europe, ch. Maupertuis , Meylan, France, INSA-Lyon, LIRIS, UMR CNRS , Villeurbanne 链接:https://arxiv.org/abs/2106.11576 摘要:我们讨论了序数回归(OR)中的泛域适应(UDA)问题,它试图解决标签不是独立的,而是遵循自然顺序的分类问题。我们证明了为分类而开发的UDA技术和基于聚类假设的UDA技术在或设置中的性能不足。我们提出了一种用顺序学习辅助任务来补充OR分类器的方法,它起到区分公共实例和私有实例的双重作用,并通过排序将类标签扩展到私有目标图像。结合对抗域识别,我们的模型能够处理闭集、部分和开集配置。我们在三个人脸年龄估计数据集上对我们的方法进行了评估,结果表明它优于基线方法。 摘要:We address the problem of universal domain adaptation (UDA) in ordinal regression (OR), which attempts to solve classification problems in which labels are not independent, but follow a natural order. We show that the UDA techniques developed for classification and based on the clustering assumption, under-perform in OR settings. We propose a method that complements the OR classifier with an auxiliary task of order learning, which plays the double role of discriminating between common and private instances, and expanding class labels to the private target images via ranking. Combined with adversarial domain discrimination, our model is able to address the closed set, partial and open set configurations. We evaluate our method on three face age estimation datasets, and show that it outperforms the baseline methods.

【5】 f-Domain-Adversarial Learning: Theory and Algorithms 标题:f-域对抗学习:理论与算法

作者:David Acuna,Guojun Zhang,Marc T. Law,Sanja Fidler 机构: NVIDIA, University of Toronto, Vector Institute, University of Waterloo 备注:ICML 2021 链接:https://arxiv.org/abs/2106.11344 摘要:无监督域自适应应用于许多机器学习场景:在训练过程中,模型可以访问目标域中的未标记数据以及一个相关的有标记数据集。本文介绍了一种新颖且通用的领域对抗框架。具体来说,我们利用基于f-散度变分刻画的新的分布差异度量,推导了一个新的域自适应泛化界。它将Ben-David等人(2010a)的理论结果作为特例恢复,并支持实践中使用的各种散度。基于这个界,我们导出了一个新的算法框架,该框架对Ganin等人(2016)的原始对抗训练方法引入了关键修正。我们表明,过去几年在这个框架中引入的许多正则化器和特殊目标,并不是达到与最先进的领域对抗方法相当(甚至更好)性能所必需的。在真实的自然语言和计算机视觉数据集上进行的实验分析表明,我们的框架优于现有的基线,并且对于以往领域对抗学习中未曾考虑的f-散度取得了最好的结果。 摘要:Unsupervised domain adaptation is used in many machine learning applications where, during training, a model has access to unlabeled data in the target domain, and a related labeled dataset. In this paper, we introduce a novel and general domain-adversarial framework. Specifically, we derive a novel generalization bound for domain adaptation that exploits a new measure of discrepancy between distributions based on a variational characterization of f-divergences. It recovers the theoretical results from Ben-David et al. (2010a) as a special case and supports divergences used in practice. Based on this bound, we derive a new algorithmic framework that introduces a key correction in the original adversarial training method of Ganin et al. (2016). We show that many regularizers and ad-hoc objectives introduced over the last years in this framework are then not required to achieve performance comparable to (if not better than) state-of-the-art domain-adversarial methods. Experimental analysis conducted on real-world natural language and computer vision datasets show that our framework outperforms existing baselines, and obtains the best results for f-divergences that were not considered previously in domain-adversarial learning.

【6】 BiAdam: Fast Adaptive Bilevel Optimization Methods 标题:BiAdam:快速自适应双层优化方法

作者:Feihu Huang,Heng Huang 机构:Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, USA, Editor: 备注:20 pages, 2 tables 链接:https://arxiv.org/abs/2106.11396 摘要:二层优化由于其在超参数优化、策略优化等方面的广泛应用,近年来引起了机器学习领域的广泛关注。虽然最近有人提出了一些方法来解决二层问题,但这些方法没有考虑使用自适应学习率。为了填补这一空白,本文提出了一类快速有效的自适应方法,用于求解外部可能非凸、内部可能强凸的双层优化问题。具体来说,我们提出了一种基于基本动量技术的快速单循环BiAdam算法,该算法的样本复杂度为$\tilde{O}(\epsilon^{-4})$,用于寻找$\epsilon$-平稳点。同时,我们提出了一种基于方差缩减技术的加速BiAdam算法(VR-BiAdam),其样本复杂度达到了$\tilde{O}(\epsilon^{-3})$。为了进一步减少导数估计的计算量,我们提出了一种避免Hessian逆的快速单循环随机逼近BiAdam算法(saBiAdam),在没有大批量的情况下,该算法的样本复杂度仍然达到$\tilde{O}(\epsilon^{-4})$。我们进一步提出了一种加速的saBiAdam算法(VR-saBiAdam),该算法的样本复杂度达到了$\tilde{O}(\epsilon^{-3})$。我们将统一的自适应矩阵应用到我们的方法中,如SUPER-ADAM\citep{huang2021super},其中包括多种类型的自适应学习率。此外,我们的框架可以灵活地使用动量和方差缩减技术。特别是,我们提供了一个有用的收敛性分析框架的约束和无约束双层优化。据我们所知,我们首先研究了具有自适应学习率的自适应双层优化方法。 摘要:Bilevel optimization recently has attracted increased interest in machine learning due to its many applications such as hyper-parameter optimization and policy optimization. Although some methods recently have been proposed to solve the bilevel problems, these methods do not consider using adaptive learning rates. To fill this gap, in the paper, we propose a class of fast and effective adaptive methods for solving bilevel optimization problems that the outer problem is possibly nonconvex and the inner problem is strongly-convex. Specifically, we propose a fast single-loop BiAdam algorithm based on the basic momentum technique, which achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary point. At the same time, we propose an accelerated version of BiAdam algorithm (VR-BiAdam) by using variance reduced technique, which reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$. 
To further reduce computation in estimating derivatives, we propose a fast single-loop stochastic approximated BiAdam algorithm (saBiAdam) by avoiding the Hessian inverse, which still achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ without large batches. We further present an accelerated version of saBiAdam algorithm (VR-saBiAdam), which also reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$. We apply the unified adaptive matrices to our methods as the SUPER-ADAM \citep{huang2021super}, which including many types of adaptive learning rates. Moreover, our framework can flexibly use the momentum and variance reduced techniques. In particular, we provide a useful convergence analysis framework for both the constrained and unconstrained bilevel optimization. To the best of our knowledge, we first study the adaptive bilevel optimization methods with adaptive learning rates.

半弱无监督|主动学习|不确定性(6篇)

【1】 Unsupervised Object-Level Representation Learning from Scene Images 标题:从场景图像中学习无监督的对象级表示

作者:Jiahao Xie,Xiaohang Zhan,Ziwei Liu,Yew Soon Ong,Chen Change Loy 机构:Nanyang Technological University, The Chinese University of Hong Kong 链接:https://arxiv.org/abs/2106.11952 摘要:对比自监督学习在很大程度上缩小了与ImageNet上有监督预训练的差距。然而,它的成功在很大程度上依赖于ImageNet以对象为中心的先验知识,即同一图像的不同增强视图对应于同一对象。当对包含许多对象的更复杂场景图像进行预训练时,这样一个严格控制的约束立即变得不可行。为了克服这一局限性,我们引入了一种新的面向场景图像的自监督学习框架对象级表示学习(ORL)。我们的关键是利用图像级的自监督预训练作为先验知识来发现对象级的语义对应关系,从而实现场景图像的对象级表示学习。在COCO上的大量实验表明,ORL显著提高了场景图像的自监督学习性能,甚至在一些下游任务上超过了有监督的ImageNet预训练。此外,ORL在获得更多未标记场景图像的情况下提高了下游性能,显示了其在野外利用未标记数据的巨大潜力。我们希望我们的方法能推动未来研究更通用的无监督表示学习场景数据。项目页面:https://www.mmlab-ntu.com/project/orl/. 摘要:Contrastive self-supervised learning has largely narrowed the gap to supervised pre-training on ImageNet. However, its success highly relies on the object-centric priors of ImageNet, i.e., different augmented views of the same image correspond to the same object. Such a heavily curated constraint becomes immediately infeasible when pre-trained on more complex scene images with many objects. To overcome this limitation, we introduce Object-level Representation Learning (ORL), a new self-supervised learning framework towards scene images. Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence, thus realizing object-level representation learning from scene images. Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. Furthermore, ORL improves the downstream performance when more unlabeled scene images are available, demonstrating its great potential of harnessing unlabeled data in the wild. We hope our approach can motivate future research on more general-purpose unsupervised representation learning from scene data. Project page: https://www.mmlab-ntu.com/project/orl/.

【2】 MEAL: Manifold Embedding-based Active Learning 标题:MEAL:基于流形嵌入的主动学习

作者:Deepthi Sreenivasaiah,Thomas Wollmann 机构:Merantix Labs GmbH, Berlin, Germany 链接:https://arxiv.org/abs/2106.11858 摘要:在自动驾驶中,图像分割是一项常见而又具有挑战性的任务。为训练数据提供足够的像素级注释是一个障碍。主动学习通过推荐最有希望的标记样本,帮助从少量数据中学习。在这项工作中,我们提出了一种新的基于池的主动学习方法,即在每一个采集步骤中提出有希望的图像区域。在勘探开发框架下,将基于均匀流形逼近的嵌入方法与熵作为不确定性测度的模型表示方法相结合。我们将我们提出的方法应用于具有挑战性的自主驾驶数据集CamVid和Cityscapes,并与最先进的方法进行了定量比较。我们发现,与其他方法相比,我们的主动学习方法在CamVid上获得了更好的性能,而在城市景观上,性能提升可以忽略不计。 摘要:Image segmentation is a common and challenging task in autonomous driving. Availability of sufficient pixel-level annotations for the training data is a hurdle. Active learning helps learning from small amounts of data by suggesting the most promising samples for labeling. In this work, we propose a new pool-based method for active learning, which proposes promising image regions, in each acquisition step. The problem is framed in an exploration-exploitation framework by combining an embedding based on Uniform Manifold Approximation to model representativeness with entropy as uncertainty measure to model informativeness. We applied our proposed method to the challenging autonomous driving data sets CamVid and Cityscapes and performed a quantitative comparison with state-of-the-art methods. We find that our active learning method achieves better performance on CamVid compared to other methods, while on Cityscapes, the performance lift was negligible.
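作为参考,下面给出一个把熵不确定性(informativeness)与嵌入空间代表性(representativeness)结合起来的贪心采样最小示意(仅为说明思路的草图,非论文原始实现;这里用一个通用的低维嵌入数组代替UMAP投影,`alpha` 为假设的权衡参数):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of per-sample softmax outputs (uncertainty)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def acquire(probs, embedding, k, alpha=0.5):
    """Greedily pick k samples scoring high on both uncertainty (entropy)
    and representativeness (distance to already-picked samples in the
    embedding space). `embedding` stands in for the UMAP projection."""
    unc = entropy(probs)
    unc = unc / (unc.max() + 1e-12)
    picked = []
    for _ in range(k):
        if picked:
            # distance of every sample to its nearest already-picked sample
            d = np.min(
                np.linalg.norm(embedding[:, None] - embedding[picked][None], axis=2),
                axis=1)
            rep = d / (d.max() + 1e-12)
        else:
            rep = np.ones(len(probs))
        score = alpha * unc + (1 - alpha) * rep
        score[picked] = -np.inf
        picked.append(int(np.argmax(score)))
    return picked
```

实际方法在图像区域级而非样本级采样,并在每个采集步骤后重新训练模型。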

【3】 Weakly-Supervised Temporal Action Localization Through Local-Global Background Modeling 标题:基于局部-全局背景建模的弱监督时间动作定位

作者:Xiang Wang,Zhiwu Qing,Ziyuan Huang,Yutong Feng,Shiwei Zhang,Jianwen Jiang,Mingqian Tang,Yuanjie Shao,Nong Sang 机构: Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Alibaba Group 备注:None 链接:https://arxiv.org/abs/2106.11811 摘要:弱监督时态动作定位(WS-TAL)任务的目标是识别和定位未经剪辑的视频中动作实例的时态开始和结束,并且只需要视频级别的标签监督。由于缺乏背景类的负样本,网络很难将前景和背景分开,导致检测性能较差。在本报告中,我们提出了针对2021年HACS挑战赛弱监督学习赛道(Weakly-supervised Learning Track)的解决方案,该方案基于BaSNet来解决上述问题。具体来说,我们首先采用预先训练好的CSN、Slowfast、TDN和ViViT作为特征抽取器来获取特征序列。然后,基于多实例学习(MIL)方法,训练局部全局背景建模网络(LGBM-Net),只使用视频级标签进行实例定位。最后,我们将多个模型进行集成,得到最终的检测结果,在测试集上达到22.45%的mAP。 摘要:Weakly-Supervised Temporal Action Localization (WS-TAL) task aims to recognize and localize temporal starts and ends of action instances in an untrimmed video with only video-level label supervision. Due to lack of negative samples of background category, it is difficult for the network to separate foreground and background, resulting in poor detection performance. In this report, we present our 2021 HACS Challenge - Weakly-supervised Learning Track solution, which is based on BaSNet, to address the above problem. Specifically, we first adopt pre-trained CSN, Slowfast, TDN, and ViViT as feature extractors to get feature sequences. Then our proposed Local-Global Background Modeling Network (LGBM-Net) is trained to localize instances by using only video-level labels based on Multi-Instance Learning (MIL). Finally, we ensemble multiple models to get the final detection results and reach 22.45% mAP on the test set.
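作为背景,只有视频级标签时,MIL 的常见做法是把逐帧的类别激活序列按类取 top-k 平均,聚合成视频级得分后再用分类损失训练。下面是该聚合步骤的一个最小 numpy 示意(仅为说明思路,非 LGBM-Net 原始实现;`k_ratio` 为假设的超参数):

```python
import numpy as np

def video_scores(cas, k_ratio=0.125):
    """Aggregate a class activation sequence (T frames x C classes) into
    video-level class scores by averaging the top-k frame activations per
    class -- the standard MIL pooling used when only video labels exist."""
    T = cas.shape[0]
    k = max(1, int(T * k_ratio))
    topk = np.sort(cas, axis=0)[-k:]  # k highest activations per class
    return topk.mean(axis=0)
```

推理时则直接在逐帧激活上做阈值与分组,得到动作实例的起止时刻。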

【4】 Self-Supervised Iterative Contextual Smoothing for Efficient Adversarial Defense against Gray- and Black-Box Attack 标题:自监督迭代上下文平滑有效防御灰箱和黑箱攻击

作者:Sungmin Cha,Naeun Ko,Youngjoon Yoo,Taesup Moon 机构: Department of Electrical and Computer Engineering, Seoul National University, NAVER AI Lab, NAVER Clova 备注:Preprint version 链接:https://arxiv.org/abs/2106.11644 摘要:提出了一种新颖有效的基于输入变换的对抗性灰盒和黑盒攻击防御方法,该方法计算效率高,不需要对分类模型进行任何对抗性训练或再训练。我们首先证明了一个非常简单的迭代高斯平滑可以有效地消除敌对噪声,并获得相当高的鲁棒精度。在此基础上,本文提出了一种自监督迭代上下文平滑(SSICS)算法,该算法的目标是在平滑后的高斯图像中以上下文自适应的方式重建出原始的鉴别特征,同时平滑掉敌对噪声。在ImageNet上的实验表明,我们的SSICS对灰盒和黑盒攻击具有很高的标准精度和很强的鲁棒性;e、 基于转移的PGD攻击和基于分数的攻击。值得注意的一点是,我们的防御是免费的计算昂贵的对抗性训练,但可以接近其强大的精度通过输入转换。 摘要:We propose a novel and effective input transformation based adversarial defense method against gray- and black-box attack, which is computationally efficient and does not require any adversarial training or retraining of a classification model. We first show that a very simple iterative Gaussian smoothing can effectively wash out adversarial noise and achieve substantially high robust accuracy. Based on the observation, we propose Self-Supervised Iterative Contextual Smoothing (SSICS), which aims to reconstruct the original discriminative features from the Gaussian-smoothed image in context-adaptive manner, while still smoothing out the adversarial noise. From the experiments on ImageNet, we show that our SSICS achieves both high standard accuracy and very competitive robust accuracy for the gray- and black-box attacks; e.g., transfer-based PGD-attack and score-based attack. A note-worthy point to stress is that our defense is free of computationally expensive adversarial training, yet, can approach its robust accuracy via input transformation.
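论文的出发点是"非常简单的迭代高斯平滑即可洗掉对抗噪声"。下面用纯 numpy 给出可分离高斯平滑的迭代示意(仅演示这一预处理步骤本身,不含 SSICS 的自监督上下文重建部分;`sigma` 与迭代次数为假设值):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Normalized 1D Gaussian kernel."""
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def smooth_once(img, sigma=1.0):
    """One separable Gaussian smoothing pass (borders handled by reflection)."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    pad = np.pad(img, r, mode="reflect")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def iterative_smoothing(img, sigma=1.0, steps=3):
    """Repeatedly smooth, washing out high-frequency (adversarial) noise."""
    for _ in range(steps):
        img = smooth_once(img, sigma)
    return img
```

SSICS 在此之上再训练一个重建网络,从平滑后的图像中恢复判别性细节。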

【5】 Recent Deep Semi-supervised Learning Approaches and Related Works 标题:深度半监督学习方法及相关研究进展

作者:Gyeongho Kim 机构:Department of Industrial Engineering 链接:https://arxiv.org/abs/2106.11528 摘要:作者的这项工作提出了一个最近的半监督学习方法和相关工作的概述。尽管神经网络在各种应用中取得了显著的成功,但仍然存在一些难以克服的限制,包括需要大量的标记数据。因此,半监督学习(semi-supervised learning)越来越重要,它是一种利用稀缺的标签和大量的未标记数据来训练模型(如深度神经网络)的学习方案。基于半监督学习的主要假设,即流形假设、聚类假设和连续性假设,本文回顾了近年来半监督学习方法的研究进展。特别地,对在半监督学习环境中使用深度神经网络的方法进行了初步讨论。此外,本文首先对现有的研究成果进行了分类和阐释,然后详细阐述了统一上述思想的整体方法。 摘要:The author of this work proposes an overview of the recent semi-supervised learning approaches and related works. Despite the remarkable success of neural networks in various applications, there exist few formidable constraints including the need for a large amount of labeled data. Therefore, semi-supervised learning, which is a learning scheme in which the scarce labels and a larger amount of unlabeled data are utilized to train models (e.g., deep neural networks) is getting more important. Based on the key assumptions of semi-supervised learning, which are the manifold assumption, cluster assumption, and continuity assumption, the work reviews the recent semi-supervised learning approaches. In particular, the methods in regard to using deep neural networks in a semi-supervised learning setting are primarily discussed. In addition, the existing works are first classified based on the underlying idea and explained, and then the holistic approaches that unify the aforementioned ideas are detailed.
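文中聚类假设在深度半监督学习里最直接的体现之一,是置信度阈值伪标签:只保留模型足够确信的无标签样本,并把其预测类别当作训练标签。下面是该步骤的最小示意(仅为说明思路;`tau` 为假设阈值):

```python
import numpy as np

def pseudo_labels(probs, tau=0.95):
    """Confidence-thresholded pseudo-labeling: keep unlabeled samples whose
    max predicted probability exceeds tau, and use the argmax as the label."""
    conf = probs.max(axis=1)
    keep = conf >= tau
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]
```
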

【6】 Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism 标题:利用自监督学习和注意机制改进唇部图像的超声舌象重建

作者:Haiyang Liu,Jihan Zhang 机构:Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan, School of Mechanical Engineering, Southeast University, Nanjing, China 备注:Accepted in KDD Workshop (BIOKDD 2021) 链接:https://arxiv.org/abs/2106.11769 摘要:言语产生是一个动态的过程,涉及舌、颌、唇等多个人体器官。建立声道变形的动力学模型是理解语音的一个基本问题,而语音是人类日常交际中最常见的方式。研究人员使用几个感官流来同时描述这个过程,这些感官流在统计学上无疑与其他感官流相关。在本文中,我们解决了以下问题:给定一个可观察到的嘴唇图像序列,我们可以描绘出相应的舌头运动。将该问题描述为自监督学习问题,采用双流卷积网络和长短记忆网络作为学习任务,并引入注意机制。通过利用未标记的唇部视频预测即将到来的超声舌像序列,我们评估了该方法的性能。结果表明,该模型能够生成与真实舌象接近的图像,实现了两种成像模式的匹配。 摘要:Speech production is a dynamic procedure, which involved multi human organs including the tongue, jaw and lips. Modeling the dynamics of the vocal tract deformation is a fundamental problem to understand the speech, which is the most common way for human daily communication. Researchers employ several sensory streams to describe the process simultaneously, which are incontrovertibly statistically related to other streams. In this paper, we address the following question: given an observable image sequences of lips, can we picture the corresponding tongue motion. We formulated this problem as the self-supervised learning problem, and employ the two-stream convolutional network and long-short memory network for the learning task, with the attention mechanism. We evaluate the performance of the proposed method by leveraging the unlabeled lip videos to predict an upcoming ultrasound tongue image sequence. The results show that our model is able to generate images that close to the real ultrasound tongue images, and results in the matching between two imaging modalities.

时序|行为识别|姿态|视频|运动估计(2篇)

【1】 RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video 标题:RGB2Hands:实时跟踪单目RGB视频中的3D手交互

作者:Jiayi Wang,Franziska Mueller,Florian Bernard,Suzanne Sorli,Oleksandr Sotnychenko,Neng Qian,Miguel A. Otaduy,Dan Casas,Christian Theobalt 备注:None 链接:https://arxiv.org/abs/2106.11725 摘要:在交互过程中,跟踪和重建双手的三维姿态和几何结构是一个具有挑战性的问题,它在许多人机交互应用中具有很高的相关性,包括AR/VR、机器人技术或手语识别。现有的工作要么局限于更简单的跟踪设置(例如,仅考虑一只手或两个空间上分开的手),要么依赖于不太普遍的传感器,例如深度相机。相比之下,在这项工作中,我们提出了首个仅用单个RGB相机、显式考虑双手紧密交互情形下手部骨骼姿态与三维表面几何的实时运动捕捉方法。为了解决RGB数据中固有的深度模糊问题,我们提出了一种新的多任务CNN,该CNN回归多个互补信息,包括分割、三维手模型的密集匹配和二维关键点位置,以及新提出的手内相对深度和手间距离图。这些预测随后用于生成模型拟合框架,以便估计双手的三维手模型的姿势和形状参数。我们通过广泛的消融研究,实验验证了我们的RGB双手跟踪和三维重建管道的各个组成部分。 摘要:Tracking and reconstructing the 3D pose and geometry of two hands in interaction is a challenging problem that has a high relevance for several human-computer interaction applications, including AR/VR, robotics, or sign language recognition. Existing works are either limited to simpler tracking settings (e.g., considering only a single hand or two spatially separated hands), or rely on less ubiquitous sensors, such as depth cameras. In contrast, in this work we present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera that explicitly considers close interactions. In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN that regresses multiple complementary pieces of information, including segmentation, dense matchings to a 3D hand model, and 2D keypoint positions, together with newly proposed intra-hand relative depth and inter-hand distance maps. These predictions are subsequently used in a generative model fitting framework in order to estimate pose and shape parameters of a 3D hand model for both hands. We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline through an extensive ablation study. Moreover, we demonstrate that our approach offers previously unseen two-hand tracking performance from RGB, and quantitatively and qualitatively outperforms existing RGB-based methods that were not explicitly designed for two-hand interactions. Furthermore, our method even performs on-par with depth-based real-time methods.

【2】 Part-Aware Measurement for Robust Multi-View Multi-Human 3D Pose Estimation and Tracking 标题:鲁棒多视角多人体三维位姿估计与跟踪的局部感知测量

作者:Hau Chu,Jia-Hong Lee,Yao-Chih Lee,Ching-Hsien Hsu,Jia-Da Li,Chu-Song Chen 机构:National Taiwan University, Academia Sinica, Taiwan 备注:12 pages with supplementary material; accepted to CVPR 2021 B-AMFG Workshop 链接:https://arxiv.org/abs/2106.11589 摘要:提出了一种基于标定多视点的多人三维姿态估计与跟踪方法。主要的挑战在于即使在多个人体姿态估计有噪声的情况下也能正确地找到交叉视图和时间对应。与以往从多个视图构造三维姿态的方法相比,我们的方法利用时间一致性来匹配在每个视图中由先前构造的三维骨架估计的二维姿态。因此,交叉视图和时间关联是同时完成的。由于性能受到错误关联和噪声预测的影响,我们设计了两种策略以获得更好的匹配和三维重建。具体地说,我们提出了一种用于二维-三维关联的部位感知(part-aware)度量,以及一种能够处理重建过程中二维异常值的滤波器。与最先进的方法相比,我们的方法是高效和有效的;它在两个基准上取得了有竞争力的成绩:Campus上96.8%,Shelf上97.4%。此外,我们延长了Campus评估帧的长度使之更具挑战性,我们的方法同样取得了良好的效果。 摘要:This paper introduces an approach for multi-human 3D pose estimation and tracking based on calibrated multi-view. The main challenge lies in finding the cross-view and temporal correspondences correctly even when several human pose estimations are noisy. Compared to previous solutions that construct 3D poses from multiple views, our approach takes advantage of temporal consistency to match the 2D poses estimated with previously constructed 3D skeletons in every view. Therefore cross-view and temporal associations are accomplished simultaneously. Since the performance suffers from mistaken association and noisy predictions, we design two strategies for aiming better correspondences and 3D reconstruction. Specifically, we propose a part-aware measurement for 2D-3D association and a filter that can cope with 2D outliers during reconstruction. Our approach is efficient and effective compared to state-of-the-art methods; it achieves competitive results on two benchmarks: 96.8% on Campus and 97.4% on Shelf. Moreover, we extend the length of the Campus evaluation frames to be more challenging, and our proposal also reaches well-performing results.
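摘要中的 2D-3D 关联可以理解为:以逐关节(部位)距离的聚合作为代价,在当前帧检测到的 2D 姿态与已重建 3D 骨架的投影之间做匹配。下面用贪心匹配给出一个简化示意(仅为说明思路,并非论文的 part-aware 度量本身;实际方法按部位区别对待并处理遮挡,`thresh` 为假设的离群阈值):

```python
import numpy as np

def associate(poses_2d, proj_3d, thresh=50.0):
    """Greedy 2D-3D association. poses_2d: (N, J, 2) detected 2D poses;
    proj_3d: (M, J, 2) projections of previously built 3D skeletons.
    Cost is the mean per-joint distance; pairs are matched in increasing
    cost order, skipping costs above `thresh` (outlier rejection)."""
    cost = np.linalg.norm(poses_2d[:, None] - proj_3d[None], axis=3).mean(axis=2)
    matches, used_r, used_c = [], set(), set()
    for idx in np.argsort(cost, axis=None):
        r, c = divmod(int(idx), cost.shape[1])
        if r in used_r or c in used_c or cost[r, c] > thresh:
            continue
        matches.append((r, c))
        used_r.add(r)
        used_c.add(c)
    return matches
```
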

医学相关(1篇)

【1】 MIMIR: Deep Regression for Automated Analysis of UK Biobank Body MRI 标题:MIMIR:用于英国生物库体部MRI自动分析的深度回归

作者:Taro Langner,Andrés Martínez Mora,Robin Strand,Håkan Ahlström,Joel Kullberg 链接:https://arxiv.org/abs/2106.11731 摘要:英国生物银行(UKB)正在对50多万名志愿者进行大规模研究,收集有关遗传学、生活方式、血液生化等方面的健康相关信息。此外,医学成像还针对10万名受试者,进行了7万次随访,实现了对器官、肌肉和身体成分的测量。随着累积的MR图像多达170000幅,各种方法也相应地用于大规模图像分析。这项工作提出了一个实验性推理引擎,可以从UKB颈部至膝部人体MRI中自动预测受试者元数据的综合概况。在交叉验证中,它准确地推断了年龄、身高、体重和性别等基线特征,但也模拟了DXA测量的身体成分、器官体积以及握力、脉搏率和2型糖尿病状态等抽象特征(AUC:0.866)。该系统能在数小时内自动分析数千名受试者,并提供个体置信区间。其基本方法是基于卷积神经网络,在MRI数据的二维表示上进行基于图像的均值-方差回归。这项工作的目的是将所提系统免费提供给研究人员,使其在新的英国生物银行图像数据发布后,立即获得72种不同测量值的快速全自动估计。 摘要:UK Biobank (UKB) is conducting a large-scale study of more than half a million volunteers, collecting health-related information on genetics, lifestyle, blood biochemistry, and more. Medical imaging furthermore targets 100,000 subjects, with 70,000 follow-up sessions, enabling measurements of organs, muscle, and body composition. With up to 170,000 mounting MR images, various methodologies are accordingly engaged in large-scale image analysis. This work presents an experimental inference engine that can automatically predict a comprehensive profile of subject metadata from UKB neck-to-knee body MRI. In cross-validation, it accurately inferred baseline characteristics such as age, height, weight, and sex, but also emulated measurements of body composition by DXA, organ volumes, and abstract properties like grip strength, pulse rate, and type 2 diabetic status (AUC: 0.866). The proposed system can automatically analyze thousands of subjects within hours and provide individual confidence intervals. The underlying methodology is based on convolutional neural networks for image-based mean-variance regression on two-dimensional representations of the MRI data. This work aims to make the proposed system available for free to researchers, who can use it to obtain fast and fully-automated estimates of 72 different measurements immediately upon release of new UK Biobank image data.
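文中的"均值-方差回归"通常指网络对每个目标同时输出均值与(对数)方差,并用高斯负对数似然训练,方差头即给出逐个体的置信区间。下面是该损失的最小 numpy 示意(仅为说明思路,非 MIMIR 原始实现):

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Heteroscedastic regression loss: the network predicts a mean `mu` and
    a log-variance `log_var` per target and is trained with the Gaussian
    negative log-likelihood, so uncertainty is learned alongside the mean."""
    return 0.5 * (log_var + (y - mu) ** 2 * np.exp(-log_var)).mean()
```

预测的标准差 exp(log_var / 2) 可直接换算为每个受试者的置信区间。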

GAN|对抗|攻击|生成相关(4篇)

【1】 G-VAE, a Geometric Convolutional VAE for ProteinStructure Generation 标题:G-VAE,一种用于蛋白质结构生成的几何卷积VAE

作者:Hao Huang,Boulbaba Ben Amor,Xichan Lin,Fan Zhu,Yi Fang 机构:New York University Abu Dhabi, Inception Institute of Artificial Intelligence 备注:14 pages 链接:https://arxiv.org/abs/2106.11920 摘要:分析蛋白质的结构是理解其功能以及在分子水平上的生物学作用的关键部分。此外,有条不紊地设计新的蛋白质是一项重大的工程挑战。在这项工作中,我们介绍了一种联合几何神经网络的方法来比较,变形和生成三维蛋白质结构。将蛋白质结构视为三维开放曲线,我们采用平方根速度函数(SRVF)表示,并利用其合适的几何性质和深度残差网络(resnet)进行联合配准和比较。我们的resnet处理更好的大蛋白质变形,同时计算效率更高。在数学框架的基础上,我们进一步设计了一个几何变分自动编码器(G-VAE),该编码器经过训练后,将原来看不见的结构映射到一个低维(潜在的)超球体中。基于预形空间的球形结构,我们自然地采用von Mises-Fisher(vMF)分布来建模我们的隐藏变量。我们通过产生新的蛋白质结构和预测损坏的蛋白质结构的完成来测试我们的模型的有效性。实验结果表明,该方法能够生成不同于训练数据的似然结构。 摘要:Analyzing the structure of proteins is a key part of understanding their functions and thus their role in biology at the molecular level. In addition, design new proteins in a methodical way is a major engineering challenge. In this work, we introduce a joint geometric-neural networks approach for comparing, deforming and generating 3D protein structures. Viewing protein structures as 3D open curves, we adopt the Square Root Velocity Function (SRVF) representation and leverage its suitable geometric properties along with Deep Residual Networks (ResNets) for a joint registration and comparison. Our ResNets handle better large protein deformations while being more computationally efficient. On top of the mathematical framework, we further design a Geometric Variational Auto-Encoder (G-VAE), that once trained, maps original, previously unseen structures, into a low-dimensional (latent) hyper-sphere. Motivated by the spherical structure of the pre-shape space, we naturally adopt the von Mises-Fisher (vMF) distribution to model our hidden variables. We test the effectiveness of our models by generating novel protein structures and predicting completions of corrupted protein structures. Experimental results show that our method is able to generate plausible structures, different from the structures in the training data.
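摘要中的 SRVF 表示定义为 q(t) = c'(t) / sqrt(||c'(t)||),其几何优点是把曲线之间的弹性度量化为 q 上的普通 L2 度量。下面是离散采样曲线上的最小 numpy 示意(仅为说明该表示本身,非论文实现;`eps` 为数值稳定用的假设常数):

```python
import numpy as np

def srvf(curve, eps=1e-8):
    """Square Root Velocity Function of a sampled 3D curve (N x 3):
    q(t) = c'(t) / sqrt(||c'(t)||). Derivatives are taken with finite
    differences along the sampling axis."""
    vel = np.gradient(curve, axis=0)          # c'(t), shape (N, 3)
    speed = np.linalg.norm(vel, axis=1)       # ||c'(t)||
    return vel / np.sqrt(speed + eps)[:, None]
```

对恒速曲线,||q(t)||^2 恰等于曲线速度,这也是下游 ResNet 配准所依赖的性质之一。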

【2】 A Stealthy and Robust Fingerprinting Scheme for Generative Models 标题:一种适用于产生式模型的隐形鲁棒指纹方案

作者:Li Guanlin,Guo Shangwei,Wang Run,Xu Guowen,Zhang Tianwei 机构:Nanyang Technological University, Chongqing University, Wuhan University 链接:https://arxiv.org/abs/2106.11760 摘要:本文提出了一种新的用于生成模型知识产权保护的指纹识别方法。已有的面向判别模型的方案通常采用对抗样本作为指纹,其推理行为和预测结果都表现异常。因此,这些方法不够隐蔽,很容易被对手识别。我们的方法利用隐形后门技术来克服上述限制。具体来说,我们设计验证样本,其模型输出看起来正常,但可以触发后门分类器作出异常预测。我们提出了一种新的后门嵌入方法,结合独特三元组损失(Unique-Triplet Loss)和细粒度分类,以提高指纹的有效性。大量的评估结果表明,该方法对于不同的GAN模型具有更高的鲁棒性、唯一性和隐蔽性。 摘要:This paper presents a novel fingerprinting methodology for the Intellectual Property protection of generative models. Prior solutions for discriminative models usually adopt adversarial examples as the fingerprints, which give anomalous inference behaviors and prediction results. Hence, these methods are not stealthy and can be easily recognized by the adversary. Our approach leverages the invisible backdoor technique to overcome the above limitation. Specifically, we design verification samples, whose model outputs look normal but can trigger a backdoor classifier to make abnormal predictions. We propose a new backdoor embedding approach with Unique-Triplet Loss and fine-grained categorization to enhance the effectiveness of our fingerprints. Extensive evaluations show that this solution can outperform other strategies with higher robustness, uniqueness and stealthiness for various GAN models.
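论文提出的 Unique-Triplet Loss 的具体形式未在摘要中给出;作为背景,下面是其所基于的标准 margin 三元组损失的最小示意(仅为说明思路;`margin` 为假设值):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based triplet loss on embedding batches (B x D):
    pulls the anchor toward the positive and pushes it away from the
    negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```
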

【3】 Wallpaper Texture Generation and Style Transfer Based on Multi-label Semantics 标题:基于多标签语义的墙纸纹理生成与风格转换

作者:Ying Gao,Xiaohan Feng,Tiange Zhang,Eric Rigall,Huiyu Zhou,Lin Qi,Junyu Dong 机构: Zhou is with University of Leicester 备注:IEEE Transactions on Circuits and Systems for Video Technology 链接:https://arxiv.org/abs/2106.11482 摘要:纹理包含着丰富的图像信息,广泛应用于计算机图形学、计算机视觉等各个领域。随着机器学习技术的发展,纹理的合成和生成技术有了很大的提高。作为日常生活中非常常见的元素,壁纸蕴含着丰富的纹理信息,很难用一个简单的标签进行标注。此外,墙纸设计师花大量的时间来创造不同风格的墙纸。为此,本文提出用多标签语义描述壁纸纹理图像。基于这些标签和生成对抗网络,我们提出了一个感知驱动的墙纸纹理生成和风格转换框架。在该框架中,通过训练一个感知模型来识别由生成器网络生成的壁纸是否足够真实,是否具有给定的感知描述所指定的属性;这些多标签语义属性被视为条件变量来生成墙纸图像。生成的墙纸图像可以转换为那些与知名的艺术家风格使用CycleGAN。最后,利用美学评价方法,对生成的壁纸图像进行定量测量。实验结果表明,该方法能够生成符合人类审美的、具有艺术特色的壁纸纹理。 摘要:Textures contain a wealth of image information and are widely used in various fields such as computer graphics and computer vision. With the development of machine learning, the texture synthesis and generation have been greatly improved. As a very common element in everyday life, wallpapers contain a wealth of texture information, making it difficult to annotate with a simple single label. Moreover, wallpaper designers spend significant time to create different styles of wallpaper. For this purpose, this paper proposes to describe wallpaper texture images by using multi-label semantics. Based on these labels and generative adversarial networks, we present a framework for perception driven wallpaper texture generation and style transfer. In this framework, a perceptual model is trained to recognize whether the wallpapers produced by the generator network are sufficiently realistic and have the attribute designated by given perceptual description; these multi-label semantic attributes are treated as condition variables to generate wallpaper images. The generated wallpaper images can be converted to those with well-known artist styles using CycleGAN. Finally, using the aesthetic evaluation method, the generated wallpaper images are quantitatively measured. The experimental results demonstrate that the proposed method can generate wallpaper textures conforming to human aesthetics and have artistic characteristics.

【4】 FDeblur-GAN: Fingerprint Deblurring using Generative Adversarial Network 标题:FDeblur-GAN:基于产生式对抗网络的指纹去模糊

作者:Amol S. Joshi,Ali Dabouei,Jeremy Dawson,Nasser M. Nasrabadi 机构:West Virginia University 备注:8 Pages, Accepted in IJCB Conference 链接:https://arxiv.org/abs/2106.11354 摘要:在处理从犯罪现场、移动摄像机或低质量传感器获取的指纹图像时,由于图像模糊和失真,自动识别系统很难验证身份。提出了一种基于条件生成对抗网络(cGANs)的指纹去模糊模型FDeblur-GAN。此外,我们将两个辅助子网络整合到模型中,以执行去模糊任务。第一个子网络是脊提取模型。为了保证在去模糊过程中保留指纹信息和细节点,防止模型产生错误的细节点,增加了生成脊线图的功能。第二个子网是验证器,它帮助生成器在生成过程中保存ID信息。利用模糊指纹数据库和相应的脊线图,deep网络从输入的模糊样本中学习去模糊。结合两种不同的指纹匹配算法对该方法进行了评价。在我们的指纹数据库中,我们实现了95.18%的准确率来匹配去模糊和真实的指纹。 摘要:While working with fingerprint images acquired from crime scenes, mobile cameras, or low-quality sensors, it becomes difficult for automated identification systems to verify the identity due to image blur and distortion. We propose a fingerprint deblurring model FDeblur-GAN, based on the conditional Generative Adversarial Networks (cGANs) and multi-stage framework of the stack GAN. Additionally, we integrate two auxiliary sub-networks into the model for the deblurring task. The first sub-network is a ridge extractor model. It is added to generate ridge maps to ensure that fingerprint information and minutiae are preserved in the deblurring process and prevent the model from generating erroneous minutiae. The second sub-network is a verifier that helps the generator to preserve the ID information during the generation process. Using a database of blurred fingerprints and corresponding ridge maps, the deep network learns to deblur from the input blurry samples. We evaluate the proposed method in combination with two different fingerprint matching algorithms. We achieved an accuracy of 95.18% on our fingerprint database for the task of matching deblurred and ground truth fingerprints.

自动驾驶|车辆|车道检测等(1篇)

【1】 nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles 标题:NuPlan:一种基于ML的自动驾驶车辆闭环规划基准

作者:Holger Caesar,Juraj Kabzan,Kok Seang Tan,Whye Kit Fong,Eric Wolff,Alex Lang,Luke Fletcher,Oscar Beijbom,Sammy Omari 机构:Motional 备注:Camera-ready for CVPR ADP3 workshop 链接:https://arxiv.org/abs/2106.11810 摘要:在这项工作中,我们提出了世界上第一个闭环ML为基础的规划基准自主驾驶。虽然有越来越多的基于ML的运动规划器,但是缺乏成熟的数据集和度量限制了这一领域的进展。现有的自主车辆运动预测基准主要集中于短期运动预测,而不是长期规划。这导致以前的工作使用基于L2的度量的开环评估,这不适合公平地评估长期规划。我们的基准测试通过引入大规模驾驶数据集、轻量级闭环模拟器和特定于运动规划的度量来克服这些限制。我们提供了一个高质量的数据集,其中包含来自美国和亚洲4个城市(波士顿、匹兹堡、拉斯维加斯和新加坡)的1500h人类驾驶数据。我们将提供一个带有反应式代理的闭环仿真框架,并提供大量通用和场景特定的规划度量。我们计划在NeurIPS 2021年发布数据集,并从2022年初开始组织基准挑战。 摘要:In this work, we propose the world's first closed-loop ML-based planning benchmark for autonomous driving. While there is a growing body of ML-based motion planners, the lack of established datasets and metrics has limited the progress in this area. Existing benchmarks for autonomous vehicle motion prediction have focused on short-term motion forecasting, rather than long-term planning. This has led previous works to use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning. Our benchmark overcomes these limitations by introducing a large-scale driving dataset, lightweight closed-loop simulator, and motion-planning-specific metrics. We provide a high-quality dataset with 1500h of human driving data from 4 cities across the US and Asia with widely varying traffic patterns (Boston, Pittsburgh, Las Vegas and Singapore). We will provide a closed-loop simulation framework with reactive agents and provide a large set of both general and scenario-specific planning metrics. We plan to release the dataset at NeurIPS 2021 and organize benchmark challenges starting in early 2022.

NAS模型搜索(1篇)

【1】 Differentiable Architecture Search Without Training Nor Labels: A Pruning Perspective 标题:无训练和无标签的可区分体系结构搜索:一种剪枝观点

作者:Miao Zhang,Steven Su,Shirui Pan,Xiaojun Chang,Wei Huang,Gholamreza Haffari 机构:Monash University, University of Technology Sydney 链接:https://arxiv.org/abs/2106.11542 摘要:利用权重共享和连续松弛,使梯度下降通过双层优化范式交替优化超网(supernet)权重和架构参数,可微结构搜索(Differentiable ARchiTecture Search,DARTS)以其简单高效的特点成为神经结构搜索(Neural ARchiTecture Search,NAS)的主流方法。然而,最近的研究发现,在DARTS中,搜索结构的性能几乎没有随着优化的进行而提高。此外,一些同期工作表明,NAS不借助标签也能找到更具竞争力的体系结构。上述观察结果表明,DARTS中的监督信号可能是架构优化的一个不良指标,引发了一个基本问题:与其使用监督信号执行双层优化,我们能否在不需要任何训练或标签的情况下找到高质量的架构?我们提供了一个肯定的答案,将NAS转化为初始化时的网络剪枝问题。通过利用最新的初始化网络剪枝技术,我们设计了一个无需任何训练或标签的FreeFlow代理来评估NAS中候选操作的重要性,并相应地提出了一个新的框架,称为training and label free neural architecture search(FreeNAS)。我们表明,在没有任何训练和标签的情况下,使用所提出的FreeFlow代理的FreeNAS可以优于大多数NAS基线。更重要的是,我们的框架非常高效,它只在单个GPU上的3.6s和79s内分别完成NAS-Bench-201和DARTS搜索空间的架构搜索。我们希望我们的工作能从初始化剪枝的角度激发更多解决NAS的尝试。 摘要:With leveraging the weight-sharing and continuous relaxation to enable gradient-descent to alternately optimize the supernet weights and the architecture parameters through a bi-level optimization paradigm, Differentiable ARchiTecture Search (DARTS) has become the mainstream method in Neural Architecture Search (NAS) due to its simplicity and efficiency. However, more recent works found that the performance of the searched architecture barely increases with the optimization proceeding in DARTS. In addition, several concurrent works show that the NAS could find more competitive architectures without labels. The above observations reveal that the supervision signal in DARTS may be a poor indicator for architecture optimization, inspiring a foundational question: instead of using the supervision signal to perform bi-level optimization, can we find high-quality architectures without any training nor labels? We provide an affirmative answer by customizing the NAS as a network pruning at initialization problem. By leveraging recent techniques on the network pruning at initialization, we designed a FreeFlow proxy to score the importance of candidate operations in NAS without any training nor labels, and proposed a novel framework called training and label free neural architecture search (FreeNAS) accordingly. We show that, without any training nor labels, FreeNAS with the proposed FreeFlow proxy can outperform most NAS baselines. More importantly, our framework is extremely efficient, which completes the architecture search within only 3.6s and 79s on a single GPU for the NAS-Bench-201 and DARTS search space, respectively. We hope our work inspires more attempts in solving NAS from the perspective of pruning at initialization.
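摘要未公开 FreeFlow 代理的具体形式;作为参考,下面给出一类与之思路相近的"初始化剪枝"式无数据、无标签打分(SynFlow 风格)在纯线性链网络上的最小 numpy 示意(仅为说明该类代理的计算方式,非论文实现):

```python
import numpy as np

def synflow_scores(weights):
    """Label- and data-free saliency in the SynFlow style for a chain of
    linear layers (each w has shape (out, in)): run an all-ones input
    through the network with |W|, take R = sum of outputs, and score each
    weight by |w * dR/dw| -- no data and no labels required."""
    abs_w = [np.abs(w) for w in weights]
    # forward pass on the ones input
    acts = [np.ones(abs_w[0].shape[1])]
    for w in abs_w:
        acts.append(w @ acts[-1])
    # backward pass: gradient of R = sum(output) w.r.t. each layer
    grad = np.ones_like(acts[-1])
    scores = []
    for w, a in zip(reversed(abs_w), reversed(acts[:-1])):
        scores.append(w * np.outer(grad, a))  # |w| * dR/d|w|, all nonnegative
        grad = w.T @ grad
    return scores[::-1]
```

在 NAS 场景下,把这类逐权重得分按候选操作汇总,即可在不训练的情况下为架构搜索排序。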

Attention注意力(1篇)

【1】 Understanding top-down attention using task-oriented ablation design 标题:使用面向任务的消融设计理解自上而下的注意

作者:Freddie Bickford Smith,Brett D Roads,Xiaoliang Luo,Bradley C Love 机构:University of Oxford, University College London 链接:https://arxiv.org/abs/2106.11339 摘要:自上而下的注意力允许神经网络,无论是人工的还是生物的,专注于与给定任务最相关的信息。众所周知,这可以提高视觉感知能力。但注意力是如何促进知觉的,尤其是在自然环境下,比如在日常场景中识别物体时,还不清楚。注意力有助于处理视觉任务的哪些方面?我们的目标是通过一个基于任务导向设计的计算实验来回答这个问题。首先,我们定义了广泛的视觉任务,并确定了任务可变性的六个因素。然后在每个任务上,我们比较两个神经网络的性能,一个有自上而下的注意,另一个没有。这些比较揭示了注意知觉提升的任务依赖性,使人们对注意所扮演的角色有了更清晰的认识。然而,许多现有的认知账户将注意力与刺激水平变量联系起来,如视觉混乱和物体尺度,我们发现系统水平变量具有更大的解释力,这些变量捕捉了模型、训练数据分布和任务格式之间的相互作用。这一发现表明,注意力研究方式的转变可能是卓有成效的。我们公开了我们的代码和结果,以及与基于ImageNet的实验相关的统计数据。我们的贡献有助于开发更多类似人类的视觉模型和设计更多信息的机器学习实验。 摘要:Top-down attention allows neural networks, both artificial and biological, to focus on the information most relevant for a given task. This is known to enhance performance in visual perception. But it remains unclear how attention brings about its perceptual boost, especially when it comes to naturalistic settings like recognising an object in an everyday scene. What aspects of a visual task does attention help to deal with? We aim to answer this with a computational experiment based on a general framework called task-oriented ablation design. First we define a broad range of visual tasks and identify six factors that underlie task variability. Then on each task we compare the performance of two neural networks, one with top-down attention and one without. These comparisons reveal the task-dependence of attention's perceptual boost, giving a clearer idea of the role attention plays. Whereas many existing cognitive accounts link attention to stimulus-level variables, such as visual clutter and object scale, we find greater explanatory power in system-level variables that capture the interaction between the model, the distribution of training data and the task format. This finding suggests a shift in how attention is studied could be fruitful. We make publicly available our code and results, along with statistics relevant to ImageNet-based experiments beyond this one. Our contribution serves to support the development of more human-like vision models and the design of more informative machine-learning experiments.

人脸|人群计数(3篇)

【1】 MetaAvatar: Learning Animatable Clothed Human Models from Few Depth Images 标题:MetaAvatar:从少量深度图像中学习可动画的着装人体模型

作者:Shaofei Wang,Marko Mihajlovic,Qianli Ma,Andreas Geiger,Siyu Tang 机构:ETH Zürich, Max Planck Institute for Intelligent Systems, Tübingen, University of Tübingen, neuralbodies.github.iometavatar 备注:17 pages, 9 figures. Project page: this https URL 链接:https://arxiv.org/abs/2106.11944 摘要:在这篇论文中,我们的目标是建立可推广和可控的神经符号距离场(SDF),代表人类衣服从单目深度观测。近年来,随着深度学习技术的发展,特别是神经内隐表征技术的发展,使得人们能够从不同的传感器输入中进行形状重建和可控的化身生成。然而,要从新的输入姿势生成真实的布料变形,通常需要水密网格或密集的全身扫描作为输入。此外,由于很难有效地为不同体型和布料类型的布料变形建模,现有的方法从零开始就采用每个主题/布料类型的优化方法,这在计算上非常昂贵。相比之下,我们提出了一种方法,可以快速生成真实的人类头像,表示为可控的神经sdf,只给定单目深度图像。我们通过使用元学习来学习一个超网络的初始化来实现这一点,该超网络可以预测神经sdf的参数。该超网络以人体姿态为条件,表示一个根据输入姿态非刚性变形的穿戴式神经化身。同时,与从头开始训练的模型相比,它能够有效地融合不同体型和布料类型的先验知识,从而可以更快地进行微调。定性和定量分析表明,我们的方法优于最新的需要完整网格作为输入的方法,而我们的方法只需要深度帧作为输入,并且运行速度快几个数量级。此外,我们证明了我们的元学习超网络是非常强大的,是第一个生成具有真实动态布料变形的头像,只给出8个单目深度帧。 摘要:In this paper, we aim to create generalizable and controllable neural signed distance fields (SDFs) that represent clothed humans from monocular depth observations. Recent advances in deep learning, especially neural implicit representations, have enabled human shape reconstruction and controllable avatar generation from different sensor inputs. However, to generate realistic cloth deformations from novel input poses, watertight meshes or dense full-body scans are usually needed as inputs. Furthermore, due to the difficulty of effectively modeling pose-dependent cloth deformations for diverse body shapes and cloth types, existing approaches resort to per-subject/cloth-type optimization from scratch, which is computationally expensive. In contrast, we propose an approach that can quickly generate realistic clothed human avatars, represented as controllable neural SDFs, given only monocular depth images. We achieve this by using meta-learning to learn an initialization of a hypernetwork that predicts the parameters of neural SDFs. The hypernetwork is conditioned on human poses and represents a clothed neural avatar that deforms non-rigidly according to the input poses. Meanwhile, it is meta-learned to effectively incorporate priors of diverse body shapes and cloth types and thus can be much faster to fine-tune, compared to models trained from scratch. We qualitatively and quantitatively show that our approach outperforms state-of-the-art approaches that require complete meshes as inputs while our approach requires only depth frames as inputs and runs orders of magnitudes faster. Furthermore, we demonstrate that our meta-learned hypernetwork is very robust, being the first to generate avatars with realistic dynamic cloth deformations given as few as 8 monocular depth frames.

【2】 A Survey on Human-aware Robot Navigation 标题:人类感知机器人导航研究综述

作者:Ronja Möller,Antonino Furnari,Sebastiano Battiato,Aki Härmä,Giovanni Maria Farinella 机构:Department of Mathematics and Informatics, University of Catania, V. Andrea Doria, Catania , Italy, Philips Research, High Tech Campus , Eindhoven, Netherlands, Cognitive Robotics and Social Sensing Laboratory, ICAR-CNR, Palermo, Italy 备注:Robotics and Autonomous Systems, 2021 链接:https://arxiv.org/abs/2106.11650 摘要:智能系统越来越成为我们日常生活的一部分,已经无缝集成到很难想象没有智能系统的世界。另一方面,这些系统的物理表现形式,以具体化的代理或机器人的形式,迄今为止仅用于特定的应用,而且往往局限于功能角色(例如在工业、娱乐和军事领域)。考虑到目前机器人导航、人机交互和人类活动识别等研究领域的发展和创新,这种情况似乎很快就会改变。机器人越来越容易获得和使用,人们对机器人的接受程度也在不断提高。然而,设计一个社会兼容的机器人,可以作为一个同伴需要考虑到各个领域的研究。本文关注的是一个社会顺应机器人的导航方面,并提供了一个现有的解决方案,为相关领域的研究以及对未来可能的方向展望。 摘要:Intelligent systems are increasingly part of our everyday lives and have been integrated seamlessly to the point where it is difficult to imagine a world without them. Physical manifestations of those systems on the other hand, in the form of embodied agents or robots, have so far been used only for specific applications and are often limited to functional roles (e.g. in the industry, entertainment and military fields). Given the current growth and innovation in the research communities concerned with the topics of robot navigation, human-robot-interaction and human activity recognition, it seems like this might soon change. Robots are increasingly easy to obtain and use and the acceptance of them in general is growing. However, the design of a socially compliant robot that can function as a companion needs to take various areas of research into account. This paper is concerned with the navigation aspect of a socially-compliant robot and provides a survey of existing solutions for the relevant areas of research as well as an outlook on possible future directions.

【3】 Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images 标题:Deep3DPose:从单幅RGB图像实时重建任意姿态的人体

作者:Liguo Jiang,Miaopeng Li,Jianjie Zhang,Congyi Wang,Juntao Ye,Xinguo Liu,Jinxiang Chai 链接:https://arxiv.org/abs/2106.11536 摘要:介绍了一种从单幅图像中实时精确重建三维人体姿态和详细的三维人体几何模型的方法。该方法的核心思想是一个新颖的端到端多任务深度学习框架,该框架使用单个图像同时预测五个输出:前景分割掩模、二维关节位置、语义人体分区、三维部位朝向和uv坐标(uv map)。多任务网络结构不仅能产生更多的重建视觉线索,而且能使每一项预测更加准确。将CNN回归器与基于优化的算法相结合,实现了精确的运动学位姿重建和全身形状建模。结果表明,实时重建达到了前所未有的精确拟合,特别是对于野外图像。我们展示了我们的实时三维姿态和人体重建系统在各种具有挑战性的野外视频上的结果。通过与现有方法的定量评价和比较,表明该系统推进了从单幅图像重建三维人体和姿态的前沿。 摘要:We introduce an approach that accurately reconstructs 3D human poses and detailed 3D full-body geometric models from single images in realtime. The key idea of our approach is a novel end-to-end multi-task deep learning framework that uses single images to predict five outputs simultaneously: foreground segmentation mask, 2D joints positions, semantic body partitions, 3D part orientations and uv coordinates (uv map). The multi-task network architecture not only generates more visual cues for reconstruction, but also makes each individual prediction more accurate. The CNN regressor is further combined with an optimization based algorithm for accurate kinematic pose reconstruction and full-body shape modeling. We show that the realtime reconstruction reaches accurate fitting that has not been seen before, especially for wild images. We demonstrate the results of our realtime 3D pose and human body reconstruction system on various challenging in-the-wild videos. We show the system advances the frontier of 3D human body and pose reconstruction from single images by quantitative evaluations and comparisons with state-of-the-art methods.

跟踪(1篇)

【1】 Tracking Instances as Queries 标题:将实例作为查询进行跟踪

作者:Shusheng Yang,Yuxin Fang,Xinggang Wang,Yu Li,Ying Shan,Bin Feng,Wenyu Liu 机构:School of EIC, Huazhong University of Science & Technology, Applied Research Center (ARC), Tencent PCG 备注:None 链接:https://arxiv.org/abs/2106.11963 摘要:近年来,基于查询的深度网络由于其端到端的流水线结构和在目标检测、语义分割、实例分割等基本计算机视觉任务中的竞争优势而受到广泛关注。然而,如何建立一个架构优雅、性能强大的基于查询的视频实例分割(VIS)框架还有待解决。在本文中,我们提出了QueryTrack(即把实例作为查询来跟踪),一个统一的基于查询的VIS框架,充分利用了QueryInst中实例和查询之间内在的一对一对应关系。该方法在YouTube-VIS-2019/2021数据集上获得52.7/52.3 AP,凭借单一在线端到端模型、单尺度测试和适量训练数据,在CVPR 2021的YouTube-VIS挑战赛中获得第二名。我们还在YouTube-VIS-2021数据集上提供QueryTrack-ResNet-50基线结果,作为VIS社区的参考。 摘要:Recently, query based deep networks catch lots of attention owing to their end-to-end pipeline and competitive results on several fundamental computer vision tasks, such as object detection, semantic segmentation, and instance segmentation. However, how to establish a query based video instance segmentation (VIS) framework with elegant architecture and strong performance remains to be settled. In this paper, we present QueryTrack (i.e., tracking instances as queries), a unified query based VIS framework fully leveraging the intrinsic one-to-one correspondence between instances and queries in QueryInst. The proposed method obtains 52.7 / 52.3 AP on YouTube-VIS-2019 / 2021 datasets, which wins the 2nd place in the YouTube-VIS Challenge at CVPR 2021 with a single online end-to-end model, single scale testing & modest amount of training data. We also provide QueryTrack-ResNet-50 baseline results on YouTube-VIS-2021 dataset as references for the VIS community.

蒸馏|知识提取(1篇)

【1】 DeepMesh: Differentiable Iso-Surface Extraction 标题:DeepMesh:可微等值面提取

作者:Benoit Guillard,Edoardo Remelli,Artem Lukoianov,Stephan Richter,Timur Bagautdinov,Pierre Baque,Pascal Fua 备注:arXiv admin note: substantial text overlap with arXiv:2006.03997 链接:https://arxiv.org/abs/2106.11795 摘要:近年来,随着连续深度隐式场的出现,几何深度学习取得了显著的进展。它们允许对任意拓扑的水密表面进行详细建模,而不依赖于三维欧几里德网格,从而产生分辨率无限的可学习参数化。不幸的是,这些方法通常不适用于需要基于显式网格的曲面表示的应用,因为将隐式场转换为这种表示依赖于Marching Cubes算法,而该算法无法对底层隐式场求导。在这项工作中,我们消除了这一限制,并介绍了一种从深度隐式场生成显式表面网格表示的可微方法。我们的主要见解是,通过推理隐式场扰动如何影响局部表面几何,最终可以对表面样本的三维位置关于底层深度隐式场求导。我们利用这一点来定义DeepMesh——可以改变其拓扑结构的端到端可微网格表示。我们使用两个不同的应用来验证我们的理论观点:通过可微渲染的单视图三维重建和物理驱动的形状优化。在这两种情况下,我们的端到端可微参数化使我们比最先进的算法有优势。 摘要:Geometric Deep Learning has recently made striking progress with the advent of continuous Deep Implicit Fields. They allow for detailed modeling of watertight surfaces of arbitrary topology while not relying on a 3D Euclidean grid, resulting in a learnable parameterization that is unlimited in resolution. Unfortunately, these methods are often unsuitable for applications that require an explicit mesh-based surface representation because converting an implicit field to such a representation relies on the Marching Cubes algorithm, which cannot be differentiated with respect to the underlying implicit field. In this work, we remove this limitation and introduce a differentiable way to produce explicit surface mesh representations from Deep Implicit Fields. Our key insight is that by reasoning on how implicit field perturbations impact local surface geometry, one can ultimately differentiate the 3D location of surface samples with respect to the underlying deep implicit field. We exploit this to define DeepMesh -- end-to-end differentiable mesh representation that can vary its topology. We use two different applications to validate our theoretical insight: Single view 3D Reconstruction via Differentiable Rendering and Physically-Driven Shape Optimization. In both cases our end-to-end differentiable parameterization gives us an edge over state-of-the-art algorithms.
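论文的核心观察是:隐式场的扰动会沿法向移动零水平集上的采样点,因此表面样本的三维位置可以对隐式场求导。下面用一个极简的Python示意(并非论文实现,隐式场、迭代步数等均为示意假设)演示把采样点投影到隐式场零水平集(等值面)上的几何关系:

```python
import math

def sphere_sdf(p, r=1.0):
    # 示意用隐式场:半径 r 的球的符号距离函数
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - r

def num_grad(f, p, eps=1e-6):
    # 中心差分数值梯度,近似 ∇f(p)
    g = []
    for i in range(3):
        q1, q2 = list(p), list(p)
        q1[i] += eps
        q2[i] -= eps
        g.append((f(q1) - f(q2)) / (2 * eps))
    return g

def project_to_surface(f, p, iters=20):
    # 牛顿式迭代 p ← p - f(p)·∇f / |∇f|²,把采样点拉到零水平集上
    for _ in range(iters):
        d = f(p)
        g = num_grad(f, p)
        n2 = sum(gi * gi for gi in g) + 1e-12
        p = [pi - d * gi / n2 for pi, gi in zip(p, g)]
    return p

p = project_to_surface(sphere_sdf, [0.3, 0.4, 1.2])
```

在可微框架(如PyTorch)中把数值梯度换成自动微分,即可沿同样的几何关系把梯度回传给隐式场的参数。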

超分辨率|去噪|去模糊|去雾(1篇)

【1】 Spatial-Temporal Super-Resolution of Satellite Imagery via Conditional Pixel Synthesis 标题:基于条件像元合成的卫星图像时空超分辨率研究

作者:Yutong He,Dingjie Wang,Nicholas Lai,William Zhang,Chenlin Meng,Marshall Burke,David B. Lobell,Stefano Ermon 机构:Wiiliam Zhang, Stanford University 链接:https://arxiv.org/abs/2106.11485 摘要:高分辨率卫星图像已被证明对广泛的任务有用,包括测量全球人口、当地经济生计和生物多样性等。不幸的是,高分辨率图像收集的频率很低,购买成本也很高,因此很难在时间和空间上有效地扩展这些下游任务。我们提出了一种新的条件像素合成模型,该模型利用丰富、低成本、低分辨率的图像在不可用的时间和地点生成精确的高分辨率图像。我们的研究表明,我们的模型在一个关键的下游任务——物体计数——上达到了照片真实的样本质量,并且优于竞争基线,特别是在地面条件变化迅速的地理位置上。 摘要:High-resolution satellite imagery has proven useful for a broad range of tasks, including measurement of global human population, local economic livelihoods, and biodiversity, among many others. Unfortunately, high-resolution imagery is both infrequently collected and expensive to purchase, making it hard to efficiently and effectively scale these downstream tasks over both time and space. We propose a new conditional pixel synthesis model that uses abundant, low-cost, low-resolution imagery to generate accurate high-resolution imagery at locations and times in which it is unavailable. We show that our model attains photo-realistic sample quality and outperforms competing baselines on a key downstream task -- object counting -- particularly in geographic locations where conditions on the ground are changing rapidly.

多模态(1篇)

【1】 Multimodal trajectory forecasting based on discrete heat map 标题:基于离散热图的多模态轨迹预测

作者:Jingni Yuan,Jianyun Xu,Yushi Zhu 机构:HIKVISION Research Institute 链接:https://arxiv.org/abs/2106.11467 摘要:在Argoverse运动预测比赛中,任务是预测交通场景中感兴趣目标的未来轨迹概率分布。我们使用矢量化车道图和目标2秒的历史轨迹作为输入。然后,模型为每个目标输出6条带概率的预测轨迹。 摘要:In Argoverse motion forecasting competition, the task is to predict the probabilistic future trajectory distribution for the interested targets in the traffic scene. We use vectorized lane map and 2 s targets' history trajectories as input. Then the model outputs 6 forecasted trajectories with probability for each target.
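"从离散热图中取出若干条带概率的模态"这一思路可以用如下极简Python示意(纯演示代码,与比赛提交的模型无关,热图数值为虚构):

```python
def topk_modes(heatmap, k=6):
    """从离散热图中取分数最高的 k 个栅格作为多模态终点假设。
    heatmap: 二维列表,元素为非负分数;返回 [(row, col, prob), ...],prob 在 top-k 内归一化。"""
    cells = [(v, r, c) for r, row in enumerate(heatmap) for c, v in enumerate(row)]
    cells.sort(reverse=True)               # 按分数降序
    top = cells[:k]
    total = sum(v for v, _, _ in top) or 1.0
    return [(r, c, v / total) for v, r, c in top]

modes = topk_modes([[0.1, 0.5, 0.0],
                    [0.2, 0.9, 0.1],
                    [0.0, 0.3, 0.4]], k=3)
```

实际系统中通常还会对邻近栅格做非极大值抑制,以避免多个模态落在同一个峰上。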

其他神经网络|深度学习|模型|建模(6篇)

【1】 RootPainter3D: Interactive-machine-learning enables rapid and accurate contouring for radiotherapy 标题:RootPainter3D:交互式机器学习实现快速而准确的放射治疗勾画

作者:Abraham George Smith,Jens Petersen,Cynthia Terrones-Campos,Anne Kiil Berthelsen,Nora Jarrett Forbes,Sune Darkner,Lena Specht,Ivan Richter Vogelius 机构:Department of Computer Science, University of Copenhagen, Department of Oncology, Rigshospitalet, University of Copenhagen, Department of Infectious Diseases, Rigshospitalet, University of Copenhagen 链接:https://arxiv.org/abs/2106.11942 摘要:危险器官勾画仍然是放射治疗中的一个瓶颈,许多深度学习方法在临床数据上评估时达不到预期的效果。我们考察了在危险器官勾画任务中使用交互式机器学习方法所带来的准确性和时间节省。我们将该方法与Eclipse勾画软件进行了比较,发现它与手工勾画有很强的一致性,Dice得分为0.95。随着标注图像数量的增加,使用校正式标注创建标注所需的时间也越来越少,与手动方法相比节省了大量时间:在勾画了923张图像之后,每个心脏平均只需2分2秒即可勾画完成,而手动勾画平均需要7分1秒。我们的实验表明,带校正式标注的交互式机器学习为非计算机科学家提供了一种快速、易用的方式,使其能够在常规临床工作流程中训练深度学习模型来分割自己感兴趣的结构。源代码见 https://github.com/Abe404/RootPainter3D 。 摘要:Organ-at-risk contouring is still a bottleneck in radiotherapy, with many deep learning methods falling short of promised results when evaluated on clinical data. We investigate the accuracy and time-savings resulting from the use of an interactive-machine-learning method for an organ-at-risk contouring task. We compare the method to the Eclipse contouring software and find strong agreement with manual delineations, with a dice score of 0.95. The annotations created using corrective-annotation also take less time to create as more images are annotated, resulting in substantial time savings compared to manual methods, with hearts that take 2 minutes and 2 seconds to delineate on average, after 923 images have been delineated, compared to 7 minutes and 1 second when delineating manually. Our experiment demonstrates that interactive-machine-learning with corrective-annotation provides a fast and accessible way for non computer-scientists to train deep-learning models to segment their own structures of interest as part of routine clinical workflows. Source code is available at https://github.com/Abe404/RootPainter3D.

【2】 On the importance of cross-task features for class-incremental learning 标题:论跨任务特征对类增量学习的重要性

作者:Albin Soutif--Cormerais,Marc Masana,Joost Van de Weijer,Bartłomiej Twardowski 机构:LAMP team, Computer Vision Center, UAB Barcelona, Spain; VLO team, Institute of Computer Graphics and Vision, TU Graz, Austria 备注:includes supplementary material 链接:https://arxiv.org/abs/2106.11930 摘要:在类增量学习中,资源有限的agent需要学习一系列的分类任务,形成了一个不断增长的分类问题,其约束条件是不能访问以前任务中的数据。与任务增量学习(其在推理时可获得任务ID)的主要区别在于,学习者还需要执行跨任务区分,即区分从未一起出现过的类。解决这个问题的方法有很多种,大多数都是利用大小不可忽略的外部存储器(缓冲区)。本文对跨任务特征的学习进行了消融研究,并考察其对类增量学习中基本重放策略性能的影响。我们还为类增量学习定义了一个新的遗忘度量,并发现遗忘并不是性能低下的主要原因。实验结果表明,未来的类增量学习算法不仅要防止遗忘,而且要提高跨任务特征的质量。当每个任务的类数较少时,这一点尤为重要。 摘要:In class-incremental learning, an agent with limited resources needs to learn a sequence of classification tasks, forming an ever growing classification problem, with the constraint of not being able to access data from previous tasks. The main difference with task-incremental learning, where a task-ID is available at inference time, is that the learner also needs to perform cross-task discrimination, i.e. distinguish between classes that have not been seen together. Approaches to tackle this problem are numerous and mostly make use of an external memory (buffer) of non-negligible size. In this paper, we ablate the learning of cross-task features and study its influence on the performance of basic replay strategies used for class-IL. We also define a new forgetting measure for class-incremental learning, and see that forgetting is not the principal cause of low performance. Our experimental results show that future algorithms for class-incremental learning should not only prevent forgetting, but also aim to improve the quality of the cross-task features. This is especially important when the number of classes per task is small.
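文中的"基本重放策略"通常依赖一个小的外部缓冲区。下面是用蓄水池采样维护均匀回放缓冲区的极简Python示意(通用技巧,并非该论文的具体实现):

```python
import random

class ReplayBuffer:
    """蓄水池采样回放缓冲区:对流式到来的样本维持一个均匀随机子样本,
    常用于类增量学习的重放策略(仅为示意实现)。"""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            # 以 capacity / n_seen 的概率替换缓冲区中的随机一项
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = sample

buf = ReplayBuffer(capacity=10)
for x in range(1000):
    buf.add(x)
```

这样无论数据流多长,缓冲区中的每个历史样本被保留的概率都相同。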

【3】 Residual Networks as Flows of Velocity Fields for Diffeomorphic Time Series Alignment 标题:作为速度场流的残差网络用于微分同胚时间序列对齐

作者:Hao Huang,Boulbaba Ben Amor,Xichan Lin,Fan Zhu,Yi Fang 机构:New York University Abu Dhabi, Inception Institute of Artificial Intelligence 备注:19 pages 链接:https://arxiv.org/abs/2106.11911 摘要:非线性(大)时间扭曲是时间序列分析中一个具有挑战性的干扰源。在本文中,我们提出了一种新的微分同胚时间Transformer网络,用于成对和联合时间序列对齐。我们的ResNet-TW(Deep Residual Network for Time Warping)通过复合一个增量微分同胚映射流来解决对齐问题。在流方程的控制下,我们的残差网络(ResNet)建立了光滑、流畅、规则的速度场流,从而产生光滑且可逆的变换(即微分同胚翘曲函数)。受优雅的大变形微分同胚度量映射(LDDMM)框架的启发,最终的变换由依赖于时间的向量场流构建,而这些向量场正是残差网络的构建块。后者可自然地视为流方程(一个常微分方程)的欧拉离散格式。训练完成后,我们的ResNet-TW只需一次廉价的前向传递即可对齐未见过的数据。正如我们在单变量(来自UCR档案的84个数据集)和多变量时间序列(MSR Action-3D、Florence-3D和MSR Daily Activity)上的实验所示,ResNet-TW在联合对齐和分类方面取得了有竞争力的性能。 摘要:Non-linear (large) time warping is a challenging source of nuisance in time-series analysis. In this paper, we propose a novel diffeomorphic temporal transformer network for both pairwise and joint time-series alignment. Our ResNet-TW (Deep Residual Network for Time Warping) tackles the alignment problem by compositing a flow of incremental diffeomorphic mappings. Governed by the flow equation, our Residual Network (ResNet) builds smooth, fluid and regular flows of velocity fields and consequently generates smooth and invertible transformations (i.e. diffeomorphic warping functions). Inspired by the elegant Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework, the final transformation is built by the flow of time-dependent vector fields which are none other than the building blocks of our Residual Network. The latter is naturally viewed as an Eulerian discretization schema of the flow equation (an ODE). Once trained, our ResNet-TW aligns unseen data by a single inexpensive forward pass. As we show in experiments on both univariate (84 datasets from UCR archive) and multivariate time-series (MSR Action-3D, Florence-3D and MSR Daily Activity), ResNet-TW achieves competitive performance in joint alignment and classification.
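论文把残差网络视为流方程的欧拉离散:每个残差块对应一步 φ ← φ + dt·v(φ)。下面用1维情形做一个极简示意(速度场与步长均为示意假设,并非论文网络):只要步长足够小,积分得到的翘曲函数保持单调,因而可逆(微分同胚)。

```python
import math

def integrate_warp(v, n_points=51, steps=100):
    """按欧拉格式 φ ← φ + dt·v(φ) 把速度场 v 积分成 [0,1] 上的翘曲函数。
    对应"残差块 = 流方程的欧拉离散"这一观点(示意,1 维情形)。"""
    grid = [i / (n_points - 1) for i in range(n_points)]
    phi = list(grid)
    dt = 1.0 / steps
    for _ in range(steps):
        phi = [x + dt * v(x) for x in phi]
    return grid, phi

# 一个端点不动(v(0)=v(1)=0)的光滑速度场,幅值为示意取值
grid, phi = integrate_warp(lambda x: 0.5 * math.sin(math.pi * x))
```

只要 dt·|v'| < 1,每一步 x ↦ x + dt·v(x) 都是单调映射,复合后仍单调,这正是微分同胚翘曲的离散保证。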

【4】 Winning the CVPR'2021 Kinetics-GEBD Challenge: Contrastive Learning Approach 标题:赢得CVPR'2021 Kinetics-GEBD挑战赛:对比学习方法

作者:Hyolim Kang,Jinwoo Kim,Kyungmin Kim,Taehyun Kim,Seon Joo Kim 机构:Yonsei University, Hyundai Mobis 链接:https://arxiv.org/abs/2106.11549 摘要:一般事件边界检测(Generic Event Boundary Detection,GEBD)是一项新引入的任务,旨在检测与人类自然感知相对应的"一般"事件边界。本文介绍了一种新的基于对比学习的方法来处理GEBD。我们的直觉是,视频片段的特征相似性在事件边界附近会发生显著变化,而在视频的其余部分则保持相对不变。在我们的模型中,时间自相似矩阵(TSM)被用作一个中间表示,扮演信息瓶颈的角色。与给定的基线相比,我们的模型实现了显著的性能提升。我们的代码在https://github.com/hello-jinwoo/LOVEU-CVPR2021. 摘要:Generic Event Boundary Detection (GEBD) is a newly introduced task that aims to detect "general" event boundaries that correspond to natural human perception. In this paper, we introduce a novel contrastive learning based approach to deal with the GEBD. Our intuition is that the feature similarity of the video snippet would significantly vary near the event boundaries, while remaining relatively the same in the remaining part of the video. In our model, Temporal Self-similarity Matrix (TSM) is utilized as an intermediate representation which takes on a role as an information bottleneck. With our model, we achieved significant performance boost compared to the given baselines. Our code is available at https://github.com/hello-jinwoo/LOVEU-CVPR2021.
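文中以相邻片段特征相似度的骤降作为事件边界的信号。下面是这一直觉的极简示意(特征为虚构的二维向量,并非论文的TSM网络实现,只读取自相似矩阵的次对角线):

```python
def cosine(a, b):
    # 两个向量的余弦相似度
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5) + 1e-12
    return num / den

def boundary_scores(features):
    """相邻片段特征相似度骤降处即事件边界候选。
    返回每个相邻对的 1 - 余弦相似度(越大越像边界)。"""
    return [1 - cosine(f1, f2) for f1, f2 in zip(features, features[1:])]

# 虚构的片段特征:第 3 帧与第 4 帧之间发生"事件切换"
feats = [[1, 0], [1, 0.1], [0.9, 0], [0, 1], [0.1, 1]]
scores = boundary_scores(feats)
```

对分数序列取局部极大值(并做阈值化)即可得到边界检测结果。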

【5】 Dive into Deep Learning 标题:深入研究深度学习

作者:Aston Zhang,Zachary C. Lipton,Mu Li,Alexander J. Smola 备注:(HTML) this https URL (GitHub) this https URL 链接:https://arxiv.org/abs/2106.11342 摘要:这本开源的书代表了我们让深度学习变得平易近人的尝试,教读者概念、上下文和代码。整本书是在Jupyter笔记本中起草的,无缝集成了说明图、数学和交互式示例以及自包含的代码。我们的目标是提供一个资源,可以(i)免费提供给每个人;(ii)提供足够的技术深度,为实际成为应用机器学习科学家提供起点;(iii)包括可运行代码,向读者展示如何在实践中解决问题;(iv)允许我们和整个社区快速更新;(v)辅以一个论坛,就技术细节进行互动讨论并回答问题。 摘要:This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code. Our goal is to offer a resource that could (i) be freely available for everyone; (ii) offer sufficient technical depth to provide a starting point on the path to actually becoming an applied machine learning scientist; (iii) include runnable code, showing readers how to solve problems in practice; (iv) allow for rapid updates, both by us and also by the community at large; (v) be complemented by a forum for interactive discussion of technical details and to answer questions.

【6】 Learning-Based Practical Light Field Image Compression Using A Disparity-Aware Model 标题:基于视差感知模型的基于学习的实用光场图像压缩

作者:Mohana Singh,Renu M. Rameshan 机构:School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Himachal Pradesh, India 备注:accepted to Picture Coding Symposium 2021 链接:https://arxiv.org/abs/2106.11558 摘要:光场技术以其许多可能的应用越来越受到研究界的关注。商业全光相机中的透镜阵列有助于在一次曝光中捕获光线的空间和角度信息。虽然由此产生的高维光场数据使其优越的能力,它也阻碍了广泛采用。因此,迫切需要对光场图像进行有效压缩。现有的解决方案通常由几个独立的模块组成,其中一些模块可能不是针对光场数据的特定结构和质量而设计的。这增加了编解码器的复杂性,并导致不实用的解码运行时。提出了一种新的基于学习的视差辅助模型,用于4D光场图像的并行解码压缩。该模型是端到端可训练的,无需手动调整单独的模块,并允许联合学习速率和失真。视差辅助方法保证了重建光场的结构完整性。与现有技术的比较表明,PSNR和MS-SSIM指标的表现令人鼓舞。此外,在编码和解码运行时方面也有显著的改进。源代码位于https://moha23.github.io/LFDAAE. 摘要:Light field technology has increasingly attracted the attention of the research community with its many possible applications. The lenslet array in commercial plenoptic cameras helps capture both the spatial and angular information of light rays in a single exposure. While the resulting high dimensionality of light field data enables its superior capabilities, it also impedes its extensive adoption. Hence, there is a compelling need for efficient compression of light field images. Existing solutions are commonly composed of several separate modules, some of which may not have been designed for the specific structure and quality of light field data. This increases the complexity of the codec and results in impractical decoding runtimes. We propose a new learning-based, disparity-aided model for compression of 4D light field images capable of parallel decoding. The model is end-to-end trainable, eliminating the need for hand-tuning separate modules and allowing joint learning of rate and distortion. The disparity-aided approach ensures the structural integrity of the reconstructed light fields. Comparisons with the state of the art show encouraging performance in terms of PSNR and MS-SSIM metrics. Also, there is a notable gain in the encoding and decoding runtimes. Source code is available at https://moha23.github.io/LFDAAE.

其他(11篇)

【1】 HybVIO: Pushing the Limits of Real-time Visual-inertial Odometry 标题:HybVIO:突破实时视觉惯性里程计的极限

作者:Otto Seiskari,Pekka Rantalankila,Juho Kannala,Jerry Ylilammi,Esa Rahtu,Arno Solin 机构:Aalto University, University of Tampere 链接:https://arxiv.org/abs/2106.11857 摘要:提出了一种将基于滤波的视觉惯性里程计(VIO)与基于优化的SLAM相结合的混合方法HybVIO。该方法的核心是高鲁棒性、独立的VIO算法,具有改进的IMU偏差建模、离群点剔除、平稳性检测和特征轨迹选择,可调整后在嵌入式硬件上运行。长期的一致性通过一个松耦合的SLAM模块来实现。在学术基准测试中,我们的解决方案在所有类别中都具有优异的性能,特别是在实时用例中,我们的性能优于当前最先进的技术。我们还使用自定义数据集证明了在消费级硬件上用VIO进行车辆跟踪的可行性,并与现有的商用VISLAM替代方案进行了比较,显示了良好的性能。 摘要:We present HybVIO, a novel hybrid approach for combining filtering-based visual-inertial odometry (VIO) with optimization-based SLAM. The core of our method is highly robust, independent VIO with improved IMU bias modeling, outlier rejection, stationarity detection, and feature track selection, which is adjustable to run on embedded hardware. Long-term consistency is achieved with a loosely-coupled SLAM module. In academic benchmarks, our solution yields excellent performance in all categories, especially in the real-time use case, where we outperform the current state-of-the-art. We also demonstrate the feasibility of VIO for vehicular tracking on consumer-grade hardware using a custom dataset, and show good performance in comparison to current commercial VISLAM alternatives.
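VIO流水线中的平稳性(静止)检测常用IMU读数的窗口统计量来近似。下面给出一个基于加速度计模长方差的最简示意(并非HybVIO的实际算法,阈值与数据均为示意取值):

```python
def is_stationary(accel_norms, threshold=0.05):
    """用加速度计模长(m/s²)在一个时间窗口内的方差做最简化的静止检测:
    静止时模长应接近重力、只含噪声;运动时模长剧烈波动。"""
    n = len(accel_norms)
    mean = sum(accel_norms) / n
    var = sum((a - mean) ** 2 for a in accel_norms) / n
    return var < threshold

still = [9.81, 9.80, 9.82, 9.81, 9.80]   # 近似静止:只有传感器噪声
moving = [9.81, 11.2, 8.4, 10.9, 7.9]    # 运动中:模长剧烈波动
```

静止判定可用于锁定零速更新(ZUPT)或抑制漂移,这也是许多VIO系统采用类似检测的原因。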

【2】 Evaluation of a Region Proposal Architecture for Multi-task Document Layout Analysis 标题:用于多任务文档布局分析的区域建议架构评估

作者:Lorenzo Quirós,Enrique Vidal 机构:PRHLT Research Center, Universitat Politecnica de Valencia, Camino de Vera, sn, Valencia, Spain 链接:https://arxiv.org/abs/2106.11797 摘要:自动识别手写文档的布局是从这些文档中提取有用信息的一个重要步骤。最常见的应用是为下游应用提供信息,如自动文本识别和关键字定位;但是,对布局的识别也有助于在文档中的元素之间建立关系,从而丰富可以提取的信息。现代文档版面分析系统大多只针对文档版面问题的一部分,即:基线检测或区域分割。相比之下,我们评估了Mask-RCNN架构以集成方式解决基线检测和区域分割问题的有效性。我们给出了两个手写文本数据集和一个手写音乐数据集上的实验结果。所分析的架构产生了有希望的结果,在所有三个数据集上都优于最先进的技术。 摘要:Automatically recognizing the layout of handwritten documents is an important step towards useful extraction of information from those documents. The most common application is to feed downstream applications such as automatic text recognition and keyword spotting; however, the recognition of the layout also helps to establish relationships between elements in the document which allows to enrich the information that can be extracted. Most of the modern document layout analysis systems are designed to address only one part of the document layout problem, namely: baseline detection or region segmentation. In contrast, we evaluate the effectiveness of the Mask-RCNN architecture to address the problem of baseline detection and region segmentation in an integrated manner. We present experimental results on two handwritten text datasets and one handwritten music dataset. The analyzed architecture yields promising results, outperforming state-of-the-art techniques in all three datasets.

【3】 A Review of the Vision-based Approaches for Dietary Assessment 标题:基于视觉的膳食评价方法综述

作者:Ghalib Tahir,Chu Kiong Loo 机构:Department of Artificial Intelligence, University of Malaya, Kuala Lumpur, Malaysia 链接:https://arxiv.org/abs/2106.11776 摘要:与饮食有关的问题,如肥胖,是当今世界日益关注的问题。如果目前的趋势继续下去,很有可能生活质量总体上受到显著影响,因为肥胖与高血压、血糖水平不规律和心脏病发作风险增加等其他慢性疾病有关。造成这些问题的主要原因是不良的生活方式选择和不健康的饮食习惯,重点放在少数食物组,如糖、脂肪和碳水化合物。在这方面,基于计算机的食物识别提供了基于视觉的自动方法来评估饮食摄入,帮助人们做出更健康的选择。因此,本文简要回顾了基于视觉的食品识别方法,包括它们的准确性、性能,以及使用流行的食品数据库来评估现有模型。这项工作还旨在突出这一领域今后的挑战。建议开展新的高质量研究,制定标准基准,并使用持续学习方法进行食品识别。 摘要:Dietary-related problems such as obesity are a growing concern in today's modern world. If the current trend continues, it is most likely that the quality of life, in general, is significantly affected since obesity is associated with other chronic diseases such as hypertension, irregular blood sugar levels, and increased risk of heart attacks. The primary cause of these problems is poor lifestyle choices and unhealthy dietary habits, with emphasis on a select few food groups such as sugars, fats, and carbohydrates. In this regard, computer-based food recognition offers automatic visual-based methods to assess dietary intake and help people make healthier choices. Thus, the following paper presents a brief review of visual-based methods for food recognition, including their accuracy, performance, and the use of popular food databases to evaluate existing models. The work further aims to highlight future challenges in this area. New high-quality studies for developing standard benchmarks and using continual learning methods for food recognition are recommended.

【4】 Trinity: A No-Code AI platform for complex spatial datasets 标题:Trinity:一个面向复杂空间数据集的无码人工智能平台

作者:C. V. Krishnakumar Iyer,Feili Hou,Henry Wang,Yonghong Wang,Kay Oh,Swetava Ganguli,Vipul Pandey 备注:12 pages, Submitted to SIGSPATIAL '21 链接:https://arxiv.org/abs/2106.11756 摘要:我们提出了一个名为Trinity的无代码人工智能(AI)平台,其主要设计目标是使机器学习研究人员和非技术地理空间领域专家能够对特定领域的信号和数据集进行实验,以自行解决各种复杂问题。这种解决不同问题的多功能性是通过转换复杂的时空数据集来实现的,使它们能够被标准的深度学习模型(在本例中是卷积神经网络(CNN))所使用,并提供以标准方式(例如语义分割)来描述不同问题的能力。凭借直观的用户界面、承载复杂功能工程衍生产品的功能商店、深度学习内核和可扩展的数据处理机制,Trinity为领域专家提供了一个强大的平台,让他们与科学家和工程师共享解决关键业务问题的舞台。它通过标准化模型构建和部署,实现了快速原型设计、快速实验和缩短生产时间。在本文中,我们介绍了Trinity及其设计背后的动机,并展示了示例应用程序,以鼓励降低使用AI的门槛。 摘要:We present a no-code Artificial Intelligence (AI) platform called Trinity with the main design goal of enabling both machine learning researchers and non-technical geospatial domain experts to experiment with domain-specific signals and datasets for solving a variety of complex problems on their own. This versatility to solve diverse problems is achieved by transforming complex Spatio-temporal datasets to make them consumable by standard deep learning models, in this case, Convolutional Neural Networks (CNNs), and giving the ability to formulate disparate problems in a standard way, eg. semantic segmentation. With an intuitive user interface, a feature store that hosts derivatives of complex feature engineering, a deep learning kernel, and a scalable data processing mechanism, Trinity provides a powerful platform for domain experts to share the stage with scientists and engineers in solving business-critical problems. It enables quick prototyping, rapid experimentation and reduces the time to production by standardizing model building and deployment. In this paper, we present our motivation behind Trinity and its design along with showcasing sample applications to motivate the idea of lowering the bar to using AI.

【5】 Gait analysis with curvature maps: A simulation study 标题:曲率图步态分析的仿真研究

作者:Khac Chinh Tran,Marc Daniel,Jean Meunier 机构:Information Technology Faculty, Danang University of Technology and Science, Danang, Vietnam, Laboratoire d’Informatique et des Systemes, Aix-Marseille Universit´e, Marseille, France, Department of Computer Science, University of Montr´eal, Montr´eal, Canada 备注:4 pages, 5 figures 链接:https://arxiv.org/abs/2106.11466 摘要:步态分析是临床研究的一个重要方面,用于检测神经和肌肉骨骼疾病,评估患者的整体健康状况。在本文中,我们将注意力集中在从深度相机提供的体表提取相关曲率信息上。我们假设3D网格在前一步中可用,并演示了曲率图如何有助于评估不对称异常,与正常步态相比,使用两个简单的模拟异常步态。这项研究为将来开发一个基于曲率的步态分析系统奠定了基础。 摘要:Gait analysis is an important aspect of clinical investigation for detecting neurological and musculoskeletal disorders and assessing the global health of a patient. In this paper we propose to focus our attention on extracting relevant curvature information from the body surface provided by a depth camera. We assumed that the 3D mesh was made available in a previous step and demonstrated how curvature maps could be useful to assess asymmetric anomalies with two simple simulated abnormal gaits compared with a normal one. This research set the grounds for the future development of a curvature-based gait analysis system for healthcare professionals.
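曲率图的基本原料是体表的离散曲率估计。下面用三点Menger曲率(4×三角形面积/三边长乘积,等于过三点外接圆半径的倒数)给出一个极简示意(并非论文所用的网格曲率图算法):

```python
import math

def menger_curvature(p1, p2, p3):
    """三点的 Menger 曲率:共线时为 0,落在半径 R 的圆上时为 1/R。"""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    a, b, c = dist(p1, p2), dist(p2, p3), dist(p1, p3)
    # 鞋带公式求三角形面积
    area = abs((p2[0] - p1[0]) * (p3[1] - p1[1])
               - (p3[0] - p1[0]) * (p2[1] - p1[1])) / 2
    if a * b * c == 0:
        return 0.0
    return 4 * area / (a * b * c)

k_circle = menger_curvature((1, 0), (0, 1), (-1, 0))  # 单位圆上三点,曲率应为 1
k_line = menger_curvature((0, 0), (1, 0), (2, 0))     # 共线三点,曲率为 0
```

在三角网格上,对每个顶点的邻域做类似的局部拟合,即可得到文中所说的曲率图。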

【6】 Normalized Avatar Synthesis Using StyleGAN and Perceptual Refinement 标题:基于StyleGAN和感知细化的归一化化身合成

作者:Huiwen Luo,Koki Nagano,Han-Wei Kung,Mclean Goldwhite,Qingguo Xu,Zejian Wang,Lingyu Wei,Liwen Hu,Hao Li 机构:Pinscreen 备注:Accepted to CVPR 2021 链接:https://arxiv.org/abs/2106.11423 摘要:我们介绍了一个高度健壮的基于GAN的框架,用于从单个无约束照片数字化一个人的标准化3D化身。虽然输入图像可以是一个微笑的人或采取在极端的光照条件下,我们的方法可以可靠地产生一个高品质的纹理模型,一个人的脸在中立的表情和皮肤纹理漫射光照条件下。先进的三维人脸重建方法使用非线性变形人脸模型与基于GAN的解码器相结合来捕捉人物的相似性和细节,但无法生成具有非阴影反照率纹理的中性头部模型,这对于创建可重新照明和动画友好的化身以集成到虚拟环境中至关重要。现有方法面临的主要挑战是缺乏包含标准化三维人脸的训练和地面真实数据。我们提出了一个两阶段的方法来解决这个问题。首先,在StyleGAN2网络中嵌入一个非线性可变形人脸模型,采用一种鲁棒性很强的归一化三维人脸生成器。这允许我们生成详细但规范化的面部资产。该推断随后是一个感知细化步骤,该步骤使用生成的资源作为正则化,以处理有限的可用归一化人脸训练样本。我们进一步介绍了一个标准化的人脸数据集,该数据集由组合摄影测量扫描、精心挑选的照片和在漫射光照条件下生成具有中性表情的假人组成。虽然我们准备的数据集包含的主题比最新的基于GAN的三维人脸重建方法少两个数量级,但是我们表明,对于非常具有挑战性的无约束输入图像,可以生成高质量的标准化人脸模型,并且表现出比当前最先进的技术更好的性能。 摘要:We introduce a highly robust GAN-based framework for digitizing a normalized 3D avatar of a person from a single unconstrained photo. While the input image can be of a smiling person or taken in extreme lighting conditions, our method can reliably produce a high-quality textured model of a person's face in neutral expression and skin textures under diffuse lighting condition. Cutting-edge 3D face reconstruction methods use non-linear morphable face models combined with GAN-based decoders to capture the likeness and details of a person but fail to produce neutral head models with unshaded albedo textures which is critical for creating relightable and animation-friendly avatars for integration in virtual environments. The key challenges for existing methods to work is the lack of training and ground truth data containing normalized 3D faces. We propose a two-stage approach to address this problem. First, we adopt a highly robust normalized 3D face generator by embedding a non-linear morphable face model into a StyleGAN2 network. This allows us to generate detailed but normalized facial assets. 
This inference is then followed by a perceptual refinement step that uses the generated assets as regularization to cope with the limited available training samples of normalized faces. We further introduce a Normalized Face Dataset, which consists of a combination photogrammetry scans, carefully selected photographs, and generated fake people with neutral expressions in diffuse lighting conditions. While our prepared dataset contains two orders of magnitude less subjects than cutting edge GAN-based 3D facial reconstruction methods, we show that it is possible to produce high-quality normalized face models for very challenging unconstrained input images, and demonstrate superior performance to the current state-of-the-art.

【7】 Mapping Slums with Medium Resolution Satellite Imagery: a Comparative Analysis of Multi-Spectral Data and Grey-level Co-occurrence Matrix Techniques 标题:利用中分辨率卫星影像绘制贫民窟图:多光谱数据与灰度共生矩阵技术的比较分析

作者:Agatha C. H. de Mattos,Gavin McArdle,Michela Bertolotto 备注:Accepted at the 3rd Workshop on Artificial Intelligence for Social Good (IJCAI 2021) 链接:https://arxiv.org/abs/2106.11395 摘要:联合国人居署估计,全世界有超过10亿人生活在贫民窟。然而,探测贫民区位置的最先进技术采用高分辨率卫星图像,获取和处理成本高昂。因此,研究人员开始考虑利用免费和开放获取的中分辨率卫星图像。然而,对于哪种数据准备和机器学习方法最适合用于此类图像数据,目前还没有明确的共识。在本文中,我们在一个由带标注的Sentinel-2图像(空间分辨率10米)组成的开放获取数据集上评估了两种技术:多光谱数据和灰度共生矩阵特征提取。这两种技术都与典型相关森林分类器配对。结果表明,在全部四个城市中,灰度共生矩阵均优于多光谱数据:其对贫民窟类别的平均准确率为97%,平均交并比为94%,而多光谱数据分别为75%和64%。这些结果表明,分辨率至少为10米的开放获取卫星图像可能适合跟踪发展目标,例如城市贫民窟的探测。 摘要:The UN-Habitat estimates that over one billion people live in slums around the world. However, state-of-the-art techniques to detect the location of slum areas employ high-resolution satellite imagery, which is costly to obtain and process. As a result, researchers have started to look at utilising free and open-access medium resolution satellite imagery. Yet, there is no clear consensus on which data preparation and machine learning approaches are the most appropriate to use with such imagery data. In this paper, we evaluate two techniques (multi-spectral data and grey-level co-occurrence matrix feature extraction) on an open-access dataset consisting of labelled Sentinel-2 images with a spatial resolution of 10 meters. Both techniques were paired with a canonical correlation forests classifier. The results show that the grey-level co-occurrence matrix performed better than multi-spectral data for all four cities. It had an average accuracy for the slum class of 97% and a mean intersection over union of 94%, while multi-spectral data had 75% and 64% for the respective metrics. These results indicate that open-access satellite imagery with a resolution of at least 10 meters may be suitable for keeping track of development goals such as the detection of slums in cities.
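灰度共生矩阵(GLCM)统计给定像素偏移下灰度对的出现频次,再从中导出对比度等纹理特征。下面是一个极简的Python示意(图像与灰度级数均为虚构的小例子,实际工作中通常使用scikit-image等库):

```python
def glcm(img, levels, dx=1, dy=0):
    """计算偏移 (dx, dy) 的灰度共生矩阵(未归一化)。
    img: 二维列表,元素为 [0, levels) 内的灰度级。"""
    m = [[0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < h and 0 <= c2 < w:
                m[img[r][c]][img[r2][c2]] += 1
    return m

def contrast(m):
    """GLCM 的对比度纹理特征:sum_{i,j} (i-j)^2 * p(i,j)。"""
    total = sum(sum(row) for row in m) or 1
    return sum((i - j) ** 2 * v / total
               for i, row in enumerate(m) for j, v in enumerate(row))

img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [2, 2, 3, 3],
       [2, 2, 3, 3]]
g = glcm(img, levels=4)
c = contrast(g)
```

对多个偏移/方向分别计算GLCM,再把对比度、同质性、能量等特征拼接起来,便构成文中输入分类器的纹理特征向量。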

【8】 BEyond observation: an approach for ObjectNav 标题:超越观测:ObjectNav的一种方法

作者:Daniel V. Ruiz,Eduardo Todt 机构:Department of Informatics, Federal University of Paraná, Curitiba, Paraná, Brazil 备注:Presented at the 2nd Embodied AI Workshop at CVPR 2021 链接:https://arxiv.org/abs/2106.11379 摘要:随着自动化的兴起,无人驾驶汽车无论是作为商用产品还是作为科研课题,都成为一个热门话题。它构成了机器人学的一个多学科领域,涵盖嵌入式系统、控制理论、路径规划、同步定位与建图(SLAM)、场景重建和模式识别。在这项工作中,我们探索性地研究了传感器数据融合和最新的机器学习算法如何执行被称为视觉语义导航的具身人工智能(E-AI)任务。这个任务,又称目标导航(ObjectNav),是指在不事先了解环境的情况下,通过以自我为中心的视觉观察来到达属于目标语义类的对象的自主导航。我们的方法在Habitat Challenge 2021 ObjectNav的Minival阶段和Test-Standard阶段均获得第四名。 摘要:With the rise of automation, unmanned vehicles became a hot topic both as commercial products and as a scientific research topic. It composes a multi-disciplinary field of robotics that encompasses embedded systems, control theory, path planning, Simultaneous Localization and Mapping (SLAM), scene reconstruction, and pattern recognition. In this work, we present our exploratory research of how sensor data fusion and state-of-the-art machine learning algorithms can perform the Embodied Artificial Intelligence (E-AI) task called Visual Semantic Navigation. This task, a.k.a Object-Goal Navigation (ObjectNav) consists of autonomous navigation using egocentric visual observations to reach an object belonging to the target semantic class without prior knowledge of the environment. Our method reached fourth place on the Habitat Challenge 2021 ObjectNav on the Minival phase and the Test-Standard Phase.

【9】 Photozilla: A Large-Scale Photography Dataset and Visual Embedding for 20 Photography Styles 标题:Photozilla:一个大规模摄影数据集和20种摄影样式的视觉嵌入

Authors: Trisha Singhal, Junhua Liu, Lucienne T. M. Blessing, Kwan Hui Lim
Affiliations: Singapore University of Technology and Design, Singapore; Forth AI, Singapore
Note: In the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021 (Poster)
Link: https://arxiv.org/abs/2106.11359
Abstract: The advent of social media platforms has been a catalyst for the development of digital photography, engendering a boom in vision applications. With this motivation, we introduce a large-scale dataset termed "Photozilla", which includes over 990k images belonging to 10 different photographic styles. The dataset is used to train 3 classification models that automatically classify images into the relevant style, achieving an accuracy of ~96%. With the rapid evolution of digital photography, new photography styles are emerging at an exponential rate. On that account, we present a novel Siamese-based network that uses the trained classification models as its base architecture to adapt to and classify unseen styles from only 25 training samples, reaching an accuracy of over 68% in identifying 10 other distinct types of photography styles. The dataset is available at https://trisha025.github.io/Photozilla/
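The paper's Siamese network adapts to unseen styles from 25 labelled examples by comparing images in a learned embedding space. The general mechanism can be sketched with a prototype variant of that comparison: average each class's support embeddings and assign a query to the nearest class prototype. The embeddings below are random stand-ins for a network's output, and all names are illustrative, not the paper's code.

```python
import numpy as np

def nearest_prototype(query, support_embeddings, support_labels):
    """Few-shot classification in embedding space: average each class's support
    embeddings into a prototype, then return the label of the closest prototype."""
    labels = sorted(set(support_labels))
    protos = np.stack([
        np.mean([e for e, l in zip(support_embeddings, support_labels) if l == lab], axis=0)
        for lab in labels
    ])
    dists = np.linalg.norm(protos - query, axis=1)
    return labels[int(np.argmin(dists))]

rng = np.random.default_rng(0)
# Two hypothetical styles, five 16-d support embeddings per class,
# drawn from well-separated clusters to stand in for a trained encoder.
support = [rng.normal(loc=0.0, size=16) for _ in range(5)] + \
          [rng.normal(loc=3.0, size=16) for _ in range(5)]
labels = ["macro"] * 5 + ["aerial"] * 5
query = rng.normal(loc=3.0, size=16)  # drawn near the second cluster
pred = nearest_prototype(query, support, labels)
```

A pairwise Siamese head would instead score (query, support) pairs individually, but the decision rule, nearest labelled neighbourhood in embedding space, is the same idea.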

【10】 Analysis and Tuning of a Voice Assistant System for Dysfluent Speech

Authors: Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou, Sachin Kajarekar, Jefferey Bigham
Affiliation: Apple, Cupertino, CA, USA
Note: 5 pages plus 1 page of references, 2 figures
Link: https://arxiv.org/abs/2106.11759
Abstract: Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice-operated systems do not work. Current speech recognition systems are trained primarily on data from fluent speakers and consequently do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. This work focuses on quantitative analysis of a consumer speech recognition system on individuals who stutter, and on production-oriented approaches for improving performance on common voice assistant tasks (e.g., "what is the weather?"). At baseline, the system introduces a significant number of insertion and substitution errors, resulting in intended speech Word Error Rates (isWER) that are 13.64% worse (absolute) for individuals with fluency disorders. We show that simply tuning the decoding parameters of an existing hybrid speech recognition system improves isWER by 24% (relative) for individuals with fluency disorders. Tuning these parameters also yields 3.6% better domain recognition and 1.7% better intent recognition relative to the default setup, across the 18 study participants spanning all stuttering severities.
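The isWER figures above are word error rates scored against the intended (rather than produced) transcript; WER itself is word-level edit distance normalised by reference length. A minimal sketch of the metric, not the paper's evaluation code:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# A stutter-like word repetition shows up as insertions against the intended words.
wer = word_error_rate("what is the weather", "what what is is the weather")
```

Here the two repeated words count as two insertions against a four-word intended utterance, which is exactly the kind of error pattern the paper reports the baseline recogniser producing for dysfluent speech.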

【11】 Image simulation for space applications with the SurRender software

Authors: Jérémy Lebreton, Roland Brochard, Matthieu Baudry, Grégory Jonniaux, Adrien Hadj Salah, Keyvan Kanani, Matthieu Le Goff, Aurore Masson, Nicolas Ollagnier, Paolo Panicucci, Amsha Proag, Cyril Robin
Affiliation: Airbus Defence & Space, rue des Cosmonautes, Toulouse
Note: 11th International ESA Conference on Guidance, Navigation & Control Systems, 22-25 June 2021; 16 pages, 8 figures
Link: https://arxiv.org/abs/2106.11322
Abstract: Image processing algorithms for vision-based navigation require reliable image simulation capacities. In this paper we explain why traditional rendering engines may present limitations that are potentially critical for space applications. We introduce Airbus SurRender software v7 and detail the features that make it a very powerful space image simulator. We show how SurRender is at the heart of the development process of our computer vision solutions, and we provide a series of illustrations of rendered images for use cases ranging from Moon and Solar System exploration to in-orbit rendezvous and planetary robotics.

Originally published 2021-06-23 via the WeChat account arXiv每日学术速递, shared through the Tencent Cloud self-media programme. For takedown requests, contact cloudcommunity@tencent.com.
