
计算机视觉学术速递[6.22]

公众号-arXiv每日学术速递
发布2021-07-02 18:29:09

访问www.arxivdaily.com获取含摘要速递,涵盖CS|物理|数学|经济|统计|金融|生物|电气领域,更有搜索、收藏、发帖等功能!点击阅读原文即可访问

cs.CV 方向,今日共计122篇

Transformer(3篇)

【1】 OadTR: Online Action Detection with Transformers 标题:OadTR:基于Transformer的在线动作检测

作者:Xiang Wang,Shiwei Zhang,Zhiwu Qing,Yuanjie Shao,Zhengrong Zuo,Changxin Gao,Nong Sang 机构:Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China, DAMO Academy, Alibaba Group, China 备注:Code is available at this https URL 链接:https://arxiv.org/abs/2106.11149 摘要:最新的在线行为检测方法倾向于使用递归神经网络(RNN)来捕捉长程时间结构。然而,RNN存在非平行性和梯度消失等问题,因此很难进行优化。针对这些问题,本文提出了一种新的基于Transformer的编解码框架OadTR。带有任务标记的编码器旨在捕获历史观测之间的关系和全局交互。解码器通过聚合预期的未来片段表示来提取辅助信息。因此,OadTR可以通过同时编码历史信息和预测未来上下文来识别当前动作。我们在三个具有挑战性的数据集:HDD、TVSeries和THUMOS14上广泛地评估了提议的OadTR。实验结果表明,OadTR比现有的基于RNN的方法具有更高的训练和推理速度,并且在mAP和mcAP方面都显著优于现有的方法。代码位于https://github.com/wangxiang1230/OadTR. 摘要:Most recent approaches for online action detection tend to apply Recurrent Neural Network (RNN) to capture long-range temporal structure. However, RNN suffers from non-parallelism and gradient vanishing, hence it is hard to be optimized. In this paper, we propose a new encoder-decoder framework based on Transformers, named OadTR, to tackle these problems. The encoder attached with a task token aims to capture the relationships and global interactions between historical observations. The decoder extracts auxiliary information by aggregating anticipated future clip representations. Therefore, OadTR can recognize current actions by encoding historical information and predicting future context simultaneously. We extensively evaluate the proposed OadTR on three challenging datasets: HDD, TVSeries, and THUMOS14. The experimental results show that OadTR achieves higher training and inference speeds than current RNN based approaches, and significantly outperforms the state-of-the-art methods in terms of both mAP and mcAP. Code is available at https://github.com/wangxiang1230/OadTR.
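
下面给出一个按摘要描述拼接的极简PyTorch示意(非论文官方实现;特征维度、层数、类别数与未来片段数均为假设值),用于说明"带任务token的编码器 + 聚合未来片段表示的解码器"这一结构:

```python
# 示意代码:OadTR风格的编码-解码结构(非官方实现,维度与超参数均为假设)
import torch
import torch.nn as nn

class OadTRSketch(nn.Module):
    """编码器附带任务token建模历史片段特征,解码器用可学习查询聚合预期的未来片段表示,二者拼接后分类。"""
    def __init__(self, feat_dim=1024, d_model=256, num_classes=22,
                 num_future=8, n_layers=3, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.task_token = nn.Parameter(torch.zeros(1, 1, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.future_queries = nn.Parameter(torch.zeros(1, num_future, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.classifier = nn.Linear(d_model * 2, num_classes)

    def forward(self, hist_feats):              # hist_feats: (B, T, feat_dim) 历史片段特征
        x = self.proj(hist_feats)
        token = self.task_token.expand(x.size(0), -1, -1)
        enc_out = self.encoder(torch.cat([token, x], dim=1))    # (B, 1+T, d)
        hist_repr = enc_out[:, 0]                               # 任务token汇聚历史信息
        queries = self.future_queries.expand(x.size(0), -1, -1)
        fut = self.decoder(queries, enc_out).mean(dim=1)        # 聚合预期的未来片段表示
        return self.classifier(torch.cat([hist_repr, fut], dim=1))

logits = OadTRSketch()(torch.randn(2, 64, 1024))   # -> (2, 22)
```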

【2】 More than Encoder: Introducing Transformer Decoder to Upsample 标题:不仅仅是编码器:将Transformer解码器引入上采样

作者:Yijiang Li,Wentian Cai,Ying Gao,Xiping Hu 机构:South China University of Technology, Guangzhou, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing 备注:19 pages, 7 figures 链接:https://arxiv.org/abs/2106.10637 摘要:一般的分割模型是先对图像进行下采样,然后再进行上采样,以恢复像素级预测的分辨率。在这种模式中,上采样技术对于维护信息以获得更好的性能至关重要。在本文中,我们提出了一种新的上采样方法,即注意上采样(AU),它可以作为一般的上采样方法,并可以应用到任何具有横向连接的分割模型中。AU利用像素级关注模型的长程依赖性和全局信息来实现更好的重建。它由注意解码器(AD)和双线性上采样作为残差连接来补充上采样特征。AD采用了Transformer译码器的思想,根据收缩路径的局部细节信息对特征进行上采样。此外,考虑到像素级注意的大内存和计算开销,我们进一步提出用窗口注意机制来限制局部窗口的注意计算,而不是全局窗口。结合窗口注意,我们将解码器表示为窗口注意解码器(WAD),将上采样方法表示为窗口注意上采样(WAU)。我们在经典的U网结构上测试了我们的方法,通过横向连接从收缩路径传递信息,并在Synapse(80.30 DSC和23.12 HD)和MSD Brain(74.75 DSC)数据集上实现了最先进的性能。 摘要:General segmentation models downsample images and then upsample to restore resolution for pixel level prediction. In such schema, upsample technique is vital in maintaining information for better performance. In this paper, we present a new upsample approach, Attention Upsample (AU), that could serve as general upsample method and be incorporated into any segmentation model that possesses lateral connections. AU leverages pixel-level attention to model long range dependency and global information for better reconstruction. It consists of Attention Decoder (AD) and bilinear upsample as residual connection to complement the upsampled features. AD adopts the idea of decoder from transformer which upsamples features conditioned on local and detailed information from contracting path. Moreover, considering the extensive memory and computation cost of pixel-level attention, we further propose to use window attention scheme to restrict attention computation in local windows instead of global range. Incorporating window attention, we denote our decoder as Window Attention Decoder (WAD) and our upsample method as Window Attention Upsample (WAU). We test our method on classic U-Net structure with lateral connection to deliver information from contracting path and achieve state-of-the-arts performance on Synapse (80.30 DSC and 23.12 HD) and MSD Brain (74.75 DSC) datasets.
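
以下是按摘要思路写的注意力上采样(AU)示意模块(非官方实现):查询来自收缩路径的侧向高分辨率特征,键/值来自低分辨率解码特征,双线性上采样作为残差;为简洁起见省略了窗口划分(即WAU中的窗口注意力),维度均为假设值。

```python
# 示意代码:AU = 注意力解码器(AD) + 双线性上采样残差(非官方实现)
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionUpsampleSketch(nn.Module):
    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, low, lateral):
        # low:     (B, C, h, w)  解码路径的低分辨率特征
        # lateral: (B, C, H, W)  收缩路径的高分辨率侧向特征(提供局部细节,作为query)
        B, C, H, W = lateral.shape
        up = F.interpolate(low, size=(H, W), mode='bilinear', align_corners=False)
        q = lateral.flatten(2).transpose(1, 2)          # (B, H*W, C)
        kv = low.flatten(2).transpose(1, 2)             # (B, h*w, C)
        attn_out, _ = self.attn(q, kv, kv)              # 像素级注意力重建上采样特征
        attn_out = attn_out.transpose(1, 2).reshape(B, C, H, W)
        return up + attn_out                            # 双线性上采样作为残差连接

out = AttentionUpsampleSketch()(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 32, 32))
```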

【3】 Exploring Vision Transformers for Fine-grained Classification 标题:探索用于细粒度分类的视觉转换器

作者:Marcos V. Conde,Kerem Turgutlu 机构:Universidad de Valladolid, University of San Francisco 备注:4 pages, 5 figures, 4 tables. Published in The Eighth Workshop on Fine-Grained Visual Categorization, for code see this https URL, for workshop papers see this https URL 链接:https://arxiv.org/abs/2106.10587 摘要:现有的计算机视觉分类研究由于类内方差高,类间方差低,难以实现细粒度属性识别。SOTA方法通过定位信息量最大的图像区域来解决这一难题,并依靠这些区域对完整图像进行分类。最近的工作visiontransformer(ViT)显示了它在传统和细粒度分类任务中的强大性能。在这项工作中,我们提出了一个用于细粒度图像分类任务的多阶段ViT框架,该框架利用固有的多头自注意机制,在不需要结构改变的情况下定位信息丰富的图像区域。我们还引入了注意力引导的增强来提高模型的能力。我们通过四个流行的细粒度基准测试来证明我们的方法的价值:CUB-200-2011、斯坦福汽车、斯坦福狗和FGVC7植物病理学。我们还通过定性结果证明了模型的可解释性。 摘要:Existing computer vision research in categorization struggles with fine-grained attributes recognition due to the inherently high intra-class variances and low inter-class variances. SOTA methods tackle this challenge by locating the most informative image regions and rely on them to classify the complete image. The most recent work, Vision Transformer (ViT), shows its strong performance in both traditional and fine-grained classification tasks. In this work, we propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes using the inherent multi-head self-attention mechanism. We also introduce attention-guided augmentations for improving the model's capabilities. We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology. We also prove our model's interpretability via qualitative results.

检测相关(12篇)

【1】 Domain and Modality Gaps for LiDAR-based Person Detection on Mobile Robots 标题:基于LiDAR的移动机器人人物检测的域和模态间隙

作者:Dan Jia,Alexander Hermans,Bastian Leibe 机构:Visual Computing Institute, RWTH Aachen 链接:https://arxiv.org/abs/2106.11239 摘要:人体检测是移动机器人在人类居住环境中导航的一项重要任务,而激光雷达传感器由于具有精确的深度测量和大视场的特点,在这项任务中具有广阔的应用前景。本文研究了现有的基于LiDAR的人探测器,特别关注移动机器人场景(如服务机器人或社交机器人),与驾驶场景相比,在移动机器人场景中,人被观察的频率更高,距离更近。我们使用最近发布的JackRabbot数据集和基于3D或2D激光雷达传感器(CenterPoint和DR-SPAAM)的最新探测器进行了一系列实验。这些实验围绕驾驶和移动机器人场景之间的领域差距,以及三维和二维激光雷达传感器之间的模态差距展开。对于领域差距,我们的目的是了解在驱动数据集上预训练的检测器是否能在移动机器人场景中获得良好的性能,目前还没有现成的训练模型。对于模态间隙,我们从性能、运行时间、定位精度、对距离和拥挤度的鲁棒性等多个方面对使用3D或2D激光雷达的探测器进行了比较。我们的实验结果为基于激光雷达的人体检测提供了实用的见解,并为相关的移动机器人设计和应用提供了明智的决策。 摘要:Person detection is a crucial task for mobile robots navigating in human-populated environments and LiDAR sensors are promising for this task, given their accurate depth measurements and large field of view. This paper studies existing LiDAR-based person detectors with a particular focus on mobile robot scenarios (e.g. service robot or social robot), where persons are observed more frequently and in much closer ranges, compared to the driving scenarios. We conduct a series of experiments, using the recently released JackRabbot dataset and the state-of-the-art detectors based on 3D or 2D LiDAR sensors (CenterPoint and DR-SPAAM respectively). These experiments revolve around the domain gap between driving and mobile robot scenarios, as well as the modality gap between 3D and 2D LiDAR sensors. For the domain gap, we aim to understand if detectors pretrained on driving datasets can achieve good performance on the mobile robot scenarios, for which there are currently no trained models readily available. For the modality gap, we compare detectors that use 3D or 2D LiDAR, from various aspects, including performance, runtime, localization accuracy, robustness to range and crowdedness. The results from our experiments provide practical insights into LiDAR-based person detection and facilitate informed decisions for relevant mobile robot designs and applications.

【2】 Temporal Early Exits for Efficient Video Object Detection 标题:基于时间提前退出的高效视频对象检测

作者:Amin Sabet,Jonathon Hare,Bashir Al-Hashimi,Geoff V. Merrett 机构:School of Electronics and Computer Science, University of Southampton, UK, King’s College London, UK 链接:https://arxiv.org/abs/2106.11208 摘要:在资源受限的情况下,将基于图像的目标检测器迁移到视频领域仍然是一个挑战。以前的工作利用光流来传播不变的特征,然而,当处理监控等应用中变化非常缓慢的场景时,开销相当大。在本文中,我们提出了时间提前退出(temporal early exits)机制来降低每帧视频目标检测的计算复杂度。在主干网络的早期层插入多个计算开销很低的时间提前退出模块,以识别连续帧之间的语义差异。仅当该帧被识别为相对先前帧发生语义变化时才需要完整计算;否则,将重用来自前一帧的检测结果。在CDnet上的实验表明,与现有方法相比,我们的方法将每帧视频目标检测的计算复杂度和执行开销最多降低了34倍,而mAP仅有2.2%的可接受下降。 摘要:Transferring image-based object detectors to the domain of video remains challenging under resource constraints. Previous efforts utilised optical flow to allow unchanged features to be propagated, however, the overhead is considerable when working with very slowly changing scenes from applications such as surveillance. In this paper, we propose temporal early exits to reduce the computational complexity of per-frame video object detection. Multiple temporal early exit modules with low computational overhead are inserted at early layers of the backbone network to identify the semantic differences between consecutive frames. Full computation is only required if the frame is identified as having a semantic change to previous frames; otherwise, detection results from previous frames are reused. Experiments on CDnet show that our method significantly reduces the computational complexity and execution of per-frame video object detection up to 34× compared to existing methods with an acceptable reduction of 2.2% in mAP.
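
其推理流程可以用如下控制逻辑示意(非官方代码,early_layers、exit_module、full_detector 等模块名均为假设):

```python
# 示意代码:时间提前退出的逐帧推理流程(非官方实现)
import torch

@torch.no_grad()
def detect_video(frames, early_layers, exit_module, full_detector, thresh=0.5):
    """frames: 帧张量序列; early_layers: 主干的早期层; exit_module: 输出"语义变化概率"的小网络;
    full_detector: 完整检测网络。以上模块名均为示意假设。"""
    prev_shallow, prev_dets, results = None, None, []
    for frame in frames:
        shallow = early_layers(frame)                         # 只跑主干的早期层,开销很低
        if prev_shallow is not None:
            change_prob = exit_module(shallow, prev_shallow)  # 估计相邻帧的语义差异
            if change_prob < thresh:
                results.append(prev_dets)                     # 语义未变化:直接复用上一帧结果
                prev_shallow = shallow
                continue
        prev_dets = full_detector(frame)                      # 语义变化:才执行完整检测
        results.append(prev_dets)
        prev_shallow = shallow
    return results
```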

【3】 SODA10M: Towards Large-Scale Object Detection Benchmark for Autonomous Driving 标题:SODA10M:面向自动驾驶的大规模目标检测基准

作者:Jianhua Han,Xiwen Liang,Hang Xu,Kai Chen,Lanqing Hong,Chaoqiang Ye,Wei Zhang,Zhenguo Li,Chunjing Xu,Xiaodan Liang 机构: was 1 Huawei Noah’s Ark Lab 2 Sun Yat-Sen University 3 Hong Kong University of Science and Technology 链接:https://arxiv.org/abs/2106.11118 摘要:为了建立一个真实的、不断发展的、可扩展的自主驾驶系统,我们提出了一个大规模的基准,通过从原始数据中学习来标准化各种自监督和半监督方法的评估,这是迄今为止第一个也是最大的基准。现有的自动驾驶系统严重依赖于使用大量注释数据训练的“完美”视觉感知模型(如检测),以确保安全性。然而,在部署强大的自动驾驶系统时,对所有场景和情况(如夜间、极端天气、城市)的实例进行精心标注是不现实的。基于近年来自监督和半监督学习的强大发展,一个很有前途的方向是通过协作利用大量未标记数据和少量标记数据来学习一个鲁棒的检测模型。现有的数据集(如KITTI、Waymo)要么只提供少量的数据,要么只覆盖有限的域并进行了完整的注释,阻碍了大规模预训练模型的探索。在这里,我们发布了一个用于自动驾驶的大规模目标检测基准,名为SODA10M,包含1000万张未标记图像和20K张标记有6个代表性目标类别的图像。为了提高多样性,在32个不同的城市,在不同的天气条件、时段和地点场景下,每帧每10秒采集一次图像。我们提供了广泛的实验和深入的分析,现有的监督最先进的检测模型,流行的自我监督和半监督的方法,以及一些关于如何开发未来模型的见解。有关数据和更多最新信息已在https://soda-2d.github.io. 摘要:Aiming at facilitating a real-world, ever-evolving and scalable autonomous driving system, we present a large-scale benchmark for standardizing the evaluation of different self-supervised and semi-supervised approaches by learning from raw data, which is the first and largest benchmark to date. Existing autonomous driving systems heavily rely on `perfect' visual perception models (e.g., detection) trained using extensive annotated data to ensure the safety. However, it is unrealistic to elaborately label instances of all scenarios and circumstances (e.g., night, extreme weather, cities) when deploying a robust autonomous driving system. Motivated by recent powerful advances of self-supervised and semi-supervised learning, a promising direction is to learn a robust detection model by collaboratively exploiting large-scale unlabeled data and few labeled data. Existing dataset (e.g., KITTI, Waymo) either provides only a small amount of data or covers limited domains with full annotation, hindering the exploration of large-scale pre-trained models. Here, we release a Large-Scale Object Detection benchmark for Autonomous driving, named as SODA10M, containing 10 million unlabeled images and 20K images labeled with 6 representative object categories. To improve diversity, the images are collected every ten seconds per frame within 32 different cities under different weather conditions, periods and location scenes. We provide extensive experiments and deep analyses of existing supervised state-of-the-art detection models, popular self-supervised and semi-supervised approaches, and some insights about how to develop future models. The data and more up-to-date information have been released at https://soda-2d.github.io.

【4】 Obstacle Detection for BVLOS Drones 标题:BVLOS无人机的障碍物检测

作者:Jan Moros Esteban 机构:Jaap van de Loosdrecht 备注:7 pages, 7 figures, Supervisors: Maya Aghaei Gavari and Jaap van de Loosdrecht 链接:https://arxiv.org/abs/2106.11098 摘要:随着欧盟新法规的出台,超视距无人机的未来必将绽放。这导致了theBEAST项目的创建,该项目旨在创建一种自主安全无人机,重点放在这些法规和安全上。本技术文件描述了该项目中模块的第一步,该模块围绕探测障碍物展开,以便在故障安全着陆时避免障碍物。我们研究了一种基于深度学习的目标检测方法,并进行了各种实验,如比较了各种数据增强技术或YOLOv3和YOLOv5。根据实验结果,我们得出结论,虽然目标检测是解决这一问题的一种很有前途的方法,但是在实际应用中需要更多的数据。 摘要:With the introduction of new regulations in the European Union, the future of Beyond Visual Line Of Sight (BVLOS) drones is set to bloom. This led to the creation of the theBEAST project, which aims to create an autonomous security drone, with focus on those regulations and on safety. This technical paper describes the first steps of a module within this project, which revolves around detecting obstacles so they can be avoided in a fail-safe landing. A deep learning powered object detection method is the subject of our research, and various experiments are held to maximize its performance, such as comparing various data augmentation techniques or YOLOv3 and YOLOv5. According to the results of the experiments, we conclude that although object detection is a promising approach to resolve this problem, more volume of data is required for potential usage in a real-life application.

【5】 Hard hat wearing detection based on head keypoint localization 标题:基于头部关键点定位的安全帽佩戴检测

作者:Bartosz Wójcik,Mateusz Żarski,Kamil Książek,Jarosław Adam Miszczak,Mirosław Jan Skibniewski 机构:Jan Skibniewskid,b,e,f, Institute of Theoretical and Applied Informatics, Polish Academy of Sciences, Bałtycka ,-, Gliwice, Poland, Electronics and Computer Science, Silesian University of Technology, Akademicka ,-, Gliwice, Poland 备注:15 pages, 9 figures and 9 tables 链接:https://arxiv.org/abs/2106.10944 摘要:近年来,基于视觉的施工现场安全系统中的深度学习方法受到了广泛关注,尤其是个人防护装备。然而,尽管如此,仍然没有可靠的方法来确定工人和他们的安全帽之间的关系。为了解决这一问题,本文提出了一种结合深度学习、目标检测和头部关键点定位的方法,并结合简单的基于规则的推理。在测试中,该方法超越了以往基于不同实例的相对包围盒位置以及直接检测戴头盔者和未戴头盔者的方法。结果表明,将新的深度学习方法与基于规则的人性化解释系统相结合,可以得到既可靠又能成功模拟人工现场监督的解决方案。这项工作是发展完全自主的建筑工地安全系统的下一步,表明这方面仍有改进的余地。 摘要:In recent years, a lot of attention is paid to deep learning methods in the context of vision-based construction site safety systems, especially regarding personal protective equipment. However, despite all this attention, there is still no reliable way to establish the relationship between workers and their hard hats. To answer this problem a combination of deep learning, object detection and head keypoint localization, with simple rule-based reasoning is proposed in this article. In tests, this solution surpassed the previous methods based on the relative bounding box position of different instances, as well as direct detection of hard hat wearers and non-wearers. The results show that the conjunction of novel deep learning methods with humanly-interpretable rule-based systems can result in a solution that is both reliable and can successfully mimic manual, on-site supervision. This work is the next step in the development of fully autonomous construction site safety systems and shows that there is still room for improvement in this area.
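
摘要中"检测 + 头部关键点 + 规则推理"的关联步骤,可以用如下简单规则示意(非论文实现,阈值与坐标均为虚构示例):

```python
# 示意代码:用"头部关键点是否落在某个安全帽框内(或其附近)"这一规则判断佩戴情况
def point_in_box(pt, box, margin=0.0):
    x, y = pt
    x1, y1, x2, y2 = box
    return (x1 - margin) <= x <= (x2 + margin) and (y1 - margin) <= y <= (y2 + margin)

def assign_hardhat_wearing(head_keypoints, hardhat_boxes, margin=10):
    """head_keypoints: [(x, y), ...] 每个工人的头部关键点; hardhat_boxes: [(x1, y1, x2, y2), ...] 安全帽检测框"""
    return [any(point_in_box(pt, box, margin) for box in hardhat_boxes)
            for pt in head_keypoints]

# 虚构示例:第一个人佩戴安全帽,第二个人未佩戴
print(assign_hardhat_wearing([(55, 40), (200, 35)], [(30, 20, 80, 60)]))  # [True, False]
```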

【6】 Interpretable Face Manipulation Detection via Feature Whitening 标题:基于特征白化的可解释人脸操作检测

作者:Yingying Hua,Daichi Zhang,Pengju Wang,Shiming Ge 机构: 20 2 1) mostly fo- 1Institute of Information Engineering, China 2School of Cyber Security 链接:https://arxiv.org/abs/2106.10834 摘要:为什么我们要相信深层神经网络对被操纵人脸的检测?了解其原因对于提高检测模型的公平性、可靠性、隐私性和可信度具有重要意义。在这项工作中,我们提出了一种可解释的人脸操作检测方法来实现可信和准确的推断。该方法通过嵌入特征增白模块,使人脸检测过程透明化。本模块旨在通过特征去相关和特征约束来白化深层网络的内部工作机制。实验结果表明,该方法能够在检测精度和模型可解释性之间取得平衡。 摘要:Why should we trust the detections of deep neural networks for manipulated faces? Understanding the reasons is important for users in improving the fairness, reliability, privacy and trust of the detection models. In this work, we propose an interpretable face manipulation detection approach to achieve the trustworthy and accurate inference. The approach could make the face manipulation detection process transparent by embedding the feature whitening module. This module aims to whiten the internal working mechanism of deep networks through feature decorrelation and feature constraint. The experimental results demonstrate that our proposed approach can strike a balance between the detection accuracy and the model interpretability.
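
摘要中的"特征去相关"思想可以用一个简单的ZCA白化来示意(这只是去相关的一种常见做法,并非论文的特征白化模块):

```python
# 示意代码:对一个 batch 的中间层特征做 ZCA 白化,使各通道近似去相关、方差归一
import torch

def zca_whiten(feats, eps=1e-5):
    """feats: (N, D) 一个 batch 的特征,返回白化后的特征。"""
    x = feats - feats.mean(dim=0, keepdim=True)
    cov = x.t() @ x / (x.size(0) - 1)                      # (D, D) 协方差
    eigvals, eigvecs = torch.linalg.eigh(cov)              # 对称矩阵特征分解
    inv_sqrt = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.t()
    return x @ inv_sqrt                                    # 去相关 + 方差归一

w = zca_whiten(torch.randn(128, 64))
print((w.t() @ w / 127 - torch.eye(64)).abs().max())       # 接近 0,说明通道已近似去相关
```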

【7】 3D Object Detection for Autonomous Driving: A Survey 标题:自动驾驶中的三维目标检测:综述

作者:Rui Qian,Xin Lai,Xirong Li 机构:Renmin University of China, Beijing, China 备注:3D object detection, Autonomous driving, Point clouds 链接:https://arxiv.org/abs/2106.10823 摘要:自动驾驶被认为是最有前途的补救措施之一,以保护人类免受严重的碰撞。为此,三维目标检测是这种感知系统的核心基础,特别是在路径规划、运动预测、碰撞避免等方面。通常,具有相应三维点云的立体或单目图像已经是三维目标检测的标准布局,其中点云越来越普遍,提供了准确的深度信息。尽管已有的研究成果,但由于点云本身的高度稀疏性和不规则性,以及相机视角与激光雷达鸟瞰视角之间的错位导致的模态协同效应、远距离遮挡和尺度变化等原因,点云三维目标检测仍处于起步阶段,随着大量文献的研究,三维目标检测已经取得了长足的进展。因此,我们全面回顾了该领域的最新进展,包括传感器、基础知识和最新最先进的检测方法及其优缺点。此外,我们还引入了度量标准,并对流行的公共数据集进行了定量比较。在对调查作品进行深入分析之后,将明智地确定未来工作的途径。最后,对本文进行了总结。 摘要:Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of such perception system especially for the sake of path planning, motion prediction, collision avoidance, etc. Generally, stereo or monocular images with corresponding 3D point clouds are already standard layout for 3D object detection, out of which point clouds are increasingly prevalent with accurate depth information being provided. Despite existing efforts, 3D object detection on point clouds is still in its infancy due to high sparseness and irregularity of point clouds by nature, misalignment view between camera view and LiDAR bird's eye of view for modality synergies, occlusions and scale variations at long distances, etc. Recently, profound progress has been made in 3D object detection, with a large body of literature being investigated to address this vision task. As such, we present a comprehensive review of the latest progress in this field covering all the main topics including sensors, fundamentals, and the recent state-of-the-art detection methods with their pros and cons. Furthermore, we introduce metrics and provide quantitative comparisons on popular public datasets. The avenues for future work are going to be judiciously identified after an in-deep analysis of the surveyed works. Finally, we conclude this paper.

【8】 Automated Deepfake Detection 标题:自动深伪检测

作者:Ping Liu 机构:Institute of High Performance Computing 链接:https://arxiv.org/abs/2106.10705 摘要:在这篇论文中,我们提出利用自动机器学习(AutoML)来自动搜索用于深度伪造(deepfake)检测的网络架构。与以往的工作不同,我们的方法既得益于深度学习的强大能力,又免去了人工设计网络所需的高昂人力成本。实验证明,该方法不仅优于以往的非深度学习方法,而且预测精度与以往的深度学习方法相当甚至更好。为了提高该方法的通用性,特别是当训练数据和测试数据被不同方法篡改时,我们在网络学习过程中引入了一种多任务策略,使其既能估计给定样本中潜在的篡改区域,又能预测样本是否真实。与以往使用相似策略的工作相比,我们的方法对先验知识的依赖要小得多,例如不需要知道使用了哪种篡改方法以及是否进行过篡改。在两个基准数据集上的大量实验结果证明了该方法在深度伪造检测上的有效性。 摘要:In this paper, we propose to utilize Automated Machine Learning to automatically search architecture for deepfake detection. Unlike previous works, our method benefits from the superior capability of deep learning while relieving us from the high labor cost in the manual network design process. It is experimentally proved that our proposed method not only outperforms previous non-deep learning methods but achieves comparable or even better prediction accuracy compared to previous deep learning methods. To improve the generality of our method, especially when training data and testing data are manipulated by different methods, we propose a multi-task strategy in our network learning process, making it estimate potential manipulation regions in given samples as well as predict whether the samples are real. Comparing to previous works using similar strategies, our method depends much less on prior knowledge, such as no need to know which manipulation method is utilized and whether it is utilized already. Extensive experimental results on two benchmark datasets demonstrate the effectiveness of our proposed method on deepfake detection.

【9】 Plant Disease Detection Using Image Processing and Machine Learning 标题:基于图像处理和机器学习的植物病害检测

作者:Pranesh Kulkarni,Atharva Karwande,Tejas Kolhe,Soham Kamble,Akshay Joshi,Medha Wyawahare 机构: Department of Electronics and Telecommunication, Vishwakarma Institute of Technology, Pune, India. 链接:https://arxiv.org/abs/2106.10698 摘要:在农业实践中,一项重要而繁琐的任务是检测作物上的病害。它需要大量的时间和熟练的劳动力。利用计算机视觉和机器学习技术,提出了一种智能高效的作物病害检测技术。该系统可检测5种常见植物的20种病害,准确率达93%。 摘要:One of the important and tedious task in agricultural practices is the detection of the disease on crops. It requires huge time as well as skilled labor. This paper proposes a smart and efficient technique for detection of crop disease which uses computer vision and machine learning techniques. The proposed system is able to detect 20 different diseases of 5 common plants with 93% accuracy.

【10】 Humble Teachers Teach Better Students for Semi-Supervised Object Detection 标题:谦虚的教师教会更好的学生进行半监督目标检测

作者:Yihe Tang,Weifeng Chen,Yijun Luo,Yuting Zhang 机构:† Carnegie Mellon University, ‡ Amazon Web Services 备注:CVPR 2021 camera-ready. Code: this https URL 链接:https://arxiv.org/abs/2106.10456 摘要:我们提出了一种基于师生双模型框架的半监督方法。该方法的特点是:1)采用指数移动平均策略从学生在线更新教师;2)使用大量的区域建议和软伪标签作为学生的训练目标;3)使用一个轻加权的检测特定数据集合,为教师生成更可靠的伪标签。与最近最先进的STAC(在稀疏选择的硬伪样本上使用硬标签)相比,我们模型中的教师在许多提议上使用软标签向学生展示更丰富的信息。当使用VOC12作为未标记数据时,我们的模型在voc07val集上实现了53.04%的COCO风格AP,比STAC好8.4%。在MS-COCO上,当只有一小部分数据作为标记时,它的性能优于以前的工作。在MS-COCO测试设备上,通过利用与标记数据大小相似的未标记数据,其AP也达到53.8%,比完全监督的ResNet-152级联R-CNN有3.1%的增益。 摘要:We propose a semi-supervised approach for contemporary object detectors following the teacher-student dual model framework. Our method is featured with 1) the exponential moving averaging strategy to update the teacher from the student online, 2) using plenty of region proposals and soft pseudo-labels as the student's training targets, and 3) a light-weighted detection-specific data ensemble for the teacher to generate more reliable pseudo-labels. Compared to the recent state-of-the-art -- STAC, which uses hard labels on sparsely selected hard pseudo samples, the teacher in our model exposes richer information to the student with soft-labels on many proposals. Our model achieves COCO-style AP of 53.04% on VOC07 val set, 8.4% better than STAC, when using VOC12 as unlabeled data. On MS-COCO, it outperforms prior work when only a small percentage of data is taken as labeled. It also reaches 53.8% AP on MS-COCO test-dev with 3.1% gain over the fully supervised ResNet-152 Cascaded R-CNN, by tapping into unlabeled data of a similar size to the labeled data.
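
摘要中"教师由学生的指数移动平均(EMA)在线更新"一步可以示意如下(非官方实现,此处用线性层代替检测网络仅作演示):

```python
# 示意代码:师生框架中的 EMA 教师更新
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)   # teacher = m*teacher + (1-m)*student

# 用法示意:每个训练 step 先由教师在未标注图像上产生大量候选框的软伪标签监督学生,再更新教师
student = torch.nn.Linear(8, 4)            # 仅作演示,实际应为检测网络
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
# ... 学生经过一次梯度更新后:
ema_update(teacher, student)
```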

【11】 AdaZoom: Adaptive Zoom Network for Multi-Scale Object Detection in Large Scenes 标题:AdaZoom:用于大场景多尺度目标检测的自适应缩放网络

作者:Jingtao Xu,Yali Li,Shengjin Wang 机构:Department of Electronic Engineering, Tsinghua University 链接:https://arxiv.org/abs/2106.10409 摘要:大规模场景中的目标检测是一个具有挑战性的问题,因为目标很小,而且尺度变化很大。聚焦于包含小目标的图像区域十分必要。本文提出了一种新的自适应变焦(AdaZoom)网络作为选择性放大镜,具有灵活的形状和焦距,可以自适应地对目标检测的聚焦区域进行变焦。基于策略梯度,我们构造了一个焦点区域生成的强化学习框架,奖励由目标分布构造。生成区域的尺度和宽高比与内部对象的尺度和分布相适应。我们根据区域的大小采用可变放大倍数进行自适应多尺度检测。我们进一步提出了协作训练,以补充提高AdaZoom和检测网络的性能。为了验证其有效性,我们在VisDrone2019、UAVDT和DOTA数据集上进行了大量实验。实验表明,AdaZoom对不同的检测网络有着一致而显著的改进,在这些数据集上取得了最先进的性能,尤其是在VisDrone2019上以4.64%的AP优势超过了现有方法。 摘要:Detection in large-scale scenes is a challenging problem due to small objects and extreme scale variation. It is essential to focus on the image regions of small objects. In this paper, we propose a novel Adaptive Zoom (AdaZoom) network as a selective magnifier with flexible shape and focal length to adaptively zoom the focus regions for object detection. Based on policy gradient, we construct a reinforcement learning framework for focus region generation, with the reward formulated by object distributions. The scales and aspect ratios of the generated regions are adaptive to the scales and distribution of objects inside. We apply variable magnification according to the scale of the region for adaptive multi-scale detection. We further propose collaborative training to complementarily promote the performance of AdaZoom and the detection network. To validate the effectiveness, we conduct extensive experiments on VisDrone2019, UAVDT, and DOTA datasets. The experiments show AdaZoom brings a consistent and significant improvement over different detection networks, achieving state-of-the-art performance on these datasets, especially outperforming the existing methods by AP of 4.64% on Vis-Drone2019.

【12】 Implementing a Detection System for COVID-19 based on Lung Ultrasound Imaging and Deep Learning 标题:基于肺部超声成像和深度学习的冠状病毒检测系统的实现

作者:Carlos Rojas-Azabache,Karen Vilca-Janampa,Renzo Guerrero-Huayta,Dennis Núñez-Fernández 机构: Universidad Nacional de Ingenier´ıa, Lima, Peru 备注:Beyond Fairness Workshop at CVPR 2021 链接:https://arxiv.org/abs/2106.10651 摘要:COVID-19大流行始于2019年12月的中国,并迅速蔓延至多个国家。这一流行病的后果不可估量,造成数百万人死亡,破坏全球经济。为了大规模控制这一流行病,需要快速检测和治疗病人的工具。因此,由于没有准确和自动化的工具,对诊断COVID-19的替代工具的需求急剧增加。在本文中,我们提出了正在进行的工作,系统检测COVID-19使用超声成像和使用深度学习技术。此外,这样一个系统是在树莓Pi上实现的,以使其便于携带和在没有互联网连接的偏远地区使用。 摘要:The COVID-19 pandemic started in China in December 2019 and quickly spread to several countries. The consequences of this pandemic are incalculable, causing the death of millions of people and damaging the global economy. To achieve large-scale control of this pandemic, fast tools for detection and treatment of patients are needed. Thus, the demand for alternative tools for the diagnosis of COVID-19 has increased dramatically since accurated and automated tools are not available. In this paper we present the ongoing work on a system for COVID-19 detection using ultrasound imaging and using Deep Learning techniques. Furthermore, such a system is implemented on a Raspberry Pi to make it portable and easy to use in remote regions without an Internet connection.

分类|识别相关(15篇)

【1】 The Arm-Swing Is Discriminative in Video Gait Recognition for Athlete Re-Identification 标题:摆臂在运动员再识别视频步态识别中的判别性

作者:Yapkan Choi,Yeshwanth Napolean,Jan C. van Gemert 机构:Computer Vision Lab, Delft University of Technology, Delft, The Netherlands 备注:ICIP 2021 链接:https://arxiv.org/abs/2106.11280 摘要:本文将跑步步态作为长跑项目中视频人再识别的一个属性进行评价。结果表明,在跨摄像机检索任务中,与基于外观的方法相比,跑步步态识别具有更好的性能,而且步态和外观特征是互补的。对于步态,由于躯干区域的模糊性,当使用二元步态轮廓时,跑步期间的手臂摆动不易区分。我们建议使用人类的语义分析来创建部分步态轮廓,其中躯干被忽略。省略躯干可以让手臂摆动在正面和斜视角下更明显,从而提高识别结果,这暗示了手臂摆动有点私人化。实验表明,与使用全身轮廓相比,CampusRun上的mAP提高了3.2%,CASIA-B上的前后视图的精确度提高了4.8%。 摘要:In this paper we evaluate running gait as an attribute for video person re-identification in a long-distance running event. We show that running gait recognition achieves competitive performance compared to appearance-based approaches in the cross-camera retrieval task and that gait and appearance features are complementary to each other. For gait, the arm swing during running is less distinguishable when using binary gait silhouettes, due to ambiguity in the torso region. We propose to use human semantic parsing to create partial gait silhouettes where the torso is left out. Leaving out the torso improves recognition results by allowing the arm swing to be more visible in the frontal and oblique viewing angles, which offers hints that arm swings are somewhat personal. Experiments show an increase of 3.2% mAP on the CampusRun and increased accuracy with 4.8% in the frontal and rear view on CASIA-B, compared to using the full body silhouettes.

【2】 TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification 标题:TNT:用于Few-Shot视频分类的带直推式推理的文本条件网络

作者:Andrés Villa,Juan-Manuel Perez-Rua,Vladimir Araujo,Juan Carlos Niebles,Victor Escorcia,Alvaro Soto 机构:Pontificia Universidad Cat´olica de Chile,Facebook AI, Samsung AI Center Cambridge,Stanford University,KU Leuven 备注:10 pages including references, 7 figures, and 4 tables 链接:https://arxiv.org/abs/2106.11173 摘要:近年来,很少有镜头学习受到越来越多的关注。现有的工作主要集中在图像分类上,很少有人致力于解决更具挑战性的Few-Shot视频分类问题。这些尝试旨在有效地利用视频中的时间维度,以便在低数据状态下更好地学习。然而,他们在很大程度上忽略了视频的一个关键特征,即视频通常伴随着丰富的文本描述,而这一特征对于Few-Shot识别至关重要。在本文中,我们首次提出在训练少量镜头视频分类模型时,利用这些人类提供的文本描述作为特权信息。具体来说,我们制定了一个基于文本的任务调节器,以适应视频特征的少数镜头学习任务。我们的模型遵循了一个transductive设置,其中可以使用查询样本和支持文本描述来更新支持集类原型,以进一步提高模型的任务适应能力。在Few-Shot视频动作分类中,我们的模型在四个具有挑战性的基准上取得了最先进的性能。 摘要:Recently, few-shot learning has received increasing interest. Existing efforts have been focused on image classification, with very few attempts dedicated to the more challenging few-shot video classification problem. These few attempts aim to effectively exploit the temporal dimension in videos for better learning in low data regimes. However, they have largely ignored a key characteristic of video which could be vital for few-shot recognition, that is, videos are often accompanied by rich text descriptions. In this paper, for the first time, we propose to leverage these human-provided textual descriptions as privileged information when training a few-shot video classification model. Specifically, we formulate a text-based task conditioner to adapt video features to the few-shot learning task. Our model follows a transductive setting where query samples and support textual descriptions can be used to update the support set class prototype to further improve the task-adaptation ability of the model. Our model obtains state-of-the-art performance on four challenging benchmarks in few-shot video action classification.
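
摘要中"用查询样本更新支持集类原型"的直推式思想,可以用如下通用的原型软更新来示意(省略了文本条件器,非论文的具体实现):

```python
# 示意代码:直推式原型更新——支持样本(硬标签)与查询样本(软标签)共同刷新类原型
import torch
import torch.nn.functional as F

def refine_prototypes(support, support_labels, query, n_way, n_iter=3, tau=10.0):
    """support: (Ns, D), query: (Nq, D) 视频特征;support_labels: (Ns,) 取值 0..n_way-1"""
    one_hot = F.one_hot(support_labels, n_way).float()             # (Ns, n_way)
    protos = one_hot.t() @ support / one_hot.sum(0).unsqueeze(1)   # 初始原型 = 各类支持样本均值
    for _ in range(n_iter):
        logits = -torch.cdist(query, protos) * tau                 # 查询样本到各原型的距离
        soft = logits.softmax(dim=1)                               # (Nq, n_way) 软标签
        weights = torch.cat([one_hot, soft], dim=0)                # 支持(硬标签) + 查询(软标签)
        feats = torch.cat([support, query], dim=0)
        protos = weights.t() @ feats / weights.sum(0).unsqueeze(1)
    return protos

protos = refine_prototypes(torch.randn(25, 128), torch.arange(5).repeat(5),
                           torch.randn(30, 128), n_way=5)
```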

【3】 Classification of Documents Extracted from Images with Optical Character Recognition Methods 标题:用光学字符识别方法对从图像中提取的文档进行分类

作者:Omer Aydin 备注:None 链接:https://arxiv.org/abs/2106.11125 摘要:在过去的十年里,机器学习方法给了我们无人驾驶汽车、语音识别、有效的网络搜索以及对人类基因组的更好理解。机器学习在今天是如此普遍,以至于它一天被使用几十次,可能是在不知不觉中。试着教机器一些过程或某些情况,可以让他们预测一些人脑难以预测的结果。这些方法也帮助我们在短时间内完成一些人类活动通常不可能或难以完成的操作。基于这些原因,机器学习在今天是如此重要。在这项研究中,两种不同的机器学习方法相结合。为了解决现实世界中的问题,手稿文件首先被传送到计算机,然后被分类。整个过程采用了三种基本的实现方法。手写或打印的文件已被扫描仪或数码相机数字化。这些文件已处理两种不同的光学字符识别(OCR)操作。然后利用朴素贝叶斯算法对生成的文本进行分类。所有项目都是在Windows操作系统上的microsoftvisualstudio12平台上进行编程的。研究的所有部分都使用了C#编程语言。此外,还使用了一些准备好的代码和DLL。 摘要:Over the past decade, machine learning methods have given us driverless cars, voice recognition, effective web search, and a much better understanding of the human genome. Machine learning is so common today that it is used dozens of times a day, possibly unknowingly. Trying to teach a machine some processes or some situations can make them predict some results that are difficult to predict by the human brain. These methods also help us do some operations that are often impossible or difficult to do with human activities in a short time. For these reasons, machine learning is so important today. In this study, two different machine learning methods were combined. In order to solve a real-world problem, the manuscript documents were first transferred to the computer and then classified. We used three basic methods to realize the whole process. Handwriting or printed documents have been digitalized by a scanner or digital camera. These documents have been processed with two different Optical Character Recognition (OCR) operation. After that generated texts are classified by using Naive Bayes algorithm. All project was programmed in Microsoft Visual Studio 12 platform on Windows operating system. C# programming language was used for all parts of the study. Also, some prepared codes and DLLs were used.
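
"OCR文本 → 朴素贝叶斯分类"这一步的思路可以用Python/scikit-learn等价示意(原研究使用C#实现;以下语料与类别均为虚构示例):

```python
# 示意代码:对 OCR 产生的文本做词袋特征 + 朴素贝叶斯分类
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

ocr_texts = ["invoice total amount due", "dear sir thank you for your letter",
             "meeting agenda project schedule", "payment received invoice number"]
labels = ["invoice", "letter", "agenda", "invoice"]          # 虚构的文档类别

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(ocr_texts, labels)                                   # 训练:词袋特征 + 朴素贝叶斯
print(clf.predict(["please find the invoice amount"]))       # -> ['invoice']
```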

【4】 Paradigm selection for Data Fusion of SAR and Multispectral Sentinel data applied to Land-Cover Classification 标题:用于土地覆盖分类的SAR与多光谱哨兵数据融合范例选择

作者:Alessandro Sebastianelli,Maria Pia Del Rosso,Pierre Philippe Mathieu,Silvia Liberata Ullo 机构: University of Sannio 备注:This work has been submitted to the IEEE Geoscience and Remote Sensing Letters for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 链接:https://arxiv.org/abs/2106.11056 摘要:数据融合是一项众所周知的技术,在人工智能对地观测(AI4EO)领域得到越来越广泛的应用,主要是因为它能够通过组合多个数据源来加强AI4EO的应用,从而产生更好的结果。另一方面,与其他卫星数据分析方法一样,由于人工智能(AI)的集成,数据融合本身也在受益和发展。本文分析并实现了四种基于卷积神经网络的数据融合方法。其目的是在确定了CNN的基本结构之后,为选择最佳的数据融合框架提供一个系统的过程,从而得到最佳的分类结果,并在涉及数据融合应用于遥感时帮助感兴趣的研究人员进行工作。该方法已在土地覆被分类中得到验证,但也可以推广到其他情况。 摘要:Data fusion is a well-known technique, becoming more and more popular in the Artificial Intelligence for Earth Observation (AI4EO) domain mainly due to its ability of reinforcing AI4EO applications by combining multiple data sources and thus bringing better results. On the other hand, like other methods for satellite data analysis, data fusion itself is also benefiting and evolving thanks to the integration of Artificial Intelligence (AI). In this letter, four data fusion paradigms, based on Convolutional Neural Networks (CNNs), are analyzed and implemented. The goals are to provide a systematic procedure for choosing the best data fusion framework, resulting in the best classification results, once the basic structure for the CNN has been defined, and to help interested researchers in their work when data fusion applied to remote sensing is involved. The procedure has been validated for land-cover classification but it can be transferred to other cases.

【5】 SHREC 2021: Track on Skeleton-based Hand Gesture Recognition in the Wild 标题:SHREC 2021:真实场景下基于骨架的手势识别赛道

作者:Ariel Caputo,Andrea Giachetti,Simone Soso,Deborah Pintani,Andrea D'Eusanio,Stefano Pini,Guido Borghi,Alessandro Simoni,Roberto Vezzani,Rita Cucchiara,Andrea Ranieri,Franca Giannini,Katia Lupinetti,Marina Monti,Mehran Maghoumi,Joseph J. LaViola Jr,Minh-Quan Le,Hai-Dang Nguyen,Minh-Triet Tran 机构:Department of Computer Science, University of Verona, Italy, b Università di Modena e Reggio Emilia, Dipartimento di Ingegneria "Enzo Ferrari", Dipartimento di Informatica - Scienza e Ingegneria 备注:12 pages, to be published on Computers & Graphics 链接:https://arxiv.org/abs/2106.10980 摘要:手势识别是一种基本的工具,可以在各种应用场景中实现新的交互模式,如混合现实环境、非接触公共信息亭、娱乐系统等。如今,手势识别可以直接从低成本跟踪器(Ultraleap)和MR耳机(Hololens、Oculus Quest)提供的软件或视频处理软件模块(例如Google Mediapipe)估计的手骨架流中执行。尽管最近在骨骼的手势和动作识别方面取得了一些进展,但目前的最新技术在识别大量异构手势的真实场景中的表现如何尚不清楚,因为许多基准测试不测试在线识别,并且使用的词典有限。这推动了shrec2021的提案:野外基于骨架的手势识别跟踪。在本次比赛中,我们创建了一个新的数据集,其中包含具有不同类型和持续时间的异构手势。这些手势必须在在线识别场景中的序列中找到。本文介绍了比赛的结果,展示了四个研究小组提出的技术在挑战性任务上的表现,并与简单的基线方法进行了比较。 摘要:Gesture recognition is a fundamental tool to enable novel interaction paradigms in a variety of application scenarios like Mixed Reality environments, touchless public kiosks, entertainment systems, and more. Recognition of hand gestures can be nowadays performed directly from the stream of hand skeletons estimated by software provided by low-cost trackers (Ultraleap) and MR headsets (Hololens, Oculus Quest) or by video processing software modules (e.g. Google Mediapipe). Despite the recent advancements in gesture and action recognition from skeletons, it is unclear how well the current state-of-the-art techniques can perform in a real-world scenario for the recognition of a wide set of heterogeneous gestures, as many benchmarks do not test online recognition and use limited dictionaries. This motivated the proposal of the SHREC 2021: Track on Skeleton-based Hand Gesture Recognition in the Wild. For this contest, we created a novel dataset with heterogeneous gestures featuring different types and duration. These gestures have to be found inside sequences in an online recognition scenario. This paper presents the result of the contest, showing the performances of the techniques proposed by four research groups on the challenging task compared with a simple baseline method.

【6】 Leveraging Conditional Generative Models in a General Explanation Framework of Classifier Decisions 标题:在分类器决策的一般解释框架中利用条件生成模型

作者:Martin Charachon,Paul-Henry Cournède,Céline Hudelot,Roberto Ardon 机构:Ardon, Incepto Medical, France, MICS, Universit´e Paris-Saclay, CentraleSup´elec, France 链接:https://arxiv.org/abs/2106.10947 摘要:为分类器的决策提供一个人类可以理解的解释,对于在日常任务中使用分类器产生信任已成为当务之急。尽管许多工作已经通过生成视觉解释图来解决这个问题,但是它们经常提供噪声和不准确的结果,迫使使用与所讨论的分类器无关的启发式正则化。在本文中,我们提出了一个新的一般视角的视觉解释问题克服这些限制。我们证明了通过两个特定的条件生成模型得到的两幅图像之间的差异可以产生视觉解释。这两种生成模型都是使用分类器来解释的,并使用数据库来实现以下特性:(i)第一个生成器生成的所有图像分类与输入图像相似,而第二个生成器的输出分类则相反(ii)生成的图像属于真实图像的分布(iii)输入图像和相应生成图像之间的距离最小,因此生成元素之间的差异仅揭示所研究分类器的相关信息。利用对称约束和循环约束,我们给出了一般公式的两种不同的近似和实现。在实验上,我们在三个不同的公共数据集上证明了与最新技术相比的显著改进。特别地,影响分类器的区域的定位与人类注释是一致的。 摘要:Providing a human-understandable explanation of classifiers' decisions has become imperative to generate trust in their use for day-to-day tasks. Although many works have addressed this problem by generating visual explanation maps, they often provide noisy and inaccurate results forcing the use of heuristic regularization unrelated to the classifier in question. In this paper, we propose a new general perspective of the visual explanation problem overcoming these limitations. We show that visual explanation can be produced as the difference between two generated images obtained via two specific conditional generative models. Both generative models are trained using the classifier to explain and a database to enforce the following properties: (i) All images generated by the first generator are classified similarly to the input image, whereas the second generator's outputs are classified oppositely. (ii) Generated images belong to the distribution of real images. (iii) The distances between the input image and the corresponding generated images are minimal so that the difference between the generated elements only reveals relevant information for the studied classifier. Using symmetrical and cyclic constraints, we present two different approximations and implementations of the general formulation. Experimentally, we demonstrate significant improvements w.r.t the state-of-the-art on three different public data sets. In particular, the localization of regions influencing the classifier is consistent with human annotations.

【7】 Cross-layer Navigation Convolutional Neural Network for Fine-grained Visual Classification 标题:用于细粒度视觉分类的跨层导航卷积神经网络

作者:Chenyu Guo,Jiyang Xie,Kongming Liang,Xian Sun,Zhanyu Ma 机构: Pattern Recognition and Intelligent System Lab., Beijing University of Posts and Telecommunications, Beijing, China, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China 备注:5 pages, 3 figures 链接:https://arxiv.org/abs/2106.10920 摘要:细粒度视觉分类(FGVC)的目的是对同一超类中的子类对象(如鸟类、汽车模型)进行分类。对于FGVC任务,最基本的解决方法是从局部区域中找出目标的细微差别信息。传统的FGVC模型倾向于使用精细化的特征,即高层次的语义信息进行识别,很少使用低层次的信息。然而,事实证明,包含丰富细节信息的低层信息也有助于提高性能。因此,本文提出了跨层导航卷积神经网络进行特征融合。首先,将骨干网提取的特征映射从高到低依次送入卷积长短时记忆模型中进行特征聚合。然后,在特征融合后利用注意机制提取空间和通道信息,同时将高层语义信息和底层纹理特征联系起来,从而更好地定位FGVC的判别区域。实验中使用了CUB-200-2011、Stanford Cars和FGVC-Aircraft三个常用的FGVC数据集进行评价,通过与其他FGVC方法的比较,证明了该方法的优越性,表明该方法取得了较好的效果。 摘要:Fine-grained visual classification (FGVC) aims to classify sub-classes of objects in the same super-class (e.g., species of birds, models of cars). For the FGVC tasks, the essential solution is to find discriminative subtle information of the target from local regions. Traditional FGVC models preferred to use the refined features, i.e., high-level semantic information for recognition and rarely use low-level information. However, it turns out that low-level information which contains rich detail information also has effect on improving performance. Therefore, in this paper, we propose cross-layer navigation convolutional neural network for feature fusion. First, the feature maps extracted by the backbone network are fed into a convolutional long short-term memory model sequentially from high-level to low-level to perform feature aggregation. Then, attention mechanisms are used after feature fusion to extract spatial and channel information while linking the high-level semantic information and the low-level texture features, which can better locate the discriminative regions for the FGVC. In the experiments, three commonly used FGVC datasets, including CUB-200-2011, Stanford-Cars, and FGVC-Aircraft datasets, are used for evaluation and we demonstrate the superiority of the proposed method by comparing it with other referred FGVC methods to show that this method achieves superior results.

【8】 An End-to-End Khmer Optical Character Recognition using Sequence-to-Sequence with Attention 标题:基于注意力机制的端到端序列到序列高棉文光学字符识别

作者:Rina Buoy,Sokchea Kor,Nguonly Taing 机构:†Techo Startup Center (TSC), ‡Royal University of Phnom Penh (RUPP) 备注:16 pages, 18 figures 链接:https://arxiv.org/abs/2106.10875 摘要:提出了一种用于高棉语光学字符识别(OCR)任务的端到端深度卷积递归神经网络方法。该方案采用了一种带注意机制的序列到序列(Seq2Seq)结构。编码器通过剩余卷积块层和选通递归单元(GRU)层从输入文本行图像中提取视觉特征。这些特征被编码在单个上下文向量和一系列隐藏状态中,这些隐藏状态被馈送到解码器,用于一次解码一个字符,直到到达特殊的句末(EOS)令牌为止。注意机制允许解码器网络在预测目标字符的同时自适应地选择输入图像的部分。Seq2Seq高棉OCR网络是在计算机生成的七种常用高棉字体的文本行图像的大量集合上训练的。在3000个图像测试集上,该模型的性能优于用于高棉语的Tesseract OCR引擎,字符错误率(CER)为1%对3%。 摘要:This paper presents an end-to-end deep convolutional recurrent neural network solution for Khmer optical character recognition (OCR) task. The proposed solution uses a sequence-to-sequence (Seq2Seq) architecture with attention mechanism. The encoder extracts visual features from an input text-line image via layers of residual convolutional blocks and a layer of gated recurrent units (GRU). The features are encoded in a single context vector and a sequence of hidden states which are fed to the decoder for decoding one character at a time until a special end-of-sentence (EOS) token is reached. The attention mechanism allows the decoder network to adaptively select parts of the input image while predicting a target character. The Seq2Seq Khmer OCR network was trained on a large collection of computer-generated text-line images for seven common Khmer fonts. The proposed model's performance outperformed the state-of-art Tesseract OCR engine for Khmer language on the 3000-images test set by achieving a character error rate (CER) of 1% vs 3%.
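
按摘要描述的"CNN+GRU编码、注意力逐字符解码"结构,可以写成如下极简示意(非官方实现,字符表大小、维度与特殊token id均为假设):

```python
# 示意代码:极简 Seq2Seq + Attention 的文本行识别骨架(非官方实现)
import torch
import torch.nn as nn

class TinyOCR(nn.Module):
    def __init__(self, vocab=100, d=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, d, 3, stride=(2, 1), padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d((1, None)))   # 压掉高度,保留宽度作为序列
        self.enc = nn.GRU(d, d, batch_first=True)
        self.emb = nn.Embedding(vocab, d)
        self.attn = nn.Linear(2 * d, 1)
        self.dec = nn.GRUCell(2 * d, d)
        self.out = nn.Linear(d, vocab)

    def forward(self, img, max_len=30, sos=1, eos=2):
        feat = self.cnn(img).squeeze(2).transpose(1, 2)      # (B, W, d) 列特征序列
        enc_out, h = self.enc(feat)                          # 编码整条文本行
        h = h.squeeze(0)                                     # 解码器初始状态
        tok = torch.full((img.size(0),), sos, dtype=torch.long)
        chars = []
        for _ in range(max_len):
            # 注意力:以当前解码状态对编码序列逐位置打分,得到上下文向量
            score = self.attn(torch.cat([enc_out, h.unsqueeze(1).expand_as(enc_out)], -1))
            ctx = (score.softmax(1) * enc_out).sum(1)
            h = self.dec(torch.cat([self.emb(tok), ctx], -1), h)
            tok = self.out(h).argmax(-1)                     # 贪心解码一个字符
            chars.append(tok)
            if (tok == eos).all():                           # 遇到 EOS 停止
                break
        return torch.stack(chars, 1)

pred = TinyOCR()(torch.randn(2, 1, 32, 128))   # 文本行灰度图 (B,1,H,W) -> 字符 id 序列
```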

【9】 Solution for Large-scale Long-tailed Recognition with Noisy Labels 标题:一种含噪声标签的大规模长尾识别解决方案

作者:Yuqiao Xian,Jia-Xin Zhuang,Fufu Yu 机构:Sun Yat-sen University, Tencent Youtu Lab 备注:None 链接:https://arxiv.org/abs/2106.10683 摘要:这是CVPR 2021 AliProducts Challenge的技术报告。AliProducts Challenge是为研究全球领先的电子商务公司所遇到的大规模、细粒度的商品图像识别问题而提出的一项竞赛。大规模产品识别同时面临着标注噪声、数据分布不均衡(长尾)和细粒度分类的挑战。在我们的解决方案中,我们采用了CNN和Transformer的最先进的模型架构,包括ResNeSt、EfficientNetV2和DeiT。我们发现,迭代数据清洗、分类器权重归一化、高分辨率微调和测试时增强(test time augmentation)是提高在含噪、不平衡数据集上训练性能的关键。最后,我们使用集成模型在排行榜上获得了6.4365%的平均分类错误率。 摘要:This is a technical report for CVPR 2021 AliProducts Challenge. AliProducts Challenge is a competition proposed for studying the large-scale and fine-grained commodity image recognition problem encountered by world-leading ecommerce companies. The large-scale product recognition simultaneously meets the challenge of noisy annotations, imbalanced (long-tailed) data distribution and fine-grained classification. In our solution, we adopt state-of-the-art model architectures of both CNNs and Transformer, including ResNeSt, EfficientNetV2, and DeiT. We found that iterative data cleaning, classifier weight normalization, high-resolution finetuning, and test time augmentation are key components to improve the performance of training with the noisy and imbalanced dataset. Finally, we obtain 6.4365% mean class error rate in the leaderboard with our ensemble model.

【10】 TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition 标题:TGRNet:一种面向表格结构识别的表格图形重构网络

作者:Wenyuan Xue,Baosheng Yu,Wen Wang,Dacheng Tao,Qingyong Li 机构:∗Beijing Key Lab of Trafic Data Analysis and Mining, Beijing Jiaotong University, China, †School of Computer Science, The University of Sydney, Australia, ‡JD Explore Academy, China 链接:https://arxiv.org/abs/2106.10598 摘要:表是一种非常有效的数据结构,在商业和科学研究中得到了广泛的应用。考虑到在线和离线文档中的大量表格数据,表格自动识别越来越受到文档分析界的关注。虽然人类可以很容易地理解表格的结构,但对于机器来说,理解表格结构仍然是一个挑战,特别是由于表格的布局和样式多种多样。现有的方法通常将表建模为不同表单元之间的标记序列或邻接矩阵,没有考虑到表单元逻辑位置的重要性,例如单元格位于表的第一行和第二列。本文将表结构识别问题转化为表图重构问题,提出了一种用于表结构识别的端到端可训练表图重构网络。具体地说,该方法有两个主要分支,一个单元检测分支和一个单元逻辑位置分支,共同预测不同单元的空间位置和逻辑位置。在三个流行的表识别数据集和一个新的带有表图标注的数据集(TableGraph-350K)上的实验结果证明了所提出的TGRNet在表结构识别中的有效性。代码和注释将公开。 摘要:A table arranging data in rows and columns is a very effective data structure, which has been widely used in business and scientific research. Considering large-scale tabular data in online and offline documents, automatic table recognition has attracted increasing attention from the document analysis community. Though human can easily understand the structure of tables, it remains a challenge for machines to understand that, especially due to a variety of different table layouts and styles. Existing methods usually model a table as either the markup sequence or the adjacency matrix between different table cells, failing to address the importance of the logical location of table cells, e.g., a cell is located in the first row and the second column of the table. In this paper, we reformulate the problem of table structure recognition as the table graph reconstruction, and propose an end-to-end trainable table graph reconstruction network (TGRNet) for table structure recognition. Specifically, the proposed method has two main branches, a cell detection branch and a cell logical location branch, to jointly predict the spatial location and the logical location of different cells. Experimental results on three popular table recognition datasets and a new dataset with table graph annotations (TableGraph-350K) demonstrate the effectiveness of the proposed TGRNet for table structure recognition. Code and annotations will be made publicly available.

【11】 Low-Power Multi-Camera Object Re-Identification using Hierarchical Neural Networks 标题:基于递阶神经网络的低功耗多摄像机目标再识别

作者:Abhinav Goel,Caleb Tung,Xiao Hu,Haobo Wang,James C. Davis,George K. Thiruvathukal,Yung-Hsiang Lu 机构:Purdue University, School of Electrical and Computer Engineering, West Lafayette, IN, USA, ∗Loyola University Chicago, Department of Computer Science, Chicago, IL, USA 备注:Accepted to ISLPED 2021 链接:https://arxiv.org/abs/2106.10588 摘要:低功耗计算机视觉在嵌入式设备上有着广泛的应用。本文描述了一种低功耗的对象再识别(reID)技术:将查询图像与先前看到的图像库进行匹配。最先进的技术依赖于大型计算密集型深度神经网络(DNNs)。我们提出了一种新的分层DNN结构,该结构使用训练数据集中的属性标签来执行有效的对象reID。在层次结构中的每个节点上,一个小DNN标识查询图像的不同属性。每个叶节点上的小DNN专门用于重新标识库的一个子集:仅具有沿着从根到叶的路径标识的属性的图像。因此,在使用少量小dnn进行处理之后,可以准确地重新识别查询图像。我们将我们的方法与最先进的对象里德技术进行了比较。在准确率降低4%的情况下,我们的方法实现了显著的资源节约:减少74%的内存、72%的操作和67%的查询延迟,从而减少65%的能耗。 摘要:Low-power computer vision on embedded devices has many applications. This paper describes a low-power technique for the object re-identification (reID) problem: matching a query image against a gallery of previously seen images. State-of-the-art techniques rely on large, computationally-intensive Deep Neural Networks (DNNs). We propose a novel hierarchical DNN architecture that uses attribute labels in the training dataset to perform efficient object reID. At each node in the hierarchy, a small DNN identifies a different attribute of the query image. The small DNN at each leaf node is specialized to re-identify a subset of the gallery: only the images with the attributes identified along the path from the root to a leaf. Thus, a query image is re-identified accurately after processing with a few small DNNs. We compare our method with state-of-the-art object reID techniques. With a 4% loss in accuracy, our approach realizes significant resource savings: 74% less memory, 72% fewer operations, and 67% lower query latency, yielding 65% less energy consumption.

【12】 Supervised learning for crop/weed classification based on color and texture features 标题:基于颜色和纹理特征的有监督学习在作物/杂草分类中的应用

作者:Faiza Mekhalfa,Fouad Yacef 机构:Division Productique et Robotique, Centre de Developpement de Technologies Avancees, Algiers, Algeria 链接:https://arxiv.org/abs/2106.10581 摘要:近年来,计算机视觉技术引起了人们对精确农业的极大兴趣。所有基于计算机视觉的精确农业任务的共同目标是检测感兴趣的对象(如作物、杂草)并将其与背景区分开来。杂草是作物间生长的有害植物,它们争夺养分、水分和阳光,导致作物产量下降。杂草检测和绘图对于特定地点的杂草管理至关重要,以减少劳动力成本和除草剂的影响。研究了利用颜色和纹理特征识别大豆作物和杂草的方法。采用两种颜色空间(RGB、HSV)、灰度共生矩阵(GLCM)和局部二值模式(LBP)等特征提取方法训练支持向量机(SVM)分类器。该实验是在公开的无人机(UAV)上获取的大豆作物图像数据集上进行的。实验结果表明,将颜色特征与LBP特征相结合可以获得最高的准确率(96%以上)。 摘要:Computer vision techniques have attracted a great interest in precision agriculture, recently. The common goal of all computer vision-based precision agriculture tasks is to detect the objects of interest (e.g., crop, weed) and discriminating them from the background. The Weeds are unwanted plants growing among crops competing for nutrients, water, and sunlight, causing losses to crop yields. Weed detection and mapping is critical for site-specific weed management to reduce the cost of labor and impact of herbicides. This paper investigates the use of color and texture features for discrimination of Soybean crops and weeds. Feature extraction methods including two color spaces (RGB, HSV), gray level Co-occurrence matrix (GLCM), and Local Binary Pattern (LBP) are used to train the Support Vector Machine (SVM) classifier. The experiment was carried out on image dataset of soybean crop, obtained from an unmanned aerial vehicle (UAV), which is publicly available. The results from the experiment showed that the highest accuracy (above 96%) was obtained from the combination of color and LBP features.
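
论文流程中"颜色 + GLCM + LBP特征 → SVM分类"可用如下示意代码表达(非官方实现,图像与标签为随机示例,仅展示接口用法):

```python
# 示意代码:手工特征(颜色均值 + GLCM 统计量 + LBP 直方图) + SVM 分类
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern
from sklearn.svm import SVC

def extract_features(rgb):
    """rgb: (H, W, 3) uint8 图像块 -> 特征向量"""
    gray = rgb.mean(axis=2).astype(np.uint8)
    color = rgb.reshape(-1, 3).mean(axis=0) / 255.0                       # 颜色特征
    glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    texture = [graycoprops(glcm, p)[0, 0] for p in
               ('contrast', 'homogeneity', 'energy', 'correlation')]       # GLCM 纹理特征
    lbp = local_binary_pattern(gray, P=8, R=1, method='uniform')
    hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)      # LBP 直方图
    return np.concatenate([color, texture, hist])

rng = np.random.default_rng(0)
X = np.stack([extract_features(rng.integers(0, 256, (32, 32, 3), dtype=np.uint8))
              for _ in range(20)])
y = rng.integers(0, 2, 20)                        # 0: 作物, 1: 杂草(随机示例标签)
print(SVC(kernel='rbf').fit(X, y).score(X, y))
```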

【13】 Practical Transferability Estimation for Image Classification Tasks 标题:一种实用的图像分类任务可转移性估计

作者:Yang Tan,Yang Li,Shao-Lun Huang 机构:Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, China 备注:12 pages 链接:https://arxiv.org/abs/2106.10479 摘要:迁移性估计是迁移学习中的一个重要问题,它用来预测将源模型(源任务)迁移到目标任务时的性能。最近的分析性可转移性度量被广泛应用于源模型选择和多任务学习。在具有挑战性的跨域跨任务传输设置下,早期的指标并不能很好地工作,但是最近的OTCE评分通过使用辅助任务获得了显著的性能。一个名为OT-based NCE score的简化版本牺牲了准确度以提高效率,但它可以进一步改进。因此,我们提出了一个实用的可转移性度量JC-NCE评分,以进一步提高跨域跨任务可转移性估计的性能,该评分比OTCE评分更有效,比基于OT的NCE评分更准确。具体来说,我们通过求解一个同时考虑样本距离和标签距离的最优传输问题来建立源数据和目标数据之间的联合对应关系,然后计算可传输性得分作为负条件熵。在数据集内和数据集间转移设置下的广泛验证表明,我们的JC-NCE得分优于基于OT的NCE得分,分别获得约7%和12%的收益。 摘要:Transferability estimation is an essential problem in transfer learning to predict how good the performance is when transfer a source model (source task) to a target task. Recent analytical transferability metrics have been widely used for source model selection and multi-task learning. Earlier metrics does not work sufficiently well under the challenging cross-domain cross-task transfer settings, but recent OTCE score achieves a noteworthy performance using auxiliary tasks. A simplified version named OT-based NCE score sacrifices accuracy to be more efficient, but it can be further improved. Consequently, we propose a practical transferability metric called JC-NCE score to further improve the cross-domain cross-task transferability estimation performance, which is more efficient than the OTCE score and more accurate than the OT-based NCE score. Specifically, we build the joint correspondences between source and target data via solving an optimal transport problem with considering both the sample distance and label distance, and then compute the transferability score as the negative conditional entropy. Extensive validations under the intra-dataset and inter-dataset transfer settings demonstrate that our JC-NCE score outperforms the OT-based NCE score with about 7% and 12% gains, respectively.
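
JC-NCE得分的计算思路(样本距离+标签距离构造代价 → 最优传输耦合 → 负条件熵)可以粗略示意如下(非官方实现;其中标签距离简化为0/1失配,与原文更精细的标签距离定义不同):

```python
# 示意代码:Sinkhorn 求最优传输耦合,再由耦合聚合源/目标标签联合分布,取负条件熵作为可迁移性得分
import numpy as np

def sinkhorn(cost, reg=0.1, n_iter=200):
    K = np.exp(-cost / reg)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])      # 均匀边缘分布
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                    # 耦合矩阵 (Ns, Nt)

def jc_nce_score(fs, ys, ft, yt, lam=1.0, eps=1e-12):
    """fs/ft: 源/目标特征 (Ns,D)/(Nt,D); ys/yt: 对应标签。得分越高表示越可迁移。"""
    sample_cost = ((fs[:, None, :] - ft[None, :, :]) ** 2).sum(-1)
    sample_cost = sample_cost / sample_cost.max()          # 归一化,避免 Sinkhorn 数值下溢
    label_cost = (ys[:, None] != yt[None, :]).astype(float)  # 标签距离此处粗略用 0/1 失配代替
    pi = sinkhorn(sample_cost + lam * label_cost)
    joint = np.zeros((ys.max() + 1, yt.max() + 1))         # 由耦合聚合 (源标签, 目标标签) 联合分布
    for i in range(len(ys)):
        for j in range(len(yt)):
            joint[ys[i], yt[j]] += pi[i, j]
    p_src = joint.sum(1, keepdims=True)
    cond_ent = -np.sum(joint * np.log(joint / (p_src + eps) + eps))
    return -cond_ent                                       # 负条件熵

rng = np.random.default_rng(0)
print(jc_nce_score(rng.normal(size=(40, 16)), rng.integers(0, 3, 40),
                   rng.normal(size=(50, 16)), rng.integers(0, 4, 50)))
```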

【14】 Place recognition survey: An update on deep learning approaches 标题:位置识别综述:深度学习方法的最新进展

作者:Tiago Barros,Ricardo Pereira,Luís Garrote,Cristiano Premebida,Urbano J. Nunes 机构:University of Coimbra, Institute of Systems and Robotics, Department of Electrical and Computer Engineering 备注:Under review in IEEE Transactions on Intelligent Vehicles. This work has been submitted on the 13/01/2021 to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Upon acceptance of the article by IEEE, the preprint article will be replaced with the accepted version 链接:https://arxiv.org/abs/2106.10458 摘要:自动驾驶汽车(AV)越来越能够在动态变化的复杂环境中进行导航。使这些智能车辆能够克服这些条件并变得更加自主的一个关键组成部分,是日益成熟的感知与定位系统。作为定位系统的一部分,位置识别得益于其他感知任务(如场景分类或物体识别)的最新发展,尤其是深度学习(DL)框架的出现。本文综述了近年来位置识别的研究方法,特别是基于深度学习的方法。这项工作的贡献有两个方面:考察了应用于位置识别的新型传感器(如三维激光雷达和雷达);并将各种基于DL的位置识别工作分为有监督、无监督、半监督、并行和层次等类别。首先,本综述介绍了位置识别的关键概念,为读者提供背景;然后讨论了传感器的特性;随后详细阐述了各种基于DL的工作,并对每个框架进行总结。从这次综述中得到的一些经验包括:NetVLAD对于有监督端到端学习的重要性;无监督方法在位置识别中的优势,尤其适用于跨领域应用;以及近期工作在追求更高性能的同时也越来越注重效率。 摘要:Autonomous Vehicles (AV) are becoming more capable of navigating in complex environments with dynamic and changing conditions. A key component that enables these intelligent vehicles to overcome such conditions and become more autonomous is the sophistication of the perception and localization systems. As part of the localization system, place recognition has benefited from recent developments in other perception tasks such as place categorization or object recognition, namely with the emergence of deep learning (DL) frameworks. This paper surveys recent approaches and methods used in place recognition, particularly those based on deep learning. The contributions of this work are twofold: surveying recent sensors such as 3D LiDARs and RADARs, applied in place recognition; and categorizing the various DL-based place recognition works into supervised, unsupervised, semi-supervised, parallel, and hierarchical categories. First, this survey introduces key place recognition concepts to contextualize the reader. Then, sensor characteristics are addressed. This survey proceeds by elaborating on the various DL-based works, presenting summaries for each framework. Some lessons learned from this survey include: the importance of NetVLAD for supervised end-to-end learning; the advantages of unsupervised approaches in place recognition, namely for cross-domain applications; or the increasing tendency of recent works to seek, not only for higher performance but also for higher efficiency.

【15】 Brain tumor grade classification Using LSTM Neural Networks with Domain Pre-Transforms 标题:基于域预变换的LSTM神经网络脑肿瘤分级

作者:Maedeh Sadat Fasihi,Wasfy B. Mikhael 机构:Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 备注:4 pages, 5 figures, MWSCAS2021 链接:https://arxiv.org/abs/2106.10889 摘要:图像分类方法的性能很大程度上依赖于高质量的标注,而这类标注的获取并不容易,对医学数据尤其如此。为了缓解这一局限,本文提出了一种基于手工特征组合的弱监督图像分类方法。我们假设将这些手工特征与长短时记忆(LSTM)分类器相结合,可以减少弱标签对分类准确率的不利影响。我们提出的算法基于在小波域和离散余弦变换(DCT)域中选择适当的数据表示,然后将这些信息输入LSTM网络,以利用数据的序列特性。所提出的高效低维特征借助浅层深度学习模型的能力,以较低的计算成本获得更高的性能。为验证所提策略的有效性,我们对脑肿瘤分级进行了实验,在256×256分辨率下取得了最先进的性能,并通过一系列综合实验分析了各组成部分对性能的影响。 摘要:The performance of image classification methods heavily relies on the high-quality annotations, which are not easily affordable, particularly for medical data. To alleviate this limitation, in this study, we propose a weakly supervised image classification method based on combination of hand-crafted features. We hypothesize that integration of these hand-crafted features alongside Long short-term memory (LSTM) classifier can reduce the adverse effects of weak labels in classification accuracy. Our proposed algorithm is based on selecting the appropriate domain representations of the data in Wavelet and Discrete Cosine Transform (DCT) domains. This information is then fed into LSTM network to account for the sequential nature of the data. The proposed efficient, low dimensional features exploit the power of shallow deep learning models to achieve higher performance with lower computational cost. In order to show efficacy of the proposed strategy, we have experimented classification of brain tumor grades and achieved the state of the art performance with the resolution of 256 x 256. We also conducted a comprehensive set of experiments to analyze the effect of each component on the performance.
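
按摘要流程,"小波域 + DCT域低维特征 → LSTM分级"可示意如下(非官方实现,切片数、特征维度与级别数均为假设):

```python
# 示意代码:每张切片提取小波低频子带 + DCT 低频系数,按切片序列送入 LSTM 做肿瘤分级
import numpy as np
import pywt
import torch
import torch.nn as nn
from scipy.fft import dctn

def slice_features(img, keep=8):
    """img: (H, W) 单张切片 -> 小波近似子带与 DCT 低频系数拼成的低维特征。"""
    cA, _ = pywt.dwt2(img, 'haar')                       # 小波分解,取低频(近似)子带
    dct_low = dctn(img, norm='ortho')[:keep, :keep]      # DCT 左上角低频系数
    return np.concatenate([cA.ravel()[:keep * keep], dct_low.ravel()])

class GradeLSTM(nn.Module):
    def __init__(self, in_dim=128, hidden=64, n_grades=3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_grades)

    def forward(self, x):                                # x: (B, 切片数, in_dim)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])

volume = np.random.rand(10, 64, 64)                      # 假设一个病例含 10 张切片
feats = torch.tensor(np.stack([slice_features(s) for s in volume]), dtype=torch.float32)
logits = GradeLSTM()(feats.unsqueeze(0))                 # -> (1, 3) 肿瘤级别 logits
```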

分割|语义相关(10篇)

【1】 Distilling effective supervision for robust medical image segmentation with noisy labels 标题:为含噪声标签的鲁棒医学图像分割提炼有效监督

作者:Jialin Shi,Ji Wu 机构: Department of Electronic Engineering, Tsinghua University, Beijing, China, Institute for Precision Medicine, Tsinghua University, Beijing, China 备注:Accepted to MICCAI 2021 链接:https://arxiv.org/abs/2106.11099 摘要:尽管深度学习方法在医学图像分割中取得了成功,但人类水平的性能依赖于大量的训练数据和高质量的注释,这些数据的收集既昂贵又耗时。事实上,存在低质量的标注和标签噪声,导致学习模型的性能不理想。噪声标签分割学习的两个主要方向是像素级噪声鲁棒训练和图像级噪声鲁棒训练。在这项工作中,我们提出了一个新的框架来解决分割与噪声标签提取有效的监督信息,从像素和图像水平。特别地,我们明确地估计每个像素的不确定性作为像素级噪声估计,并提出了同时使用原始标签和伪标签的像素级鲁棒学习。此外,我们提出了一种图像级鲁棒学习方法,以适应更多的信息,作为像素级学习的补充。我们在模拟和真实的噪声数据集上进行了大量的实验。实验结果表明,与现有的基于噪声标签的医学图像分割基线相比,该方法具有更好的性能。 摘要:Despite the success of deep learning methods in medical image segmentation tasks, the human-level performance relies on massive training data with high-quality annotations, which are expensive and time-consuming to collect. The fact is that there exist low-quality annotations with label noise, which leads to suboptimal performance of learned models. Two prominent directions for segmentation learning with noisy labels include pixel-wise noise robust training and image-level noise robust training. In this work, we propose a novel framework to address segmenting with noisy labels by distilling effective supervision information from both pixel and image levels. In particular, we explicitly estimate the uncertainty of every pixel as pixel-wise noise estimation, and propose pixel-wise robust learning by using both the original labels and pseudo labels. Furthermore, we present an image-level robust learning method to accommodate more information as the complements to pixel-level learning. We conduct extensive experiments on both simulated and real-world noisy datasets. The results demonstrate the advantageous performance of our method compared to state-of-the-art baselines for medical image segmentation with noisy labels.
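
摘要中"像素级不确定性估计 + 原始标签与伪标签并用"的思想,可以用一个简化的损失函数示意(非论文的具体公式,阈值与加权方式均为假设):

```python
# 示意代码:以预测熵作为像素级不确定性,低不确定性像素信任原始(可能含噪)标签,
# 高不确定性像素改用模型自身的伪标签
import torch
import torch.nn.functional as F

def pixel_robust_loss(logits, noisy_label, tau=0.5):
    """logits: (B, C, H, W); noisy_label: (B, H, W) 可能含噪的分割标注。"""
    prob = logits.softmax(dim=1)
    entropy = -(prob * prob.clamp_min(1e-8).log()).sum(dim=1)                   # 像素级不确定性
    uncertainty = entropy / torch.log(torch.tensor(float(logits.size(1))))      # 归一化到 [0,1]
    pseudo = prob.argmax(dim=1)                                                  # 像素级伪标签
    loss_orig = F.cross_entropy(logits, noisy_label, reduction='none')
    loss_pseudo = F.cross_entropy(logits, pseudo, reduction='none')
    w = (uncertainty < tau).float()                                              # 低不确定性 -> 用原标签
    return (w * loss_orig + (1 - w) * loss_pseudo).mean()

loss = pixel_robust_loss(torch.randn(2, 4, 32, 32), torch.randint(0, 4, (2, 32, 32)))
```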

【2】 Segmentation of cell-level anomalies in electroluminescence images of photovoltaic modules 标题:光伏组件电致发光图像中单元级异常的分割

作者:Urtzi Otamendi,Iñigo Martinez,Marco Quartulli,Igor G. Olaizola,Elisabeth Viles,Werther Cambarau 机构:Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebasti´an , Spain, TECNUN School of Engineering, University of Navarra, Donostia-San Sebasti´an , Spain 备注:None 链接:https://arxiv.org/abs/2106.10962 摘要:在光伏电站的运行维护中,故障的早期识别对于维持发电效率和延长组件寿命至关重要。在所有缺陷中,电池水平异常可能导致严重故障,并可能影响周围光伏组件的长期运行。这些精细缺陷通常通过高空间分辨率的电致发光(EL)成像来捕获。获取此类图像的困难限制了数据的可用性。在这项工作中,使用了多种数据资源和扩充技术来克服这一限制。目前最先进的检测方法仅从单个光伏电池图像中提取低层信息,其性能取决于可用的训练数据。在这篇文章中,我们提出了一个端到端的深度学习管道,通过EL图像从整个光伏组件中检测、定位和分割电池级异常。提出的模块化流水线结合了三种深度学习技术:1.目标检测(改进的快速RNN)、2.图像分类(EfficientNet)和3.弱监督分割(自动编码器)。管道的模块化特性允许将深度学习模型升级为最新技术的进一步改进,并将管道扩展到新功能。 摘要:In the operation & maintenance (O&M) of photovoltaic (PV) plants, the early identification of failures has become crucial to maintain productivity and prolong components' life. Of all defects, cell-level anomalies can lead to serious failures and may affect surrounding PV modules in the long run. These fine defects are usually captured with high spatial resolution electroluminescence (EL) imaging. The difficulty of acquiring such images has limited the availability of data. For this work, multiple data resources and augmentation techniques have been used to surpass this limitation. Current state-of-the-art detection methods extract barely low-level information from individual PV cell images, and their performance is conditioned by the available training data. In this article, we propose an end-to-end deep learning pipeline that detects, locates and segments cell-level anomalies from entire photovoltaic modules via EL images. The proposed modular pipeline combines three deep learning techniques: 1. object detection (modified Faster-RNN), 2. image classification (EfficientNet) and 3. weakly supervised segmentation (autoencoder). The modular nature of the pipeline allows to upgrade the deep learning models to the further improvements in the state-of-the-art and also extend the pipeline towards new functionalities.

【3】 Surgical data science for safe cholecystectomy: a protocol for segmentation of hepatocystic anatomy and assessment of the critical view of safety 标题:面向安全胆囊切除术的外科数据科学:肝囊解剖结构分割与安全性关键视图(CVS)评估方案

作者:Pietro Mascagni,Deepak Alapatt,Alain Garcia,Nariaki Okamoto,Armine Vardazaryan,Guido Costamagna,Bernard Dallemagne,Nicolas Padoy 机构:This work was partially supported by French State Funds managed by the “Agence Nationale de la Recherche (ANR)” through, the “Investissements d’Avenir” (Investments for the Future) Program under Grant ANR-,-IAHU-, (IHU-Strasbourg) and 备注:24 pages, 34 figures 链接:https://arxiv.org/abs/2106.10916 摘要:微创图像引导手术在很大程度上依赖于视觉。因此,用于手术视频分析的深度学习模型可以支持视觉任务,例如评估腹腔镜胆囊切除术(LC)中的安全性关键视图(CVS),可能有助于提高手术的安全性和效率。然而,这些模型的性能、可靠性和再现性在很大程度上取决于其开发过程中使用的数据和注释的质量。在这里,我们提出了一个协议,检查表和视觉例子,以促进一致的注释肝囊性解剖和CVS标准。我们相信共享注释指南有助于建立可信赖的多中心数据集来评估性能的普遍性,从而加速外科视频分析深度学习模型的临床翻译。 摘要:Minimally invasive image-guided surgery heavily relies on vision. Deep learning models for surgical video analysis could therefore support visual tasks such as assessing the critical view of safety (CVS) in laparoscopic cholecystectomy (LC), potentially contributing to surgical safety and efficiency. However, the performance, reliability and reproducibility of such models are deeply dependent on the quality of data and annotations used in their development. Here, we present a protocol, checklists, and visual examples to promote consistent annotation of hepatocystic anatomy and CVS criteria. We believe that sharing annotation guidelines can help build trustworthy multicentric datasets for assessing generalizability of performance, thus accelerating the clinical translation of deep learning models for surgical video analysis.

【4】 Large-scale image segmentation based on distributed clustering algorithms 标题:基于分布式聚类算法的大规模图像分割

作者:Ran Lu,Aleksandar Zlateski,H. Sebastian Seung 备注:8 pages, 5 figures 链接:https://arxiv.org/abs/2106.10795 摘要:三维图像分割的许多方法都是基于超体素到图像区域的层次聚类。在这里,我们描述了一个分布式算法能够处理大量的超体素。该算法是递归工作的,区域被划分成块,这些块由多个worker独立地并行处理。在递归过程的每一轮中,所有维度的块大小都会加倍,直到单个块包含整个图像。最终的结果是可以证明独立于分块方案的,并且与处理整个图像而不分块是一样的。这是不寻常的,因为一对相邻区域是由界面上的某些统计属性(例如,平均值或中位数)来评分的,并且界面可以扩展到任意多个块。诀窍是延迟与块边界接触的区域的合并决策,并且只在区域完全包含在块中之后的一轮中完成它们。我们通过聚类一个在1350亿个超体素之间具有超过1.5万亿个边缘的亲和图来演示该算法,这些超体素来自于一个三维电子显微大脑图像。 摘要:Many approaches to 3D image segmentation are based on hierarchical clustering of supervoxels into image regions. Here we describe a distributed algorithm capable of handling a tremendous number of supervoxels. The algorithm works recursively, the regions are divided into chunks that are processed independently in parallel by multiple workers. At each round of the recursive procedure, the chunk size in all dimensions are doubled until a single chunk encompasses the entire image. The final result is provably independent of the chunking scheme, and the same as if the entire image were processed without division into chunks. This is nontrivial because a pair of adjacent regions is scored by some statistical property (e.g. mean or median) of the affinities at the interface, and the interface may extend over arbitrarily many chunks. The trick is to delay merge decisions for regions that touch chunk boundaries, and only complete them in a later round after the regions are fully contained within a chunk. We demonstrate the algorithm by clustering an affinity graph with over 1.5 trillion edges between 135 billion supervoxels derived from a 3D electron microscopic brain image.
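以下为一个高度简化的Python草图(数据结构与评分方式均为示例假设),仅用于说明"推迟与块边界接触的区域的合并决策、待下一轮块尺寸加倍后再完成"这一核心技巧;实际的分布式实现还需处理区域统计量的跨块合并与通信等细节:

```python
def cluster_chunk(edges, boundary_regions, threshold):
    """edges: [(score, region_a, region_b)]; boundary_regions: 与当前块边界接触的区域集合.
    返回本轮完成的合并, 以及被推迟到下一轮(块尺寸加倍后)再处理的边."""
    merges, deferred = [], []
    for score, a, b in sorted(edges, reverse=True):
        if a in boundary_regions or b in boundary_regions:
            deferred.append((score, a, b))     # 区域尚未完整包含在块内, 推迟合并决策
        elif score >= threshold:
            merges.append((a, b))              # 完整位于块内的区域对可立即合并
    return merges, deferred

# 下一轮中块尺寸在各维加倍, 原先的边界区域被完整包含后, deferred 中的边再重新评分并处理
```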

【5】 Quality-Aware Memory Network for Interactive Volumetric Image Segmentation 标题:交互式体图像分割的质量感知记忆网络

作者:Tianfei Zhou,Liulei Li,Gustav Bredell,Jianwu Li,Ender Konukoglu 机构: Computer Vision Laboratory, ETH Zurich, Switzerland, School of Computer Science and Technology, Beijing Institute of Technology, China 备注:MICCAI 2021. Code: this https URL 链接:https://arxiv.org/abs/2106.10686 摘要:尽管近年来医学图像自动分割技术取得了一些进展,但全自动分割结果往往不能满足临床应用的需要,需要进一步的细化。在这项工作中,我们提出了一个质量感知记忆网络的交互式分割三维医学图像。在用户对任意切片的引导下,首先利用交互网络进行初始二维分割。质量感知存储网络随后在整个卷上双向传播初始分割估计。基于其他切片上的附加用户指导的后续细化可以以相同的方式合并。为了进一步促进交互式分割,引入了一个质量评估模块,根据每个切片的当前分割质量来建议下一个切片到另一个切片。该网络具有两个吸引人的特点:1)记忆增强网络提供了快速编码过去分割信息的能力,这些信息将被检索用于其他切片的分割;2) 质量评估模块使模型能够直接估计分割预测的质量,这允许一种主动学习范式,其中用户优先标记质量最低的切片进行多轮细化。该网络提供了一个健壮的交互式分词引擎,可以很好地推广到各种类型的用户注释(如涂鸦、方框)。在不同医学数据集上的实验结果表明,与现有技术相比,该方法具有优越性。 摘要:Despite recent progress of automatic medical image segmentation techniques, fully automatic results usually fail to meet the clinical use and typically require further refinement. In this work, we propose a quality-aware memory network for interactive segmentation of 3D medical images. Provided by user guidance on an arbitrary slice, an interaction network is firstly employed to obtain an initial 2D segmentation. The quality-aware memory network subsequently propagates the initial segmentation estimation bidirectionally over the entire volume. Subsequent refinement based on additional user guidance on other slices can be incorporated in the same manner. To further facilitate interactive segmentation, a quality assessment module is introduced to suggest the next slice to segment based on the current segmentation quality of each slice. The proposed network has two appealing characteristics: 1) The memory-augmented network offers the ability to quickly encode past segmentation information, which will be retrieved for the segmentation of other slices; 2) The quality assessment module enables the model to directly estimate the qualities of segmentation predictions, which allows an active learning paradigm where users preferentially label the lowest-quality slice for multi-round refinement. The proposed network leads to a robust interactive segmentation engine, which can generalize well to various types of user annotations (e.g., scribbles, boxes). Experimental results on various medical datasets demonstrate the superiority of our approach in comparison with existing techniques.

【6】 Exploring Semantic Relationships for Unpaired Image Captioning 标题:探索不成对图像字幕的语义关系

作者:Fenglin Liu,Meng Gao,Tianhao Zhang,Yuexian Zou 机构:ADSPLAB, School of ECE, Peking University, School of ICE, Beijing University of Posts and Telecommunications, Beijing Foreign Studies University 链接:https://arxiv.org/abs/2106.10658 摘要:近年来,图像字幕引起了学术界和工业界的极大兴趣。大多数现有的系统都是建立在由图像-句子对组成的大规模数据集上的,然而,构建这些数据集非常耗时。此外,即使是最先进的图像字幕系统,也很难实现对图像的深层理解。在这项工作中,我们通过高级语义信息将视觉域和语言域连接起来,实现了不成对的图像字幕。其动机源于这样一个事实:同一模态下的语义概念既可以从图像中提取,也可以从描述中提取。为了进一步提高模型生成字幕的质量,我们提出了语义关系探索器(Semantic Relationship Explorer),它探索语义概念之间的关系,以便更好地理解图像。在MSCOCO数据集上的大量实验表明,在没有成对数据集的情况下,我们也可以生成理想的字幕。此外,该方法在配对设置下提升了5个强基线,其中CIDEr评分的最大提升达到8%,表明该方法是有效的,并能很好地推广到各种模型。 摘要:Recently, image captioning has aroused great interest in both academic and industrial worlds. Most existing systems are built upon large-scale datasets consisting of image-sentence pairs, which, however, are time-consuming to construct. In addition, even for the most advanced image captioning systems, it is still difficult to realize deep image understanding. In this work, we achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information. The motivation stems from the fact that the semantic concepts with the same modality can be extracted from both images and descriptions. To further improve the quality of captions generated by the model, we propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image. Extensive experiments on MSCOCO dataset show that we can generate desirable captions without paired datasets. Furthermore, the proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%, demonstrating that it is effective and generalizes well to a wide range of models.

【7】 Remote Sensing Images Semantic Segmentation with General Remote Sensing Vision Model via a Self-Supervised Contrastive Learning Method 标题:基于自监督对比学习的广义遥感视觉模型遥感图像语义分割

作者:Haifeng Li,Yi Li,Guo Zhang,Ruoyun Liu,Haozhe Huang,Qing Zhu,Chao Tao 机构: Zhang is with with the State Key Laboratory of Information Engineeringin Surveying, Wuhan University 备注:13 pages, 10 figures, 4 tables 链接:https://arxiv.org/abs/2106.10605 摘要:一种新的学习范式&自监督学习(SSL)可以用来解决这类问题,它通过预先训练一个包含大量未标记图像的通用模型,然后对一个包含少量标记样本的下游任务进行微调。对比学习是SSL的一种典型方法,它可以学习一般不变特征。然而,现有的对比学习大多是针对分类任务设计的,以获得图像级的表示,对于需要像素级识别的语义分割任务来说,这种表示可能是次优的。因此,本文提出了一种基于全局风格和局部匹配的对比学习网络(GLCNet)的遥感语义分割方法。具体来说,通过全局风格对比模块,我们认为风格特征能够更好地代表图像的整体特征,从而更好地学习图像级的表达;设计了局部特征匹配对比模块,学习局部区域的表示,有利于语义分割。我们对四个遥感语义分割数据集进行了评估,实验结果表明,该方法的性能优于现有的自监督方法和ImageNet预训练方法。具体来说,在原始数据集上加1%的注释,我们的方法在ISPRS波茨坦数据集上的Kappa提高了6%,在深地球覆盖分类数据集上的Kappa提高了3%。此外,当上游任务和下游任务的数据集存在一定差异时,该方法的性能优于监督学习。本文的研究促进了自监督学习在遥感语义分割领域的发展。源代码位于https://github.com/GeoX-Lab/G-RSIM. 摘要:A new learning paradigm, self-supervised learning (SSL), can be used to solve such problems by pre-training a general model with large unlabeled images and then fine-tuning on a downstream task with very few labeled samples. Contrastive learning is a typical method of SSL, which can learn general invariant features. However, most of the existing contrastive learning is designed for classification tasks to obtain an image-level representation, which may be sub-optimal for semantic segmentation tasks requiring pixel-level discrimination. Therefore, we propose Global style and Local matching Contrastive Learning Network (GLCNet) for remote sensing semantic segmentation. Specifically, the global style contrastive module is used to learn an image-level representation better, as we consider the style features can better represent the overall image features; The local features matching contrastive module is designed to learn representations of local regions which is beneficial for semantic segmentation. We evaluate four remote sensing semantic segmentation datasets, and the experimental results show that our method mostly outperforms state-of-the-art self-supervised methods and ImageNet pre-training. Specifically, with 1\% annotation from the original dataset, our approach improves Kappa by 6\% on the ISPRS Potsdam dataset and 3\% on Deep Globe Land Cover Classification dataset relative to the existing baseline. Moreover, our method outperforms supervised learning when there are some differences between the datasets of upstream tasks and downstream tasks. Our study promotes the development of self-supervised learning in the field of remote sensing semantic segmentation. The source code is available at https://github.com/GeoX-Lab/G-RSIM.
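下面给出"全局风格对比"模块的一个示意性草图(温度系数等超参数与具体实现细节均为假设):以特征图的逐通道均值与标准差作为图像级风格特征,并用InfoNCE损失拉近同一图像两个增强视图的风格表示:

```python
import torch
import torch.nn.functional as F

def style_feature(feat):
    """feat: (B,C,H,W); 以逐通道均值与标准差作为图像级的"风格"特征."""
    return torch.cat([feat.mean(dim=[2, 3]), feat.std(dim=[2, 3])], dim=1)

def info_nce(q, k, temperature=0.1):
    """q, k: 同一批图像两个增强视图的风格特征 (B,D), 同一图像的两个视图互为正样本."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```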

【8】 Interactive Object Segmentation with Dynamic Click Transform 标题:基于动态点击变换的交互式对象分割

作者:Chun-Tse Lin,Wei-Chih Tu,Chih-Ting Liu,Shao-Yi Chien 机构:Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan 备注:This paper was accepted by IEEE International Conference on Image Processing (ICIP) 2021 链接:https://arxiv.org/abs/2106.10465 摘要:在交互式分割中,用户首先点击目标物体对主体进行分割,然后对标记错误的区域进行修正,迭代地细化分割模板。大多数现有的方法将这些用户提供的点击转换成交互地图,并将它们与图像连接起来作为输入张量。通常,交互贴图是通过测量每个像素到点击点的距离来确定的,忽略了点击和错误标记区域之间的关系。提出了一种由空间DCT和特征DCT组成的动态点击变换网络(DCT-Net),以更好地表示用户交互。空间离散余弦变换(DCT)根据目标尺度对每个用户提供的点击进行离散距离变换,特征离散余弦变换将提取的特征映射归一化为从点击点预测的特定分布。在三个标准的基准数据集上,我们证明了我们提出的方法的有效性,并取得了良好的性能。 摘要:In the interactive segmentation, users initially click on the target object to segment the main body and then provide corrections on mislabeled regions to iteratively refine the segmentation masks. Most existing methods transform these user-provided clicks into interaction maps and concatenate them with image as the input tensor. Typically, the interaction maps are determined by measuring the distance of each pixel to the clicked points, ignoring the relation between clicks and mislabeled regions. We propose a Dynamic Click Transform Network~(DCT-Net), consisting of Spatial-DCT and Feature-DCT, to better represent user interactions. Spatial-DCT transforms each user-provided click with individual diffusion distance according to the target scale, and Feature-DCT normalizes the extracted feature map to a specific distribution predicted from the clicked points. We demonstrate the effectiveness of our proposed method and achieve favorable performance compared to the state-of-the-art on three standard benchmark datasets.

【9】 MSN: Efficient Online Mask Selection Network for Video Instance Segmentation 标题:MSN:高效的视频实例分割在线掩码选择网络

作者:Vidit Goel,Jiachen Li,Shubhika Garg,Harsh Maheshwari,Humphrey Shi 机构:Picsart AI Research (PAIR) 备注:3rd Place Solution to the YouTube-VIS Challenge at CVPR 2021 链接:https://arxiv.org/abs/2106.10452 摘要:本文提出了一种新的视频实例分割方法,即在视频中自动生成实例级的分割模板和对象类,并对其进行跟踪。我们的方法使用掩模选择网络(MSN)在线地从分割和传播分支改进掩模,从而限制了掩模跟踪过程中的噪声积累。提出了一种基于分片卷积神经网络的MSN设计方法。该网络能够区分掩模之间非常细微的差异,并从关联的掩模中准确地选择更好的掩模。此外,我们利用时间一致性,以正向和反向的方式处理视频序列作为后处理步骤来恢复丢失的对象。该方法适用于任何视频对象分割方法。我们的方法在2021年YouTube VIS挑战赛中获得49.1分的mAP,在30多个全球团队中排名第三。我们的代码将在https://github.com/SHI-Labs/Mask-Selection-Networks. 摘要:In this work we present a novel solution for Video Instance Segmentation(VIS), that is automatically generating instance level segmentation masks along with object class and tracking them in a video. Our method improves the masks from segmentation and propagation branches in an online manner using the Mask Selection Network (MSN) hence limiting the noise accumulation during mask tracking. We propose an effective design of MSN by using patch-based convolutional neural network. The network is able to distinguish between very subtle differences between the masks and choose the better masks out of the associated masks accurately. Further, we make use of temporal consistency and process the video sequences in both forward and reverse manner as a post processing step to recover lost objects. The proposed method can be used to adapt any video object segmentation method for the task of VIS. Our method achieves a score of 49.1 mAP on 2021 YouTube-VIS Challenge and was ranked third place among more than 30 global teams. Our code will be available at https://github.com/SHI-Labs/Mask-Selection-Networks.
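以下是掩码选择网络(MSN)思想的一个极简草图(网络结构与通道数为示例假设,并非原文实现):将图像块与两个候选掩码块在通道维拼接,由小型CNN打分以判断哪个掩码更好:

```python
import torch
import torch.nn as nn

class MaskSelectionNet(nn.Module):
    """输入图像块与两个候选掩码块的通道拼接, 输出两个候选各自"更优"的得分."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 2),
        )

    def forward(self, img_patch, mask_a, mask_b):
        x = torch.cat([img_patch, mask_a, mask_b], dim=1)   # (B, 5, H, W)
        return self.net(x)                                  # 取argmax作为被选中的掩码
```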

【10】 Towards Single Stage Weakly Supervised Semantic Segmentation 标题:面向单阶段弱监督语义分割

作者:Peri Akiva,Kristin Dana 机构:Rutgers University 链接:https://arxiv.org/abs/2106.10309 摘要:获取语义分割标签的昂贵过程推动了对弱监督语义分割(WSSS)方法的研究,该方法仅使用图像级、点或盒标签。由于缺乏密集的场景表示,通常需要通过多个阶段的训练和细化来增加复杂度以获得关于场景的附加语义信息。当前最先进的(SOTA)模型利用图像级标签来生成类激活映射(cam),这些类激活映射经过多个细化阶段,然后再对它们进行阈值化,从而生成用于监控的伪掩码。多阶段方法的计算成本很高,而且cam生成依赖于图像级标签,对于更复杂的场景缺乏通用性。相反,我们的方法提供了一个可推广到任意数据集的单阶段方法,即可以从头开始训练,不依赖于预先训练的主干、分类或单独的细化任务。我们利用点注释通过细化和过滤的特征生成可靠的动态伪掩模。虽然我们的方法需要的点注释只比图像级注释稍微贵一点,但我们要证明SOTA在基准数据集(PascalVOC 2012)上的性能,以及在最近的真实数据集(CRAID、CityPersons、IAD)上显著优于其他SOTA WSSS方法。 摘要:The costly process of obtaining semantic segmentation labels has driven research towards weakly supervised semantic segmentation (WSSS) methods, using only image-level, point, or box labels. The lack of dense scene representation requires methods to increase complexity to obtain additional semantic information about the scene, often done through multiple stages of training and refinement. Current state-of-the-art (SOTA) models leverage image-level labels to produce class activation maps (CAMs) which go through multiple stages of refinement before they are thresholded to make pseudo-masks for supervision. The multi-stage approach is computationally expensive, and dependency on image-level labels for CAMs generation lacks generalizability to more complex scenes. In contrary, our method offers a single-stage approach generalizable to arbitrary dataset, that is trainable from scratch, without any dependency on pre-trained backbones, classification, or separate refinement tasks. We utilize point annotations to generate reliable, on-the-fly pseudo-masks through refined and filtered features. While our method requires point annotations that are only slightly more expensive than image-level annotations, we are to demonstrate SOTA performance on benchmark datasets (PascalVOC 2012), as well as significantly outperform other SOTA WSSS methods on recent real-world datasets (CRAID, CityPersons, IAD).

Zero/Few Shot|迁移|域适配|自适应(4篇)

【1】 CUDA-GR: Controllable Unsupervised Domain Adaptation for Gaze Redirection 标题:CUDA-GR:面向凝视重定向的可控无监督域自适应

作者:Swati Jindal,Xin Eric Wang 机构:Department of Computer Science and Engineering, UC Santa Cruz 链接:https://arxiv.org/abs/2106.10852 摘要:视线重定向的目的是将图像中的视线操纵到所需的方向。然而,现有的方法在生成感知合理的图像方面存在不足。生成性对抗网络的发展在生成照片真实感图像方面取得了很好的效果。不过,它们仍然缺乏对不同图像属性提供更精细控制的能力。为了实现这种微调控制,需要为训练数据获取地面真值注释,这可能非常昂贵。在本文中,我们提出了一个无监督的领域适应框架,称为CUDA-GR,它学习从标记的源域中分离注视表征,并将它们转移到一个未标记的目标域。我们的方法可以在保持人的外观信息的同时对注视方向进行细粒度控制。结果表明,在目标域生成的图像标签对能够有效地进行知识转移,并能提高下游任务的性能。在基准数据集上的大量实验表明,该方法在定量和定性评价方面均优于现有的方法。 摘要:The aim of gaze redirection is to manipulate the gaze in an image to the desired direction. However, existing methods are inadequate in generating perceptually reasonable images. Advancement in generative adversarial networks has shown excellent results in generating photo-realistic images. Though, they still lack the ability to provide finer control over different image attributes. To enable such fine-tuned control, one needs to obtain ground truth annotations for the training data which can be very expensive. In this paper, we propose an unsupervised domain adaptation framework, called CUDA-GR, that learns to disentangle gaze representations from the labeled source domain and transfers them to an unlabeled target domain. Our method enables fine-grained control over gaze directions while preserving the appearance information of the person. We show that the generated image-labels pairs in the target domain are effective in knowledge transfer and can boost the performance of the downstream tasks. Extensive experiments on the benchmarking datasets show that the proposed method can outperform state-of-the-art techniques in both quantitative and qualitative evaluation.

【2】 Trainable Class Prototypes for Few-Shot Learning 标题:用于Few-Shot学习的可训练类原型

作者:Jianyi Li,Guizhong Liu 机构:School of Information and Communications, Xi’an Jiaotong University, Xi’an, P.R. China 备注:8 pages, 2 figures,and 3 Tables. arXiv admin note: substantial text overlap with arXiv:2008.09942 链接:https://arxiv.org/abs/2106.10846 摘要:度量学习是一种广泛应用的Few-Shot学习方法,其中原型的质量是算法的关键。本文在元训练和任务训练框架下,提出了可训练的距离测量原型,而不是人工的距离测量原型。同时为了避免幕式元训练带来的弊端,我们采用了基于自监督学习的非幕式元训练。总体而言,我们分两个阶段来解决少数镜头任务:通过自监督学习对可转移特征提取器进行元训练和对原型进行度量分类训练。此外,元训练和任务训练都采用了简单的注意机制。我们的方法在标准的Few-Shot视觉分类数据集上实现了各种已建立的Few-Shot任务的最新性能,与现有的无监督Few-Shot学习方法相比提高了约20%。 摘要:Metric learning is a widely used method for few shot learning in which the quality of prototypes plays a key role in the algorithm. In this paper we propose the trainable prototypes for distance measure instead of the artificial ones within the meta-training and task-training framework. Also to avoid the disadvantages that the episodic meta-training brought, we adopt non-episodic meta-training based on self-supervised learning. Overall we solve the few-shot tasks in two phases: meta-training a transferable feature extractor via self-supervised learning and training the prototypes for metric classification. In addition, the simple attention mechanism is used in both meta-training and task-training. Our method achieves state-of-the-art performance in a variety of established few-shot tasks on the standard few-shot visual classification dataset, with about 20% increase compared to the available unsupervised few-shot learning methods.
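下面用几行代码示意"可训练类原型 + 余弦度量分类"的基本形式(缩放因子等均为假设,并非论文官方实现),可与任何已完成自监督元训练的特征提取器配合使用:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainablePrototypes(nn.Module):
    """以可训练参数作为类原型, 用缩放余弦相似度作为度量分类的logits."""
    def __init__(self, num_classes, feat_dim, scale=10.0):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, feats):                     # feats: (B, D), 来自已训练好的特征提取器
        f = F.normalize(feats, dim=1)
        p = F.normalize(self.prototypes, dim=1)
        return self.scale * f @ p.t()
```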

【3】 ToAlign: Task-oriented Alignment for Unsupervised Domain Adaptation 标题:ToAlign:面向任务的无监督领域自适应对齐

作者:Guoqiang Wei,Cuiling Lan,Wenjun Zeng,Zhibo Chen 机构: University of Science and Technology of China, Microsoft Research Asia 链接:https://arxiv.org/abs/2106.10812 摘要:无监督域自适应分类旨在提高未标记目标域的分类性能。为了缓解域转移的不利影响,许多方法将源域和目标域在特征空间中对齐。然而,通常将特征作为一个整体进行对齐,而没有明确地使领域对齐主动地服务于分类任务,从而导致次优解。为了更好地适应环境,应该调整哪些子特征尚不清楚。在本文中,我们提出了一个有效的面向任务的对齐(ToAlign)的无监督领域适应(UDA)。我们研究了哪些特征需要跨域对齐,并提出在分类任务本身产生的先验知识的指导下,通过特征分解和对齐,使域对齐主动服务于分类。特别地,我们在分类元知识的基础上,明确地将源域中的特征分解为任务相关/区别性特征和任务无关特征。在不同的域适配设置下,对各种基准(如Office Home、Visda-2017和DomainNet)进行了大量的实验,结果证明了ToAlign的有效性,这有助于实现最先进的性能。 摘要:Unsupervised domain adaptive classification intends to improve theclassification performance on unlabeled target domain. To alleviate the adverse effect of domain shift, many approaches align the source and target domains in the feature space. However, a feature is usually taken as a whole for alignment without explicitly making domain alignment proactively serve the classification task, leading to sub-optimal solution. What sub-feature should be aligned for better adaptation is under-explored. In this paper, we propose an effective Task-oriented Alignment (ToAlign) for unsupervised domain adaptation (UDA). We study what features should be aligned across domains and propose to make the domain alignment proactively serve classification by performing feature decomposition and alignment under the guidance of the prior knowledge induced from the classification taskitself. Particularly, we explicitly decompose a feature in the source domain intoa task-related/discriminative feature that should be aligned, and a task-irrelevant feature that should be avoided/ignored, based on the classification meta-knowledge. Extensive experimental results on various benchmarks (e.g., Office-Home, Visda-2017, and DomainNet) under different domain adaptation settings demonstrate theeffectiveness of ToAlign which helps achieve the state-of-the-art performance.

【4】 Task Attended Meta-Learning for Few-Shot Learning 标题:面向少样本学习的任务参与元学习

作者:Aroof Aimen,Sahil Sidheekh,Narayanan C. Krishnan 机构:Indian Institute of Technology, Ropar 链接:https://arxiv.org/abs/2106.10642 摘要:元学习(Meta-learning,ML)是在资源受限条件下(如Few-Shot学习)学习模型的一个很有前途的发展方向。目前流行的ML方法要么学习一个可推广的初始模型,要么通过幕式训练学习一个通用的参数优化器。前一种方法利用一批任务的知识来学习最优先验知识。在这项工作中,我们研究了批处理对ML的重要性。具体来说,我们首先引入了批处理-幕式训练方案来改进泛型参数优化器的学习。我们还假设,在间歇式训练中,一个批次中的每个任务对学习最优元模型的贡献相等的共同假设不一定是真的。我们建议根据任务在元模型学习中的“重要性”对任务进行分批加权。为此,我们引入了一种以人类选择性聚焦为动机的训练课程,称为任务参与元训练。任务注意是一个独立的模块,可以与任何间歇式训练方案相结合。通过在minimagenet和tieredImageNet等复杂数据集上与非任务参与模型的比较,验证了该方法的有效性。 摘要:Meta-learning (ML) has emerged as a promising direction in learning models under constrained resource settings like few-shot learning. The popular approaches for ML either learn a generalizable initial model or a generic parametric optimizer through episodic training. The former approaches leverage the knowledge from a batch of tasks to learn an optimal prior. In this work, we study the importance of a batch for ML. Specifically, we first incorporate a batch episodic training regimen to improve the learning of the generic parametric optimizer. We also hypothesize that the common assumption in batch episodic training that each task in a batch has an equal contribution to learning an optimal meta-model need not be true. We propose to weight the tasks in a batch according to their "importance" in improving the meta-model's learning. To this end, we introduce a training curriculum motivated by selective focus in humans, called task attended meta-training, to weight the tasks in a batch. Task attention is a standalone module that can be integrated with any batch episodic training regimen. The comparisons of the models with their non-task-attended counterparts on complex datasets like miniImageNet and tieredImageNet validate its effectiveness.

半弱无监督|主动学习|不确定性(9篇)

【1】 Simple Distillation Baselines for Improving Small Self-supervised Models 标题:改进小型自监督模型的简单蒸馏基线

作者:Jindong Gu,Wei Liu,Yonglong Tian 机构:University of Munich, Tencent, MIT 链接:https://arxiv.org/abs/2106.11304 摘要:虽然大型自监督模型的性能已经与它们的监督模型相当,但小型模型仍然很困难。在这份报告中,我们探讨了通过蒸馏改进小型自监督模型的简单基线,称为SimDis。具体来说,我们提出了一个离线蒸馏基线,它建立了一个新的国家的最先进的,和一个在线蒸馏基线,实现了类似的性能与最小的计算开销。希望这些基线能为今后的相关研究提供有益的经验。代码位于:https://github.com/JindongGu/SimDis/ 摘要:While large self-supervised models have rivalled the performance of their supervised counterparts, small models still struggle. In this report, we explore simple baselines for improving small self-supervised models via distillation, called SimDis. Specifically, we present an offline-distillation baseline, which establishes a new state-of-the-art, and an online-distillation baseline, which achieves similar performance with minimal computational overhead. We hope these baselines will provide useful experience for relevant future research. Code is available at: https://github.com/JindongGu/SimDis/
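以下给出在线蒸馏基线的一个简化损失草图(投影头、动量更新等细节从略,均为假设),仅示意"让小模型表征逼近大模型表征"这一思路:

```python
import torch.nn.functional as F

def online_distill_loss(student_emb, teacher_emb):
    """让小模型(学生)的表征逼近大模型(教师)的表征, 教师端停止梯度."""
    s = F.normalize(student_emb, dim=1)
    t = F.normalize(teacher_emb.detach(), dim=1)
    return (2.0 - 2.0 * (s * t).sum(dim=1)).mean()   # 等价于归一化向量间的MSE
```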

【2】 Visual Probing: Cognitive Framework for Explaining Self-Supervised Image Representations 标题:视觉探测:解释自我监督图像表征的认知框架

作者:Witold Oleszkiewicz,Dominika Basaj,Igor Sieradzki,Michał Górszczak,Barbara Rychalska,Koryna Lewandowska,Tomasz Trzciński,Bartosz Zieliński 机构: Lewandowska is with Department of Cognitive Neuroscience and Neu-roergonomics 链接:https://arxiv.org/abs/2106.11054 摘要:最近引入的图像表示学习的自监督方法与完全监督的竞争对手相比提供了相当或更好的结果,然而相应的解释自监督方法的努力却相对滞后。基于这一观察,我们引入了一个新的视觉探测框架,通过利用自然语言处理中的探测任务来解释自监督模型。探测任务需要了解图像部分之间的语义关系。因此,我们提出了一个系统的方法来获得视觉中自然语言的类似物,例如视觉词汇、上下文和分类法。我们的建议是基于马尔的视觉计算理论和关注的特点,如纹理,形状和线条。我们在解释自我监督表征时展示了这些类比的有效性和适用性。我们的主要发现强调语言和视觉之间的关系可以作为一个有效而直观的工具来发现机器学习模型是如何工作的,独立于数据模式。我们的工作开启了一条通向更可解释和透明人工智能的研究道路。 摘要:Recently introduced self-supervised methods for image representation learning provide on par or superior results to their fully supervised competitors, yet the corresponding efforts to explain the self-supervised approaches lag behind. Motivated by this observation, we introduce a novel visual probing framework for explaining the self-supervised models by leveraging probing tasks employed previously in natural language processing. The probing tasks require knowledge about semantic relationships between image parts. Hence, we propose a systematic approach to obtain analogs of natural language in vision, such as visual words, context, and taxonomy. Our proposal is grounded in Marr's computational theory of vision and concerns features like textures, shapes, and lines. We show the effectiveness and applicability of those analogs in the context of explaining self-supervised representations. Our key findings emphasize that relations between language and vision can serve as an effective yet intuitive tool for discovering how machine learning models work, independently of data modality. Our work opens a plethora of research pathways towards more explainable and transparent AI.

【3】 Unsupervised Deep Learning by Injecting Low-Rank and Sparse Priors 标题:注入低秩稀疏先验的无监督深度学习

作者:Tomoya Sakai 机构:School of Information and Data Sciences, Nagasaki University, Nagasaki, Japan 链接:https://arxiv.org/abs/2106.10923 摘要:如果深层神经网络可以从稀疏诱导先验中学习呢?当网络由层模块(CNN、RNN等)组合而成时,除了带注释的训练数据集外,工程师很少利用归纳偏差,即现有的已知规则或先验知识。我们专注于在深度学习中使用稀疏诱导先验来鼓励网络以无监督的方式简洁地捕捉高维数据的本质。为了使用不可微稀疏诱导范数作为损失函数,我们将其近端映射插入到自动微分框架中。我们证明了无监督学习的U网络背景减法使用低秩和稀疏的先验知识。该U-Net可以在训练序列中学习运动目标而无需任何标注,并能成功地检测出测试序列中的前景目标。 摘要:What if deep neural networks can learn from sparsity-inducing priors? When the networks are designed by combining layer modules (CNN, RNN, etc), engineers less exploit the inductive bias, i.e., existing well-known rules or prior knowledge, other than annotated training data sets. We focus on employing sparsity-inducing priors in deep learning to encourage the network to concisely capture the nature of high-dimensional data in an unsupervised way. In order to use non-differentiable sparsity-inducing norms as loss functions, we plug their proximal mappings into the automatic differentiation framework. We demonstrate unsupervised learning of U-Net for background subtraction using low-rank and sparse priors. The U-Net can learn moving objects in a training sequence without any annotation, and successfully detect the foreground objects in test sequences.
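下面是一个简化草图,示意如何将低秩与稀疏先验写成可自动微分的损失来无监督地训练背景估计网络;注意论文采用的是把不可微的稀疏诱导范数的近端映射插入自动微分框架,此处直接以核范数与L1范数作为损失,仅是便于说明的近似写法:

```python
import torch

def low_rank_sparse_loss(frames, background, lam=0.1):
    """frames: (T, H*W) 逐帧展平后按行堆叠; background: 网络输出的背景估计(同形状).
    核范数鼓励背景矩阵低秩, L1范数鼓励前景残差稀疏."""
    nuclear = torch.linalg.svdvals(background).sum()   # 低秩先验
    sparse = (frames - background).abs().sum()         # 稀疏先验(前景)
    return nuclear + lam * sparse
```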

【4】 Crop-Transform-Paste: Self-Supervised Learning for Visual Tracking 标题:裁剪-变换-粘贴:视觉跟踪的自监督学习

作者:Xin Li,Wenjie Pei,Zikun Zhou,Zhenyu He,Huchuan Lu,Ming-Hsuan Yang 机构:Peng Cheng Laboratory, Harbin Institute of Technology, Shenzhen, Dalian University of Technology, University of California, Merced, Google Research 备注:10 pages, 5 figures 链接:https://arxiv.org/abs/2106.10900 摘要:虽然基于深度学习的视觉跟踪方法已经取得了实质性的进展,但这些方法需要大规模和高质量的注释数据来进行充分的训练。为了消除昂贵且穷尽的注释,我们研究了用于视觉跟踪的自监督学习。在这项工作中,我们开发了裁剪变换粘贴操作,通过模拟跟踪过程中的各种场景变化,包括物体的外观变化和背景变化,能够合成足够的训练数据。由于目标状态在所有合成数据中都是已知的,因此现有的深度跟踪器可以通过常规方式进行训练,而不需要人工标注。与典型的自监督学习方法不同,本文提出的自监督学习机制可以无缝集成到现有的跟踪框架中进行训练。大量实验表明,该方法1)在少数镜头跟踪场景中取得了比有监督学习更好的性能;2) 可以处理各种跟踪挑战,如物体变形,遮挡,或由于其设计背景杂波;3) 可以结合监督学习进一步提高性能,特别是在少数镜头跟踪场景中有效。 摘要:While deep-learning based methods for visual tracking have achieved substantial progress, these schemes entail large-scale and high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised learning for visual tracking. In this work, we develop the Crop-Transform-Paste operation, which is able to synthesize sufficient training data by simulating various kinds of scene variations during tracking, including appearance variations of objects and background changes. Since the object state is known in all synthesized data, existing deep trackers can be trained in routine ways without human annotation. Different from typical self-supervised learning methods performing visual representation learning as an individual step, the proposed self-supervised learning mechanism can be seamlessly integrated into any existing tracking framework to perform training. Extensive experiments show that our method 1) achieves favorable performance than supervised learning in few-shot tracking scenarios; 2) can deal with various tracking challenges such as object deformation, occlusion, or background clutter due to its design; 3) can be combined with supervised learning to further boost the performance, particularly effective in few-shot tracking scenarios.
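以下用numpy给出Crop-Transform-Paste数据合成思路的一个示意(变换仅以简单的尺度/平移为例,并假设背景图大于目标块;论文中对外观变化与背景变化的模拟更为丰富):

```python
import numpy as np

def crop_transform_paste(template_img, target_box, background_img, rng=np.random):
    """从模板帧裁剪目标(Crop), 做简单变换(Transform), 再粘贴到新背景(Paste);
    由于粘贴位置已知, 可直接得到伪标注框用于训练, 无需人工标注."""
    x, y, w, h = target_box
    patch = template_img[y:y + h, x:x + w].copy()            # Crop
    scale = rng.uniform(0.6, 1.0)                            # Transform: 此处仅以裁剪模拟尺度变化
    new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
    patch = patch[:new_h, :new_w]
    H, W = background_img.shape[:2]
    nx = rng.randint(0, max(1, W - new_w))                   # Paste 到随机位置
    ny = rng.randint(0, max(1, H - new_h))
    out = background_img.copy()
    out[ny:ny + new_h, nx:nx + new_w] = patch
    return out, (nx, ny, new_w, new_h)                       # 合成帧与对应的伪真值框
```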

【5】 Active Learning for Deep Neural Networks on Edge Devices 标题:边缘设备上深度神经网络的主动学习

作者:Yuya Senzaki,Christian Hamelain 机构:Idein Inc., Tokyo, Japan 链接:https://arxiv.org/abs/2106.10836 摘要:在边缘器件的深度神经网络(DNN)应用中,模型的不断更新是非常重要的。尽管用实际输入数据更新模型是理想的,但是由于标签和通信成本等限制,使用所有这些数据并不总是可行的。因此,有必要过滤和选择用于设备上的训练(即,主动学习)的数据。本文将一个实用的基于边缘设备的DNNs主动学习问题形式化,提出了一个通用的任务无关框架来解决这个问题,并将其归结为一个流子模最大化问题。这个框架足够轻,可以用较低的计算资源运行,但是由于子模块的特性,它提供的解决方案的质量在理论上是有保证的。通过这个框架,我们可以灵活地配置数据选择标准,包括使用以前主动学习研究中提出的方法。我们评估了我们的方法在两个分类和目标检测任务在实际环境中模拟现实生活中的情况。研究结果表明,该框架在两种任务上都优于其他方法,同时在实际设备上运行速度也很快。 摘要:When dealing with deep neural network (DNN) applications on edge devices, continuously updating the model is important. Although updating a model with real incoming data is ideal, using all of them is not always feasible due to limits, such as labeling and communication costs. Thus, it is necessary to filter and select the data to use for training (i.e., active learning) on the device. In this paper, we formalize a practical active learning problem for DNNs on edge devices and propose a general task-agnostic framework to tackle this problem, which reduces it to a stream submodular maximization. This framework is light enough to be run with low computational resources, yet provides solutions whose quality is theoretically guaranteed thanks to the submodular property. Through this framework, we can configure data selection criteria flexibly, including using methods proposed in previous active learning studies. We evaluate our approach on both classification and object detection tasks in a practical setting to simulate a real-life scenario. The results of our study show that the proposed framework outperforms all other methods in both tasks, while running at a practical speed on real devices.
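下面用一个阈值式的流式贪心草图示意"将设备端数据筛选归结为流式子模最大化"的做法;增益函数与阈值均为示例假设,实际算法会依据子模性质设定阈值以获得理论保证:

```python
def stream_select(stream, budget, gain_fn, threshold):
    """流式子模最大化的阈值式贪心: 对依次到达的样本, 边际增益超过阈值且预算未满则保留."""
    selected = []
    for x in stream:
        if len(selected) >= budget:
            break
        if gain_fn(x, selected) >= threshold:
            selected.append(x)
    return selected

# gain_fn(x, selected) 例如可取 x 的特征与已选样本特征的最小距离, 以鼓励所选样本的多样性
```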

【6】 Two-Stream Consensus Network: Submission to HACS Challenge 2021 Weakly-Supervised Learning Track 标题:双流共识网络:HACS挑战赛2021弱监督学习赛道参赛方案

作者:Yuanhao Zhai,Le Wang,David Doermann,Junsong Yuan 机构:Xi’an Jiaotong University, State University of New York at Buffalo 备注:Second place solution to the HACS Weakly-Supervised Temporal Action Localization Challenge 2021. arXiv admin note: text overlap with arXiv:2010.11594 链接:https://arxiv.org/abs/2106.10829 摘要:本技术报告介绍了我们对HACS时间动作定位挑战2021的解决方案,弱监督学习轨迹。弱监督时间动作定位的目标是在给定视频级别标签的情况下,对未剪辑视频中感兴趣的动作进行时间定位和分类。我们采用双流共识网络(TSCN)作为挑战的主要框架。TSCN由一个双流基模型训练过程和一个伪地面真值学习过程组成。基础模型训练鼓励模型预测基于单一模态(即RGB或光流)的可靠预测,基于融合生成伪地面真值,并反过来用作基础模型训练的监督。在hacsv1.1.1数据集上,在没有对特征提取I3D模型进行微调的情况下,该方法在验证集上的平均mAP值达到22.20%,在测试集上的平均mAP值达到21.68%。我们的解决方案在这个挑战中排名第二,我们希望我们的方法可以作为未来学术研究的基线。 摘要:This technical report presents our solution to the HACS Temporal Action Localization Challenge 2021, Weakly-Supervised Learning Track. The goal of weakly-supervised temporal action localization is to temporally locate and classify action of interest in untrimmed videos given only video-level labels. We adopt the two-stream consensus network (TSCN) as the main framework in this challenge. The TSCN consists of a two-stream base model training procedure and a pseudo ground truth learning procedure. The base model training encourages the model to predict reliable predictions based on single modality (i.e., RGB or optical flow), based on the fusion of which a pseudo ground truth is generated and in turn used as supervision to train the base models. On the HACS v1.1.1 dataset, without fine-tuning the feature-extraction I3D models, our method achieves 22.20% on the validation set and 21.68% on the testing set in terms of average mAP. Our solution ranked the 2rd in this challenge, and we hope our method can serve as a baseline for future academic research.

【7】 Tag, Copy or Predict: A Unified Weakly-Supervised Learning Framework for Visual Information Extraction using Sequences 标题:标签、复制或预测:使用序列进行视觉信息提取的统一弱监督学习框架

作者:Jiapeng Wang,Tianwei Wang,Guozhi Tang,Lianwen Jin,Weihong Ma,Kai Ding,Yichao Huang 机构:School of Electronic and Information Engineering, South China University of Technology, China, IntSig Information Co., Ltd, Shanghai, China, Guangdong Artificial Intelligence and Digital Economy Laboratory (Pazhou Lab), Guangzhou, China 备注:IJCAI2021 链接:https://arxiv.org/abs/2106.10681 摘要:视觉信息抽取(VIE)近年来受到越来越多的关注。现有的方法通常是先将光学字符识别(OCR)结果组织成纯文本,然后利用标记级实体标注作为监督来训练序列标注模型。然而,它耗费了大量的注释成本,并且可能会导致标签混淆,OCR错误也会严重影响最终的性能。本文提出了一个统一的弱监督学习框架TCPN(Tag,Copy或Predict Network),该框架引入了一个高效的编码器来同时对二维OCR结果中的语义和布局信息进行建模;2) 一种仅利用关键信息序列作为监督的弱监督训练策略;3)一种灵活可切换的解码器,包含两种推理模式:一种(复制或预测模式)是通过在每个时间步从输入或预测中复制一个令牌来输出不同类别的关键信息序列;另一种(标记模式)是在单个前向传递中直接标记输入序列。我们的方法在几个公共基准上显示了最新的性能,这充分证明了它的有效性。 摘要:Visual information extraction (VIE) has attracted increasing attention in recent years. The existing methods usually first organized optical character recognition (OCR) results into plain texts and then utilized token-level entity annotations as supervision to train a sequence tagging model. However, it expends great annotation costs and may be exposed to label confusion, and the OCR errors will also significantly affect the final performance. In this paper, we propose a unified weakly-supervised learning framework called TCPN (Tag, Copy or Predict Network), which introduces 1) an efficient encoder to simultaneously model the semantic and layout information in 2D OCR results; 2) a weakly-supervised training strategy that utilizes only key information sequences as supervision; and 3) a flexible and switchable decoder which contains two inference modes: one (Copy or Predict Mode) is to output key information sequences of different categories by copying a token from the input or predicting one in each time step, and the other (Tag Mode) is to directly tag the input sequence in a single forward pass. Our method shows new state-of-the-art performance on several public benchmarks, which fully proves its effectiveness.

【8】 Exploring Visual Context for Weakly Supervised Person Search 标题:弱监督人员搜索的视觉情境探索

作者:Yichao Yan,Jinpeng Li,Shengcai Liao,Jie Qin,Bingbing Ni,Xiaokang Yang,Ling Shao 机构: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China, Inception Institute of Artificial Intelligence (IIAI), UAE 链接:https://arxiv.org/abs/2106.10506 摘要:人员搜索是一项具有挑战性的任务,它将行人检测和行人重识别结合起来。现有的方法遵循完全监督的设置,即边界框和身份标注都可用。然而,身份标注是劳动密集型的,限制了现有框架的实用性和可扩展性。本文创造性地研究了仅带边界框标注的弱监督人员搜索问题。我们提出了第一个解决这一新任务的框架,即上下文引导的人员搜索(CGPS),通过研究无约束自然图像中三个层次的上下文线索(即检测、记忆和场景)来完成。前两者用于提升局部和全局判别能力,后者用于提高聚类精度。尽管设计简单,我们的CGPS在CUHK-SYSU数据集上将基线模型的mAP提升了8.3%。令人惊讶的是,它甚至达到了与两步式人员搜索模型相当的性能,同时表现出更高的效率。我们的代码在https://github.com/ljpadam/CGPS. 摘要:Person search has recently emerged as a challenging task that jointly addresses pedestrian detection and person re-identification. Existing approaches follow a fully supervised setting where both bounding box and identity annotations are available. However, annotating identities is labor-intensive, limiting the practicability and scalability of current frameworks. This paper inventively considers weakly supervised person search with only bounding box annotations. We proposed the first framework to address this novel task, namely Context-Guided Person Search (CGPS), by investigating three levels of context clues (i.e., detection, memory and scene) in unconstrained natural images. The first two are employed to promote local and global discriminative capabilities, while the latter enhances clustering accuracy. Despite its simple design, our CGPS boosts the baseline model by 8.3% in mAP on CUHK-SYSU. Surprisingly, it even achieves comparable performance to two-step person search models, while displaying higher efficiency. Our code is available at https://github.com/ljpadam/CGPS.

【9】 Estimating MRI Image Quality via Image Reconstruction Uncertainty 标题:基于图像重建不确定性的MRI图像质量评价

作者:Richard Shaw,Carole H. Sudre,Sebastien Ourselin,M. Jorge Cardoso 机构:Cardoso, Dept. Medical Physics & Biomedical Engineering, University College London, UK, School of Biomedical Engineering & Imaging Sciences, King’s College London, UK, Dementia Research Centre, Institute of Neurology, University College London, UK 链接:https://arxiv.org/abs/2106.10992 摘要:医学图像分析中的质量控制(QC)是一项费时费力的工作,导致人们对自动化方法越来越感兴趣。然而,被认为适合算法处理的质量可能不同于人类感知的视觉质量度量。在这项工作中,我们提出了磁共振图像质量评估从图像重建的角度。我们使用异方差不确定性模型训练贝叶斯神经网络,从噪声数据中恢复干净的图像,提供预测的不确定性度量。该框架使我们能够将数据损坏分为可学习和不可学习两部分,并将预测不确定性解释为对图像可恢复性的估计。因此,我们认为视觉评估的质量控制不能等同于算法处理的质量控制。我们在一个多任务实验中验证了这一说法,该实验将人工制品恢复、不确定性预测和灰质分割相结合。认识到视觉质量和算法质量之间的区别会产生影响,根据下游任务的不同,仅基于“视觉质量”的原因就可以排除较少的数据。 摘要:Quality control (QC) in medical image analysis is time-consuming and laborious, leading to increased interest in automated methods. However, what is deemed suitable quality for algorithmic processing may be different from human-perceived measures of visual quality. In this work, we pose MR image quality assessment from an image reconstruction perspective. We train Bayesian CNNs using a heteroscedastic uncertainty model to recover clean images from noisy data, providing measures of uncertainty over the predictions. This framework enables us to divide data corruption into learnable and non-learnable components and leads us to interpret the predictive uncertainty as an estimation of the achievable recovery of an image. Thus, we argue that quality control for visual assessment cannot be equated to quality control for algorithmic processing. We validate this statement in a multi-task experiment combining artefact recovery with uncertainty prediction and grey matter segmentation. Recognising this distinction between visual and algorithmic quality has the impact that, depending on the downstream task, less data can be excluded based on ``visual quality" reasons alone.
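异方差不确定性模型的核心是让网络同时输出重建结果与逐像素(对数)方差,并最小化如下形式的高斯负对数似然;以下为该损失的一个常见写法(非论文的完整实现):

```python
import torch

def heteroscedastic_loss(pred_mean, pred_log_var, target):
    """网络同时输出重建图像与逐像素对数方差时的负对数似然(高斯假设)."""
    precision = torch.exp(-pred_log_var)
    return (0.5 * precision * (pred_mean - target) ** 2 + 0.5 * pred_log_var).mean()
```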

时序|行为识别|姿态|视频|运动估计(12篇)

【1】 Towards Long-Form Video Understanding 标题:走向长篇视频理解

作者:Chao-Yuan Wu,Philipp Krähenbühl 机构:Philipp Kr¨ahenb¨uhl, The University of Texas at Austin 备注:CVPR 2021 链接:https://arxiv.org/abs/2106.11310 摘要:我们的世界提供了永无止境的视觉刺激流,然而今天的视觉系统只能在几秒钟内准确地识别模式。这些系统能够理解现在,但无法将它与过去或未来的事件联系起来。本文研究长格式视频理解。我们介绍了一个长格式视频建模框架,并在大规模数据集上开发了评估协议。我们表明,现有的最先进的短期模型是有限的长期形式的任务。一种新颖的基于对象中心变换器的视频识别体系结构在7种不同的任务上表现得更好。在AVA数据集上,它的性能也优于可比的最新技术。 摘要:Our world offers a never-ending stream of visual stimuli, yet today's vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art on the AVA dataset.

【2】 Understanding Object Dynamics for Interactive Image-to-Video Synthesis 标题:理解交互式图像到视频合成中的对象动力学

作者:Andreas Blattmann,Timo Milbich,Michael Dorkenwald,Björn Ommer 机构:Bj¨orn Ommer, Interdisciplinary Center for Scientific Computing, HCI, Heidelberg University, Germany 备注:CVPR 2021, project page available at this https URL 链接:https://arxiv.org/abs/2106.11303 摘要:局部戳一个静态场景会有什么效果?我们提出了一种方法,学习自然寻找全局关节造成的局部操作在像素级。训练只需要运动物体的视频,而不需要物理场景的基本操作信息。我们的生成模型学习推断自然物体动力学作为对用户交互的响应,并学习不同物体身体区域之间的相互关系。给定一个物体的静态图像和一个像素的局部戳,然后该方法预测物体将如何随时间变形。与现有的视频预测工作相比,我们不合成任意真实感的视频,而是对变形进行局部交互控制。我们的模型不局限于特定的对象类别,可以将动力学传递到新的看不见的对象实例上。在不同对象上的大量实验表明,与普通的视频预测框架相比,我们的方法是有效的。项目页面位于https://bit.ly/3cxfA2L . 摘要:What would be the effect of locally poking a static scene? We present an approach that learns naturally-looking global articulations caused by a local manipulation at a pixel level. Training requires only videos of moving objects but no information of the underlying manipulation of the physical scene. Our generative model learns to infer natural object dynamics as a response to user interaction and learns about the interrelations between different object body regions. Given a static image of an object and a local poking of a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks. Project page is available at https://bit.ly/3cxfA2L .

【3】 TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? 标题:TokenLearner:8个学习过的令牌对图片和视频有什么作用?

作者:Michael S. Ryoo,AJ Piergiovanni,Anurag Arnab,Mostafa Dehghani,Anelia Angelova 机构:Google Research, Stony Brook University 链接:https://arxiv.org/abs/2106.11297 摘要:本文介绍了一种新的视觉表征学习方法,它依赖于少量自适应学习得到的标记(token),适用于图像和视频理解任务。我们的方法不依赖手工设计的划分策略来获取视觉标记,也不需要对大量密集采样的图像块做注意力计算,而是学习在视觉数据中挖掘重要的标记。这样既能高效而有效地找到少量重要的视觉标记,又能够在这些标记之间建模成对注意力,覆盖视频中更长的时间范围或图像中的空间内容。我们的实验在图像和视频识别任务的多个具有挑战性的基准上展示了强大的性能。重要的是,由于我们的标记是自适应的,我们在显著减少计算量的情况下取得了有竞争力的结果。 摘要:In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens, over a longer temporal horizon for videos, or the spatial content in images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, due to our tokens being adaptive, we accomplish competitive results at significantly reduced compute amount.
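以下是"自适应学习少量标记"这一思想的一个极简PyTorch草图(注意力的生成与归一化方式为示例假设,并非官方TokenLearner实现):

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """由学习到的空间注意力图, 从特征图中自适应汇聚出少量(如8个)视觉标记."""
    def __init__(self, in_channels, num_tokens=8):
        super().__init__()
        self.attn = nn.Conv2d(in_channels, num_tokens, kernel_size=1)

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        a = torch.sigmoid(self.attn(x)).view(b, -1, h * w)    # (B, N, HW) 每个标记一张注意力图
        a = a / (a.sum(dim=-1, keepdim=True) + 1e-6)          # 归一化为空间加权
        tokens = torch.einsum("bnl,bcl->bnc", a, x.view(b, c, h * w))
        return tokens                                         # (B, N, C)
```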

【4】 Applying VertexShuffle Toward 360-Degree Video Super-Resolution on Focused-Icosahedral-Mesh 标题:在聚焦二十面体网格上应用VertexShuffle实现360度视频超分辨率

作者:Na Li,Yao Liu 机构:Department of Computer Science, Binghamton University 备注:This paper introduce a new mesh representation and a new upsampling method on a mesh 链接:https://arxiv.org/abs/2106.11253 摘要:随着360度图像/视频、增强现实(AR)和虚拟现实(VR)的出现,人们对球形信号的分析和处理的需求越来越大。然而,大量的研究工作都集中在从球面信号投影出来的平面信号上,导致了像素浪费、失真等问题。球形CNN的最新进展为直接分析球形信号提供了可能。然而,他们关注的是全网格,这使得在实际应用中由于对带宽的要求非常大而难以处理。为了解决360度视频流的带宽浪费问题和节省计算量,我们利用聚焦二十面体网格来表示一个小区域,并构造矩阵将球形内容旋转到聚焦网格区域。我们还提出了一种新的顶点洗牌操作,与UGSCNN中引入的MeshConv转置操作相比,该操作可以显著提高性能和效率。我们进一步将所提出的方法应用于超分辨率模型,这是第一个提出一个直接操作360度数据的球形像素网格表示的球形超分辨率模型。为了评估我们的模型,我们还收集了一组高分辨率的360度视频来生成一个球形图像数据集。我们的实验表明,与使用简单MeshConv转置操作的基线球面超分辨率模型相比,我们提出的球面超分辨率模型在性能和推理时间方面都取得了显著的优势。综上所述,我们的模型在360度输入上取得了很好的超分辨率性能,在对网格上的16x顶点进行超分辨率处理时,平均达到32.79db的PSNR。 摘要:With the emerging of 360-degree image/video, augmented reality (AR) and virtual reality (VR), the demand for analysing and processing spherical signals get tremendous increase. However, plenty of effort paid on planar signals that projected from spherical signals, which leading to some problems, e.g. waste of pixels, distortion. Recent advances in spherical CNN have opened up the possibility of directly analysing spherical signals. However, they pay attention to the full mesh which makes it infeasible to deal with situations in real-world application due to the extremely large bandwidth requirement. To address the bandwidth waste problem associated with 360-degree video streaming and save computation, we exploit Focused Icosahedral Mesh to represent a small area and construct matrices to rotate spherical content to the focused mesh area. We also proposed a novel VertexShuffle operation that can significantly improve both the performance and the efficiency compared to the original MeshConv Transpose operation introduced in UGSCNN. We further apply our proposed methods on super resolution model, which is the first to propose a spherical super-resolution model that directly operates on a mesh representation of spherical pixels of 360-degree data. To evaluate our model, we also collect a set of high-resolution 360-degree videos to generate a spherical image dataset. Our experiments indicate that our proposed spherical super-resolution model achieves significant benefits in terms of both performance and inference time compared to the baseline spherical super-resolution model that uses the simple MeshConv Transpose operation. In summary, our model achieves great super-resolution performance on 360-degree inputs, achieving 32.79 dB PSNR on average when super-resoluting 16x vertices on the mesh.

【5】 VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning 标题:VIMPAC:基于掩蔽标记预测和对比学习的视频预训练

作者:Hao Tan,Jie Lei,Thomas Wolf,Mohit Bansal 机构:UNC Chapel Hill, Huggingface 备注:Under review, 23 Pages 链接:https://arxiv.org/abs/2106.11250 摘要:视频理解依赖于感知全局内容并建模其内部联系(例如因果关系、运动和时空对应)。为了学习这些交互,我们在由VQ-VAE生成的离散化视频标记上应用"先掩码再预测"(mask-then-predict)的预训练任务。与文本标记彼此相对独立的语言不同,相邻的视频标记通常具有很强的相关性(例如,连续的视频帧通常看起来非常相似),因此均匀地掩蔽单个标记会使任务过于简单而无法学到有用的表示。为了解决这个问题,我们提出了一种分块掩码策略,在空间域和时间域中同时掩蔽相邻的视频标记。我们还加入了一种无需数据增强的对比学习方法,通过预测视频片段是否来自同一视频来进一步捕获全局内容。我们在未经人工整理的视频上对模型进行了预训练,并表明预训练模型可以在多个视频理解数据集(如SSV2、Diving48)上达到最先进的结果。最后,我们对模型的可扩展性和预训练方法的设计进行了详细分析。代码发布于https://github.com/airsplay/vimpac. 摘要:Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task on discretized video tokens generated via VQ-VAE. Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations. To deal with this issue, we propose a block-wise masking strategy where we mask neighboring video tokens in both spatial and temporal domains. We also add an augmentation-free contrastive learning method to further capture the global content by predicting whether the video clips are sampled from the same video. We pre-train our model on uncurated videos and show that our pre-trained model can reach state-of-the-art results on several video understanding datasets (e.g., SSV2, Diving48). Lastly, we provide detailed analyses on model scalability and pre-training method design. Code is released at https://github.com/airsplay/vimpac.
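下面用几行代码示意时空分块掩码策略:以块为单位对视频token网格随机掩蔽,避免因相邻token高度相关而使逐token掩码任务过于简单;块大小与掩码比例均为示例假设:

```python
import torch

def blockwise_mask(t, h, w, block=(2, 4, 4), mask_ratio=0.5):
    """对 (T,H,W) 的视频token网格做时空分块掩码, 返回True表示被掩的布尔掩码.
    假设 T/H/W 能被对应的块大小整除."""
    bt, bh, bw = block
    grid = torch.rand(t // bt, h // bh, w // bw) < mask_ratio        # 以块为单位采样
    mask = (grid.repeat_interleave(bt, 0)
                .repeat_interleave(bh, 1)
                .repeat_interleave(bw, 2))
    return mask
```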

【6】 CLIP2Video: Mastering Video-Text Retrieval via Image CLIP 标题:CLIP2Video:通过图像剪辑掌握视频文本检索

作者:Han Fang,Pengfei Xiong,Luhui Xu,Yu Chen 机构:PCG, Tencent 链接:https://arxiv.org/abs/2106.11097 摘要:提出clip2视频网络,将图像语言预训练模型以端到端的方式传输到视频文本检索中。视频和语言学习领域的主流方法试图从大规模视频文本数据集中提取视频的时空特征以及视频和语言之间的多模式交互。与之不同的是,我们利用预先训练好的图像语言模型,将其简化为两阶段的框架,分别对图像文本进行共同学习和增强视频帧与视频文本之间的时间关系,使其能够在较小的数据集上进行训练。具体地说,基于对比语言图像预训练(CLIP)模型捕获的空间语义,我们的模型包括一个时间差分块来捕获精细时间视频帧的运动,以及一个时间对齐块来重新对齐视频片段和短语的标记并增强多模态相关性。我们进行了深入的消融研究,并在主要的文本到视频和视频到文本检索基准上取得了最先进的性能,包括在MSR-VTT、MSVD和VATEX上的检索精度的新记录。 摘要:We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction between videos and languages from a large-scale video-text dataset. Different from them, we leverage pretrained image-language model, simplify it as a two-stage framework with co-learning of image-text and enhancing temporal relations between video frames and video-text respectively, make it able to train on comparatively small datasets. Specifically, based on the spatial semantics captured by Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.

【7】 Interventional Video Grounding with Dual Contrastive Learning 标题:基于双重对比学习的介入性视频寻根

作者:Guoshun Nan,Rui Qiao,Yao Xiao,Jun Liu,Sicong Leng,Hao Zhang,Wei Lu 机构: StatNLP Research Group, Singapore University of Technology and Design, Shanghai Jiao Tong University, China, Information Systems Technology and Design, Singapore University of Technology and Design, Singapore 备注:Accepted in CVPR 2021 链接:https://arxiv.org/abs/2106.11013 摘要:视频接地的目的是定位一个时刻从一个未经剪辑的视频为一个给定的文本查询。现有的方法更多地关注视觉和语言刺激与各种基于可能性的匹配或回归策略的匹配,即P(Y | X)。因此,由于数据集的选择偏差,这些模型可能会受到语言和视频特征之间虚假相关性的影响。1) 为了揭示模型和数据背后的因果关系,我们首先从因果推理的角度提出了一种新的范式,即基于结构化因果模型(SCM)和do演算P(Y | do(X))的介入性视频接地(IVG)。然后,我们提出了一个简单而有效的方法来近似未观察到的混杂因素,因为它不能直接从数据集中采样。2) 同时,我们引入了一种双重对比学习方法(DCL),通过最大化查询和视频片段之间的互信息(MI)以及目标时刻的开始帧/结束帧与视频中其他帧之间的互信息(MI)来更好地对齐文本和视频,以学习更多信息的视觉表示。在三个标准基准上的实验表明了该方法的有效性。 摘要:Video grounding aims to localize a moment from an untrimmed video for a given textual query. Existing approaches focus more on the alignment of visual and language stimuli with various likelihood-based matching or regression strategies, i.e., P(Y|X). Consequently, these models may suffer from spurious correlations between the language and video features due to the selection bias of the dataset. 1) To uncover the causality behind the model and data, we first propose a novel paradigm from the perspective of the causal inference, i.e., interventional video grounding (IVG) that leverages backdoor adjustment to deconfound the selection bias based on structured causal model (SCM) and do-calculus P(Y|do(X)). Then, we present a simple yet effective method to approximate the unobserved confounder as it cannot be directly sampled from the dataset. 2) Meanwhile, we introduce a dual contrastive learning approach (DCL) to better align the text and video by maximizing the mutual information (MI) between query and video clips, and the MI between start/end frames of a target moment and the others within a video to learn more informative visual representations. Experiments on three standard benchmarks show the effectiveness of our approaches.

【8】 Affect-driven Engagement Measurement from Videos 标题:基于视频的情感驱动型参与度测量

作者:Ali Abedi,Shehroz Khan 机构: majority of the recent works on objective engagementmeasurement in online programs and human-computer in-teraction are based on the visual data of participants ac-•Ali Abedi is with KITE-Toronto Rehabilitation Institute, UniversityHealth Network 备注:13 pages, 8 figures, 7 tables 链接:https://arxiv.org/abs/2106.10882 摘要:在教育和干预项目中,人的参与被认为是成功完成项目的一个主要因素。人员敬业度的自动测量为指导者实现课程目标和个性化课程交付提供了有用的信息。在这篇论文中,我们提出了一种新的方法,在虚拟学习课程中基于视频的参与度测量。我们建议使用从连续视频帧中提取的情感状态、连续的效价值和唤醒值,以及一个新的潜在情感特征向量和行为特征来测量参与度。基于深度学习的时态模型和基于传统机器学习的非时态模型分别在帧级和视频级特征上进行训练和验证。除了传统的集中式学习外,我们还将该方法应用于分散式联邦学习环境中,并研究了模型个性化在敬业度测量中的作用。我们在仅有的两个公开的视频参与度测量数据集DAiSEE和EmotiW上评估了该方法的性能,其中包含在线学习项目中学生的视频。我们的实验表明,在DAiSEE数据集上,最先进的参与度分类准确率为63.3%,正确地分类了脱离视频,在EmotiW数据集上,回归均方误差为0.0673。我们的烧蚀研究表明,有效地结合影响国家的参与测量。我们解释的结果,从实验结果的基础上心理学概念在领域的参与。 摘要:In education and intervention programs, person's engagement has been identified as a major factor in successful program completion. Automatic measurement of person's engagement provides useful information for instructors to meet program objectives and individualize program delivery. In this paper, we present a novel approach for video-based engagement measurement in virtual learning programs. We propose to use affect states, continuous values of valence and arousal extracted from consecutive video frames, along with a new latent affective feature vector and behavioral features for engagement measurement. Deep learning-based temporal, and traditional machine-learning-based non-temporal models are trained and validated on frame-level, and video-level features, respectively. In addition to the conventional centralized learning, we also implement the proposed method in a decentralized federated learning setting and study the effect of model personalization in engagement measurement. We evaluated the performance of the proposed method on the only two publicly available video engagement measurement datasets, DAiSEE and EmotiW, containing videos of students in online learning programs. Our experiments show a state-of-the-art engagement level classification accuracy of 63.3% and correctly classifying disengagement videos in the DAiSEE dataset and a regression mean squared error of 0.0673 on the EmotiW dataset. Our ablation study shows the effectiveness of incorporating affect states in engagement measurement. We interpret the findings from the experimental results based on psychology concepts in the field of engagement.

【9】 Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding 标题:增强2D-TAN:以人为中心的时空视频接地的两阶段方法

作者:Chaolei Tan,Zihang Lin,Jian-Fang Hu,Xiang Li,Wei-Shi Zheng 机构:Sun Yat-sen University, China, Meituan 备注:A technical report on our solution for Person in Context(PIC) Challenge HCVG track at CVPR 2021 workshop 链接:https://arxiv.org/abs/2106.10634 摘要:我们提出了一种有效的两阶段方法来解决基于语言的以人为中心的时空视频定位(HC-STVG)任务。在第一阶段,我们提出一个增广2D时间相邻网络(增广2D-TAN),在时间维度上定位与给定描述对应的目标时刻。我们主要从两个方面对原有的2D-TAN进行了改进:首先,开发了一个时间上下文感知的Bi-LSTM聚合模块来聚合片段级表示,取代了原有的max-pooling;其次,我们建议在训练阶段使用随机串联扩充(RCA)机制。在第二阶段,我们使用预训练的MDETR模型根据语言查询生成每帧的边界框,并设计一套手工规则,在定位到的时刻内为每一帧选择MDETR输出的最佳匹配边界框。 摘要:We propose an effective two-stage approach to tackle the problem of language-based Human-centric Spatio-Temporal Video Grounding (HC-STVG) task. In the first stage, we propose an Augmented 2D Temporal Adjacent Network (Augmented 2D-TAN) to temporally ground the target moment corresponding to the given description. Primarily, we improve the original 2D-TAN from two aspects: First, a temporal context-aware Bi-LSTM Aggregation Module is developed to aggregate clip-level representations, replacing the original max-pooling. Second, we propose to employ Random Concatenation Augmentation (RCA) mechanism during the training phase. In the second stage, we use pretrained MDETR model to generate per-frame bounding boxes via language query, and design a set of hand-crafted rules to select the best matching bounding box outputted by MDETR for each frame within the grounded moment.

【10】 Video Summarization through Reinforcement Learning with a 3D Spatio-Temporal U-Net 标题:基于三维时空U-Net的强化学习视频摘要

作者:Tianrui Liu,Qingjie Meng,Jun-Jie Huang,Athanasios Vlontzos,Daniel Rueckert,Bernhard Kainz 链接:https://arxiv.org/abs/2106.10528 摘要:智能视频摘要算法在去除冗余视频帧的同时,通过识别视频中最重要、最具解释性的内容,快速地传递视频中最相关的信息。本文介绍了用于视频摘要的3DST-UNet-RL框架。该框架使用三维时空U-Net对输入视频的时空信息进行有效编码,以用于下游强化学习(RL)。RL代理从时空潜在分数中学习,并预测在视频摘要中保留或舍弃某一视频帧的动作。我们研究了真实/膨胀(inflated)的三维时空CNN特征是否比常用的二维图像特征更适合从视频中学习表示。我们的框架既可以在完全无监督模式下运行,也可以在有监督训练模式下运行。我们分析了预设摘要长度的影响,并在两个常用的通用视频摘要基准上给出了3DST-UNet-RL有效性的实验证据。我们还将该方法应用于一个医学视频摘要任务中。提出的视频摘要方法在不丢失重要信息的前提下,既能节省超声筛查视频的存储成本,又能提高在回顾性分析或审计过程中浏览患者视频数据的效率。 摘要:Intelligent video summarization algorithms allow to quickly convey the most relevant information in videos through the identification of the most essential and explanatory content while removing redundant video frames. In this paper, we introduce the 3DST-UNet-RL framework for video summarization. A 3D spatio-temporal U-Net is used to efficiently encode spatio-temporal information of the input videos for downstream reinforcement learning (RL). An RL agent learns from spatio-temporal latent scores and predicts actions for keeping or rejecting a video frame in a video summary. We investigate if real/inflated 3D spatio-temporal CNN features are better suited to learn representations from videos than commonly used 2D image features. Our framework can operate in both, a fully unsupervised mode and a supervised training mode. We analyse the impact of prescribed summary lengths and show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks. We also applied our method on a medical video summarization task. The proposed video summarization method has the potential to save storage costs of ultrasound screening videos as well as to increase efficiency when browsing patient video data during retrospective analysis or audit without losing essential information.

【11】 Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering 标题:关注所需:用于视频问答的运动-外观协同网络

作者:Ahjeong Seo,Gi-Cheon Kang,Joonhan Park,Byoung-Tak Zhang 机构:AI Institute for Seoul National University (AIIS), Hanyang University 备注:ACL 2021 链接:https://arxiv.org/abs/2106.10446 摘要:视频问答是一项要求人工智能代理基于视频回答问题的任务。这项任务包含三个关键挑战:(1)理解各种问题的意图,(2)捕捉输入视频的各种元素(例如,对象、动作、因果关系),以及(3)语言和视觉信息之间的跨模态对齐(grounding)。我们提出了运动-外观协同网络(MASN),它嵌入了基于运动和外观信息的两种跨模态特征,并根据问题的意图有选择地利用它们。MASN由运动模块、外观模块和运动-外观融合模块组成。运动模块计算面向动作的跨模态联合表示,而外观模块关注输入视频的外观。最后,运动-外观融合模块将运动模块和外观模块的输出作为输入,进行问题引导的融合。因此,MASN在TGIF-QA和MSVD-QA数据集上实现了最新的性能。我们还通过可视化MASN的推理结果进行了定性分析。代码可在 https://github.com/ahjeongseo/MASN-pytorch 获取。 摘要:Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understand the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes each output of the motion module and the appearance module as input, and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN. The code is available at https://github.com/ahjeongseo/MASN-pytorch.
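
下面是"问题引导融合"这一步的最小示意:用问题向量分别为运动特征与外观特征打分,再按softmax权重加权求和;具体结构与维度均为假设,并非MASN的原始实现。

```python
import torch
import torch.nn as nn

class QuestionGuidedFusion(nn.Module):
    """问题引导融合的最小示意:用问题向量对运动/外观两路特征计算注意力权重后加权求和。"""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim * 2, 1)

    def forward(self, motion, appear, question):   # 均为 (batch, dim)
        w_m = self.score(torch.cat([motion, question], dim=-1))
        w_a = self.score(torch.cat([appear, question], dim=-1))
        w = torch.softmax(torch.cat([w_m, w_a], dim=-1), dim=-1)  # (batch, 2)
        return w[:, :1] * motion + w[:, 1:] * appear

fuse = QuestionGuidedFusion()
out = fuse(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 512])
```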

【12】 Single View Physical Distance Estimation using Human Pose 标题:基于人体姿态的单视物理距离估计

作者:Xiaohan Fei,Henry Wang,Xiangyu Zeng,Lin Lee Cheong,Meng Wang,Joseph Tighe 机构:Amazon 链接:https://arxiv.org/abs/2106.10335 摘要:我们提出了一个全自动系统,能够从单幅RGB图像、或由固定视点观察三维场景的摄像机所拍摄的视频中,同时估计摄像机内参、地平面以及人与人之间的物理距离。为了实现摄像机标定和距离估计的自动化,我们利用人体姿态的先验知识,提出了一种新的基于姿态的自动标定和距离估计的直接公式,在公开数据集上取得了最先进的性能。所提出的方法使现有的摄像机系统无需专门的标定流程或测距传感器即可测量物理距离,适用于社交距离监测和工作场所安全等广泛用例。此外,为了支持该领域的评估并推动相关研究,我们为公开的MEVA数据集补充了额外的距离标注,得到了MEVADA——世界上第一个针对基于姿态的自动标定和距离估计问题的评估基准。 摘要:We propose a fully automated system that simultaneously estimates the camera intrinsics, the ground plane, and physical distances between people from a single RGB image or video captured by a camera viewing a 3-D scene from a fixed vantage point. To automate camera calibration and distance estimation, we leverage priors about human pose and develop a novel direct formulation for pose-based auto-calibration and distance estimation, which shows state-of-the-art performance on publicly available datasets. The proposed approach enables existing camera systems to measure physical distances without needing a dedicated calibration process or range sensors, and is applicable to a broad range of use cases such as social distancing and workplace safety. Furthermore, to enable evaluation and drive research in this area, we contribute to the publicly available MEVA dataset with additional distance annotations, resulting in MEVADA -- the first evaluation benchmark in the world for the pose-based auto-calibration and distance estimation problem.
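
作为原理演示,下面给出一个高度简化的示意:在已知(或假设的)相机内参下,利用平均身高先验和检测到的头/脚像素位置,用针孔模型近似恢复每个人的三维位置并计算两两距离。这只是说明"姿态先验可用于尺度恢复"的思路,并非论文提出的自动标定公式。

```python
import numpy as np

def person_xyz(foot_uv, head_uv, K, person_height=1.7):
    """极简针孔模型示意:用假设的平均身高(1.7m)和头脚像素高度估计深度,
    再把脚点反投影到相机坐标系。仅作原理演示,非论文的自动标定方法。"""
    f = K[0, 0]
    pixel_h = abs(head_uv[1] - foot_uv[1])
    z = f * person_height / pixel_h                 # 由相似三角形得到的近似深度
    uv1 = np.array([foot_uv[0], foot_uv[1], 1.0])
    return z * (np.linalg.inv(K) @ uv1)             # 脚点的三维坐标(相机系)

K = np.array([[1000., 0., 960.], [0., 1000., 540.], [0., 0., 1.]])   # 假设的内参
p1 = person_xyz((900, 800), (900, 500), K)
p2 = person_xyz((1300, 820), (1300, 540), K)
print(f"两人间距约 {np.linalg.norm(p1 - p2):.2f} m")
```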

医学相关(1篇)

【1】 CataNet: Predicting remaining cataract surgery duration 标题:CataNet:预测白内障手术剩余时间

作者:Andrés Marafioti,Michel Hayoz,Mathias Gallardo,Pablo Márquez Neila,Sebastian Wolf,Martin Zinkernagel,Raphael Sznitman 机构: AIMI, ARTORG Center, University of Bern, Switzerland, Department for Ophthalmology, Inselspital, University Hospital, University of 备注:Accepted at MICCAI 2021 链接:https://arxiv.org/abs/2106.11048 摘要:白内障手术是一种挽救视力的手术,每年在世界各地进行超过1000万次。面对如此巨大的需求,高效组织外科病房和手术室的能力是在常规临床护理中提供这种治疗的关键。在这种情况下,估计手术过程中的剩余手术时长(RSD)是帮助优化患者吞吐量和工作流程的一种方法。为此,我们提出了CataNet,一种实时预测白内障手术RSD的方法,它同时结合了两个影响因素:外科医生的经验和手术的当前阶段。我们将CataNet与最新的RSD估计方法进行了比较,结果表明,即使不考虑手术阶段和经验,CataNet的性能也优于现有方法。我们对这一改进进行了分析,发现其中一个重要的贡献因素是我们将已用时间整合进CataNet特征提取器的方式。 摘要:Cataract surgery is a sight saving surgery that is performed over 10 million times each year around the world. With such a large demand, the ability to organize surgical wards and operating rooms efficiently is critical to delivery this therapy in routine clinical care. In this context, estimating the remaining surgical duration (RSD) during procedures is one way to help streamline patient throughput and workflows. To this end, we propose CataNet, a method for cataract surgeries that predicts in real time the RSD jointly with two influential elements: the surgeon's experience, and the current phase of the surgery. We compare CataNet to state-of-the-art RSD estimation methods, showing that it outperforms them even when phase and experience are not considered. We investigate this improvement and show that a significant contributor is the way we integrate the elapsed time into CataNet's feature extractor.
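
摘要提到"将已用时间整合进特征提取器"是性能提升的重要来源。下面用一个假设性的最小示意说明这种做法:把归一化的已用时间拼接到每帧特征上,再用时序网络回归剩余手术时长;维度与融合方式均为假设。

```python
import torch
import torch.nn as nn

class RSDHead(nn.Module):
    """已用时间并入特征的最小示意:把已用时间拼接到每帧CNN特征后,
    再交给GRU回归剩余手术时长(RSD)。"""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, frame_feats, elapsed_min):      # (B,T,feat_dim), (B,T)
        x = torch.cat([frame_feats, elapsed_min.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h[:, -1])                      # 预测剩余分钟数

head = RSDHead()
rsd = head(torch.randn(2, 30, 256), torch.linspace(0, 5, 30).repeat(2, 1))
print(rsd.shape)  # torch.Size([2, 1])
```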

GAN|对抗|攻击|生成相关(8篇)

【1】 Delving into the pixels of adversarial samples 标题:深入研究对抗性样本的像素

作者:Blerta Lindqvist 机构:Department of Computer Science, Aalto University, Helsinki, Finland 链接:https://arxiv.org/abs/2106.10996 摘要:尽管对对抗性攻击进行了广泛的研究,但我们不知道对抗性攻击是如何影响图像像素的。了解图像像素如何受到敌方攻击的影响,有可能使我们获得更好的敌方防御。基于我们发现强攻击不会转移的实例,我们深入研究像素级的对抗性示例,以仔细研究对抗性攻击如何影响图像像素值。我们考虑了几种ImageNet架构,InceptionV3、VGG19和ResNet50,以及几种强攻击。我们发现,攻击可以在像素级别上有不同的效果,这取决于分类器结构。特别是,在攻击对像素的影响中,输入预处理扮演了一个以前被忽视的角色。基于像素级检测的洞察力,我们找到了新的方法来检测一些最强的当前攻击。 摘要:Despite extensive research into adversarial attacks, we do not know how adversarial attacks affect image pixels. Knowing how image pixels are affected by adversarial attacks has the potential to lead us to better adversarial defenses. Motivated by instances that we find where strong attacks do not transfer, we delve into adversarial examples at pixel level to scrutinize how adversarial attacks affect image pixel values. We consider several ImageNet architectures, InceptionV3, VGG19 and ResNet50, as well as several strong attacks. We find that attacks can have different effects at pixel level depending on classifier architecture. In particular, input pre-processing plays a previously overlooked role in the effect that attacks have on pixels. Based on the insights of pixel-level examination, we find new ways to detect some of the strongest current attacks.

【2】 Confidence-Guided Radiology Report Generation 标题:置信度引导的放射学报告生成

作者:Yixin Wang,Zihao Lin,Jiang Tian,zhongchao shi,Yang Zhang,Jianping Fan,Zhiqiang He 机构:ICT, Chinese Academy of Sciences, Duke University, Lenovo, UNC Charlotte 备注:16 pages 链接:https://arxiv.org/abs/2106.10887 摘要:医学影像学在临床诊断和治疗中起着举足轻重的作用。受自动图像字幕的重大进展的启发,各种基于深度学习(DL)的体系结构被提出用于生成医学图像的放射报告。然而,模型的不确定性(即模型的可靠性/生成报告的置信度)仍然是一个有待探讨的问题。在本文中,我们提出了一种新的方法来显式地量化视觉不确定性和文本不确定性的放射报告生成任务。这种多模态不确定性能够充分地捕获报告级和句子级的模型置信度得分,因此它们被进一步利用来加权损失,以实现更全面的模型优化。我们的实验结果表明,我们提出的模型不确定性表征和估计方法可以为放射报告生成提供更可靠的置信度,我们提出的不确定性加权损失可以实现更全面的模型优化,并在公共放射报告数据集上获得最先进的性能。 摘要:Medical imaging plays a pivotal role in diagnosis and treatment in clinical practice. Inspired by the significant progress in automatic image captioning, various deep learning (DL)-based architectures have been proposed for generating radiology reports for medical images. However, model uncertainty (i.e., model reliability/confidence on report generation) is still an under-explored problem. In this paper, we propose a novel method to explicitly quantify both the visual uncertainty and the textual uncertainty for the task of radiology report generation. Such multi-modal uncertainties can sufficiently capture the model confidence scores at both the report-level and the sentence-level, and thus they are further leveraged to weight the losses for achieving more comprehensive model optimization. Our experimental results have demonstrated that our proposed method for model uncertainty characterization and estimation can provide more reliable confidence scores for radiology report generation, and our proposed uncertainty-weighted losses can achieve more comprehensive model optimization and result in state-of-the-art performance on a public radiology report dataset.
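
摘要未给出不确定性加权损失的具体形式,下面用常见的异方差不确定性加权形式做一个示意(各句子损失按预测的不确定性缩放并加上正则项),仅供理解,并非论文原始公式。

```python
import torch

def uncertainty_weighted_loss(losses, log_sigma2):
    """常见的不确定性加权形式示意:L_i * exp(-s_i) + s_i,其中 s_i = log(sigma_i^2)。
    不确定性越大的句子,其损失权重越小,同时受log项惩罚。"""
    return (losses * torch.exp(-log_sigma2) + log_sigma2).mean()

losses = torch.tensor([0.8, 2.1, 0.3])                            # 各句子的损失(示例数值)
log_sigma2 = torch.tensor([0.1, 0.9, -0.2], requires_grad=True)   # 模型预测的不确定性
total = uncertainty_weighted_loss(losses, log_sigma2)
total.backward()
print(total.item())
```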

【3】 Total Generate: Cycle in Cycle Generative Adversarial Networks for Generating Human Faces, Hands, Bodies, and Natural Scenes 标题:Total Generate:用于生成人脸、手、身体和自然场景的循环生成对抗性网络

作者:Hao Tang,Nicu Sebe 机构:Department of Information Engineering and Computer Science (DISI), University of Trento 备注:Accepted to TMM, an extended version of a paper published in ACM MM 2019. arXiv admin note: substantial text overlap with arXiv:1908.00999 链接:https://arxiv.org/abs/2106.10876 摘要:我们提出了一种新的、统一的循环中循环生成对抗网络(C2GAN),用于生成人脸、手、身体和自然场景。我们提出的C2GAN是一个跨模态模型,以交互方式探索对输入图像数据和引导数据的联合利用。C2GAN包含两个不同的生成器,即图像生成生成器和引导生成生成器。两个生成器相互连接并以端到端的方式训练,并显式地构成三个循环子网,即一个图像生成循环和两个引导生成循环。每个循环的目标是重建输入域,同时产生一个参与另一个循环生成过程的有用输出。通过这种方式,各循环相互约束,隐式地提供来自图像和引导两种模态的互补信息,并在各循环之间带来额外的监督梯度,从而促进整个模型更稳健的优化。在四个引导图像到图像翻译子任务上的大量实验结果表明,与现有最先进模型相比,C2GAN能有效地生成更真实的图像。代码可在 https://github.com/Ha0Tang/C2GAN 获取。 摘要:We propose a novel and unified Cycle in Cycle Generative Adversarial Network (C2GAN) for generating human faces, hands, bodies, and natural scenes. Our proposed C2GAN is a cross-modal model exploring the joint exploitation of the input image data and guidance data in an interactive manner. C2GAN contains two different generators, i.e., an image-generation generator and a guidance-generation generator. Both generators are mutually connected and trained in an end-to-end fashion and explicitly form three cycled subnets, i.e., one image generation cycle and two guidance generation cycles. Each cycle aims at reconstructing the input domain and simultaneously produces a useful output involved in the generation of another cycle. In this way, the cycles constrain each other implicitly providing complementary information from both image and guidance modalities and bringing an extra supervision gradient across the cycles, facilitating a more robust optimization of the whole model. Extensive results on four guided image-to-image translation subtasks demonstrate that the proposed C2GAN is effective in generating more realistic images compared with state-of-the-art models. The code is available at https://github.com/Ha0Tang/C2GAN.
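
下面用一个极简示意说明"一个图像生成循环+两个引导生成循环"的重建约束如何写成损失项;这里统一用L1重建损失,只为说明循环间相互约束的思路,具体损失设计以论文为准。

```python
import torch

def cycle_losses(x, g, x_rec, g_rec_1, g_rec_2):
    """三个循环子网重建损失的示意(一个图像循环、两个引导循环),均用L1重建。
    x/g为输入图像与引导数据,x_rec/g_rec_*为各循环重建出的结果(此处用随机张量占位)。"""
    l_img = torch.nn.functional.l1_loss(x_rec, x)        # 图像生成循环
    l_g1 = torch.nn.functional.l1_loss(g_rec_1, g)       # 引导生成循环 1
    l_g2 = torch.nn.functional.l1_loss(g_rec_2, g)       # 引导生成循环 2
    return l_img + l_g1 + l_g2

x, g = torch.rand(2, 3, 128, 128), torch.rand(2, 1, 128, 128)
print(cycle_losses(x, g, torch.rand_like(x), torch.rand_like(g), torch.rand_like(g)))
```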

【4】 Structured Sparse R-CNN for Direct Scene Graph Generation 标题:用于直接场景图生成的结构化稀疏R-CNN

作者:Yao Teng,Limin Wang 机构:State Key Laboratory for Novel Software Technology, Nanjing University, China 备注:Technical report 链接:https://arxiv.org/abs/2106.10815 摘要:场景图生成(SGG)旨在检测图像中的实体对及其关系。现有的SGG方法通常使用多阶段流水线将该任务分解为目标检测、关系图构造以及稠密或稠密到稀疏的关系预测。与此不同,本文从将SGG视为直接集合预测的角度出发,提出了一种简单、稀疏、统一的关系检测框架,称为结构化稀疏R-CNN。该方法的关键是一组可学习的三元组查询和结构化的三元组检测器,它们可以在训练集上以端到端的方式联合优化。具体来说,三元组查询对实体对位置、类别及其关系的一般先验信息进行编码,并为后续的细化提供关系检测的初始猜测。三元组检测器采用级联动态头设计,逐步细化关系检测结果。此外,针对结构化稀疏R-CNN训练困难的问题,我们提出了一种基于孪生稀疏R-CNN(Siamese Sparse R-CNN)知识蒸馏的宽松增强训练策略。针对不平衡数据分布,提出了自适应聚焦参数和平均logit方法。我们在Visual Genome和Open Images两个基准上进行了实验,结果表明我们的方法达到了最先进的性能。同时,我们进行了深入的消融研究,为三元组检测器设计和训练策略中的结构化建模提供见解。 摘要:Scene graph generation (SGG) is to detect entity pairs with their relations in an image. Existing SGG approaches often use multi-stage pipelines to decompose this task into object detection, relation graph construction, and dense or dense-to-sparse relation prediction. Instead, from a perspective on SGG as a direct set prediction, this paper presents a simple, sparse, and unified framework for relation detection, termed as Structured Sparse R-CNN. The key to our method is a set of learnable triplet queries and structured triplet detectors which could be jointly optimized from the training set in an end-to-end manner. Specifically, the triplet queries encode the general prior for entity pair locations, categories, and their relations, and provide an initial guess of relation detection for subsequent refinement. The triplet detector presents a cascaded dynamic head design to progressively refine the results of relation detection. In addition, to relieve the training difficulty of Structured Sparse R-CNN, we propose a relaxed and enhanced training strategy based on knowledge distillation from a Siamese Sparse R-CNN. We also propose adaptive focusing parameter and average logit approach for imbalance data distribution. We perform experiments on two benchmarks: Visual Genome and Open Images, and the results demonstrate that our method achieves the state-of-the-art performance. Meanwhile, we perform in-depth ablation studies to provide insights on our structured modeling in triplet detector design and training strategies.

【5】 Adversarial Manifold Matching via Deep Metric Learning for Generative Modeling 标题:基于深度度量学习的对抗性流形匹配产生式建模

作者:Mengyu Dai,Haibin Hang 机构:Microsoft, University of Delaware 链接:https://arxiv.org/abs/2106.10777 摘要:我们提出了一种生成模型的流形匹配方法,它包括一个分布生成器(或数据生成器)和一个度量生成器。在我们的框架中,我们将真实的数据集看作嵌入高维欧氏空间的流形。分布生成器的目标是生成样本,这些样本遵循围绕真实数据流形压缩的某种分布。它是通过使用两组点的几何形状描述符(例如质心和直径)和学习的距离度量来匹配两组点来实现的;度量生成器利用真实数据和生成的样本来学习距离度量,该距离度量接近真实数据流形上的某个固有测地距离。生成的距离度量进一步用于流形匹配。在训练过程中,两个网络同时学习。我们将该方法应用于无监督学习和有监督学习任务中:在无条件图像生成任务中,与已有的生成模型相比,该方法取得了较好的效果;在超分辨率任务中,我们将该框架融入到基于感知的模型中,通过生成具有更自然纹理的样本来提高视觉质量。理论分析和实际数据实验都证明了该框架的可行性和有效性。 摘要:We propose a manifold matching approach to generative models which includes a distribution generator (or data generator) and a metric generator. In our framework, we view the real data set as some manifold embedded in a high-dimensional Euclidean space. The distribution generator aims at generating samples that follow some distribution condensed around the real data manifold. It is achieved by matching two sets of points using their geometric shape descriptors, such as centroid and $p$-diameter, with learned distance metric; the metric generator utilizes both real data and generated samples to learn a distance metric which is close to some intrinsic geodesic distance on the real data manifold. The produced distance metric is further used for manifold matching. The two networks are learned simultaneously during the training process. We apply the approach on both unsupervised and supervised learning tasks: in unconditional image generation task, the proposed method obtains competitive results compared with existing generative models; in super-resolution task, we incorporate the framework in perception-based models and improve visual qualities by producing samples with more natural textures. Both theoretical analysis and real data experiments guarantee the feasibility and effectiveness of the proposed framework.

【6】 Attack to Fool and Explain Deep Networks 标题:用于欺骗和解释深度网络的攻击

作者:Naveed Akhtar,Muhammad A. A. K. Jalwana,Mohammed Bennamoun,Ajmal Mian 机构:Department of Computer Science and Software Engineering, University of Western Australia 备注:To appear in IEEE TPAMI. arXiv admin note: text overlap with arXiv:1905.11544 链接:https://arxiv.org/abs/2106.10606 摘要:深度视觉模型容易受到输入上的对抗性扰动的影响。尽管这些信号是经过精心设计的,但在人类看来,它们仍然像噪声一样。这一观察结果引出了"深度视觉表征与人类感知不一致"的论点。我们通过提供对抗性扰动中存在对人类有意义的模式的证据来反驳这一论点。我们首先提出一种攻击,欺骗网络使其将某一整类对象(源类)与目标标签相混淆。我们的攻击还限制了非源类样本被无意欺骗的情况,从而将网络欺骗限定在人类定义的语义概念范围内。我们证明了所提出的攻击不仅使扰动中出现规则的几何模式,而且揭示了关于深度模型决策边界的深刻信息。为进一步探讨这一现象,我们改变了攻击的"对抗性"目标,将其用作"解释"深度视觉表征的工具。我们表明,通过对本方法计算出的扰动进行仔细的引导和投影,可以可视化模型对人类定义的语义概念的理解。最后,我们利用扰动的可解释性,通过攻击具有对抗鲁棒性的"分类器"来进行图像生成、图像修复和交互式图像处理。总之,我们的主要贡献是一种新的实用对抗攻击,它随后被转化为一种解释视觉模型的工具。文章的次要贡献在于,通过多个有趣的应用展示了我们的攻击在对抗性目标之外的实用价值。 摘要:Deep visual models are susceptible to adversarial perturbations to inputs. Although these signals are carefully crafted, they still appear noise-like patterns to humans. This observation has led to the argument that deep visual representation is misaligned with human perception. We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations. We first propose an attack that fools a network to confuse a whole category of objects (source class) with a target label. Our attack also limits the unintended fooling by samples from non-sources classes, thereby circumscribing human-defined semantic notions for network fooling. We show that the proposed attack not only leads to the emergence of regular geometric patterns in the perturbations, but also reveals insightful information about the decision boundaries of deep models. Exploring this phenomenon further, we alter the `adversarial' objective of our attack to use it as a tool to `explain' deep visual representation. We show that by careful channeling and projection of the perturbations computed by our method, we can visualize a model's understanding of human-defined semantic notions. Finally, we exploit the explainability properties of our perturbations to perform image generation, inpainting and interactive image manipulation by attacking adversarially robust `classifiers'. In all, our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret the visual models. The article also makes secondary contributions in terms of establishing the utility of our attack beyond the adversarial objective with multiple interesting applications.

【7】 Deep Generative Learning via Schrödinger Bridge 标题:基于薛定谔桥的深度生成性学习

作者:Gefei Wang,Yuling Jiao,Qian Xu,Yang Wang,Can Yang 机构:Department of Mathematics, The Hong Kong University of Science and Technology, China; School of Mathematics and Statistics, Wuhan University 备注:None 链接:https://arxiv.org/abs/2106.10410 摘要:我们提出通过基于薛定谔桥的熵插值来学习生成模型。基于Kullback-Leibler散度,生成学习任务可以表示为参考分布和目标分布之间的插值。在总体水平上,这种熵插值由$[0,1]$上带有时变漂移项的SDE来刻画。在样本水平上,我们把由深度分数估计器和深度密度比估计器估计得到的漂移项代入Euler-Maruyama方法,从而导出薛定谔桥算法。在目标分布的一些温和光滑性假设下,我们证明了分数估计器和密度比估计器的相合性,并进一步建立了所提出的薛定谔桥方法的相合性。我们的理论结果保证了所学习的分布收敛到目标分布。在多模态合成数据和基准数据上的实验结果支持了我们的理论发现,并表明基于薛定谔桥的生成模型与最新的GANs具有可比性,展示了一种新的生成学习形式。我们还证明了它在图像插值和图像修复中的有效性。 摘要:We propose to learn a generative model via entropy interpolation with a Schrödinger Bridge. The generative learning task can be formulated as interpolating between a reference distribution and a target distribution based on the Kullback-Leibler divergence. At the population level, this entropy interpolation is characterized via an SDE on $[0,1]$ with a time-varying drift term. At the sample level, we derive our Schrödinger Bridge algorithm by plugging the drift term estimated by a deep score estimator and a deep density ratio estimator into the Euler-Maruyama method. Under some mild smoothness assumptions of the target distribution, we prove the consistency of both the score estimator and the density ratio estimator, and then establish the consistency of the proposed Schrödinger Bridge approach. Our theoretical results guarantee that the distribution learned by our approach converges to the target distribution. Experimental results on multimodal synthetic data and benchmark data support our theoretical findings and indicate that the generative model via Schrödinger Bridge is comparable with state-of-the-art GANs, suggesting a new formulation of generative learning. We demonstrate its usefulness in image interpolation and image inpainting.
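
下面示意如何在Euler-Maruyama离散化中代入一个由网络估计的漂移项进行采样;漂移函数在论文中由深度分数估计器与密度比估计器组合得到,此处仅用一个玩具漂移占位。

```python
import torch

def euler_maruyama_sample(drift_fn, x0, n_steps=100, sigma=1.0):
    """Euler-Maruyama离散化示意:x_{t+dt} = x_t + drift(x_t, t)*dt + sigma*sqrt(dt)*eps。
    drift_fn 假定由训练好的漂移估计网络给出,此处用玩具函数代替。"""
    x, dt = x0.clone(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + drift_fn(x, t) * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
    return x

toy_drift = lambda x, t: -x                     # 占位漂移:把样本拉向原点
samples = euler_maruyama_sample(toy_drift, torch.randn(64, 2))
print(samples.mean().item(), samples.std().item())
```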

【8】 Dynamical Deep Generative Latent Modeling of 3D Skeletal Motion 标题:三维骨骼运动的动态深生成潜在建模

作者:Amirreza Farnoosh,Sarah Ostadabbas 机构:Received: date Accepted: date 链接:https://arxiv.org/abs/2106.10393 摘要:在本文中,我们提出了一个贝叶斯切换动态模型来分割随时间变化的三维姿态数据,该模型揭示了数据中可解释的模式,并且具有生成性。我们的模型将高度相关的骨架数据分解为一组在低维潜在框架中切换时间过程的空间基础。我们将这些时间过程参数化为切换深向量自回归先验,以适应多模态和高阶非线性相互依赖。这就产生了一个动态深层生成的潜在模型,该模型利用近似变分推理解析了三维姿态数据动力学中有意义的内在状态,实现了复杂骨骼运动的低层次动态生成和分割。我们对四个生物运动数据集(包括蝙蝠飞行、莎莎舞、步行和高尔夫数据集)进行的实验证明,与最先进的方法相比,我们的模型具有优越的性能。 摘要:In this paper, we propose a Bayesian switching dynamical model for segmentation of 3D pose data over time that uncovers interpretable patterns in the data and is generative. Our model decomposes highly correlated skeleton data into a set of few spatial basis of switching temporal processes in a low-dimensional latent framework. We parameterize these temporal processes with regard to a switching deep vector autoregressive prior in order to accommodate both multimodal and higher-order nonlinear inter-dependencies. This results in a dynamical deep generative latent model that parses the meaningful intrinsic states in the dynamics of 3D pose data using approximate variational inference, and enables a realistic low-level dynamical generation and segmentation of complex skeleton movements. Our experiments on four biological motion data containing bat flight, salsa dance, walking, and golf datasets substantiate superior performance of our model in comparison with the state-of-the-art methods.

自动驾驶|车辆|车道检测等(4篇)

【1】 Attention-based Neural Network for Driving Environment Complexity Perception 标题:基于注意力的神经网络在驾驶环境复杂性感知中的应用

作者:Ce Zhang,Azim Eskandarian,Xuelai Du 机构:He is a Ph.D. student at Virginia Tech Autonomous Systems and Intelligent, Machines (ASIM) Lab., predicts the affordance for driving actions (vehicle angle 备注:Accepted by 2021 IEEE Intelligent Transportation Systems Conference 链接:https://arxiv.org/abs/2106.11277 摘要:环境感知对自动驾驶汽车(AV)的安全至关重要。现有的AV感知算法大多没有研究周围环境的复杂度,没有考虑环境复杂度参数。提出了一种新的基于注意的神经网络模型来预测周围驾驶环境的复杂程度。该模型以自然驾驶视频和相应的车辆动力学参数作为输入。它由一个Yolo-v3目标检测算法、一个热图生成算法、基于CNN的特征提取器和基于注意力的特征提取器组成,用于视频和时间序列车辆动力学数据输入以提取特征。该算法的输出是一个环境复杂度参数。利用Berkeley-DeepDrive数据集(BDD数据集)和主观标注的环境复杂度水平对算法进行模型训练和验证。提出的基于注意的网络对周围环境的复杂度进行分类,平均分类准确率达到91.22%。结果表明,该算法能够准确地预测环境复杂度水平,并可用于未来AVs的环境感知研究。 摘要:Environment perception is crucial for autonomous vehicle (AV) safety. Most existing AV perception algorithms have not studied the surrounding environment complexity and failed to include the environment complexity parameter. This paper proposes a novel attention-based neural network model to predict the complexity level of the surrounding driving environment. The proposed model takes naturalistic driving videos and corresponding vehicle dynamics parameters as input. It consists of a Yolo-v3 object detection algorithm, a heat map generation algorithm, CNN-based feature extractors, and attention-based feature extractors for both video and time-series vehicle dynamics data inputs to extract features. The output from the proposed algorithm is a surrounding environment complexity parameter. The Berkeley DeepDrive dataset (BDD Dataset) and subjectively labeled surrounding environment complexity levels are used for model training and validation to evaluate the algorithm. The proposed attention-based network achieves 91.22% average classification accuracy to classify the surrounding environment complexity. It proves that the environment complexity level can be accurately predicted and applied for future AVs' environment perception studies.

【2】 One Million Scenes for Autonomous Driving: ONCE Dataset 标题:自动驾驶的一百万个场景:一次数据集

作者:Jiageng Mao,Minzhe Niu,Chenhan Jiang,Hanxue Liang,Xiaodan Liang,Yamin Li,Chaoqiang Ye,Wei Zhang,Zhenguo Li,Jie Yu,Hang Xu,Chunjing Xu 机构: 1 The Chinese University of Hong Kong 2 Huawei Noah’s Ark Lab 3 Sun Yat-Sen University 4 Huawei IAS BU Vehicle Cloud Service† Corresponding authors 链接:https://arxiv.org/abs/2106.11037 摘要:目前自主驾驶中的感知模型已经臭名昭著,因为它非常依赖大量的注释数据来覆盖不可见的情况并解决长尾问题。另一方面,从未标记的大规模采集数据中学习和增量自训练的强识别模型越来越受到人们的关注,并可能成为下一代工业级强鲁棒感知模型在自主驾驶中的解决方案。然而,研究界普遍存在着真实场景数据不足的问题,这阻碍了未来全/半/自监督三维感知方法的探索。在本文中,我们介绍了一次(一百万个场景)的数据集,用于自动驾驶场景中的三维目标检测。ONCE数据集由100万个激光雷达场景和700万个相应的相机图像组成。数据从144个驾驶小时中选取,比现有最大的3D自动驾驶数据集(如nuScenes和Waymo)长20倍,并在不同的地区、时段和天气条件下收集。为了促进未来利用未标记数据进行三维检测的研究,我们还提供了一个基准,在这个基准中,我们在一次数据集上复制和评估各种自我监督和半监督方法。我们对这些方法进行了广泛的分析,并提供了与所用数据规模相关的有价值的观察结果。有关数据、代码和更多信息,请访问https://once-for-auto-driving.github.io/index.html. 摘要:Current perception models in autonomous driving have become notorious for greatly relying on a mass of annotated data to cover unseen cases and address the long-tail problem. On the other hand, learning from unlabeled large-scale collected data and incrementally self-training powerful recognition models have received increasing attention and may become the solutions of next-generation industry-level powerful and robust perception models in autonomous driving. However, the research community generally suffered from data inadequacy of those essential real-world scene data, which hampers the future exploration of fully/semi/self-supervised methods for 3D perception. In this paper, we introduce the ONCE (One millioN sCenEs) dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, which is 20x longer than the largest 3D autonomous driving dataset available (e.g. nuScenes and Waymo), and it is collected across a range of different areas, periods and weather conditions. To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset. We conduct extensive analyses on those methods and provide valuable observations on their performance related to the scale of used data. Data, code, and more information are available at https://once-for-auto-driving.github.io/index.html.

【3】 Neural Network Facial Authentication for Public Electric Vehicle Charging Station 标题:公共电动汽车充电站的神经网络人脸识别

作者:Muhamad Amin Husni Abdul Haris,Sin Liang Lim 机构:Multimedia University, Cyberjaya, Malaysia. 备注:None 链接:https://arxiv.org/abs/2106.10432 摘要:本研究旨在探讨并比较Dlib ResNet与K-最近邻(KNN)分类器的人脸识别准确率,特别是在亚洲人种数据集上的表现,因为据报道Dlib ResNet在处理亚洲人脸时准确率存在不足。两种方法都在使用方向梯度直方图(HOG)方法提取的人脸向量上实现,并使用相同的数据集以保证比较的公平性。在电动汽车(EV)充电站中通过人脸识别对用户进行认证,展示了这种认证系统的一个实际用例。 摘要:This study is to investigate and compare the facial recognition accuracy performance of Dlib ResNet against a K-Nearest Neighbour (KNN) classifier. Particularly when used against a dataset from an Asian ethnicity as Dlib ResNet was reported to have an accuracy deficiency when it comes to Asian faces. The comparisons are both implemented on the facial vectors extracted using the Histogram of Oriented Gradients (HOG) method and use the same dataset for a fair comparison. Authentication of a user by facial recognition in an electric vehicle (EV) charging station demonstrates a practical use case for such an authentication system.
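
摘要描述的流程是"HOG特征向量+KNN分类器"。下面给出一个可运行的最小示意(使用scikit-image与scikit-learn),其中HOG参数与占位数据均为假设,仅说明流程。

```python
import numpy as np
from skimage.feature import hog
from sklearn.neighbors import KNeighborsClassifier

def face_vector(gray_face):
    """对对齐后的灰度人脸图提取HOG特征向量(参数为常见取值,并非论文设定)。"""
    return hog(gray_face, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# 占位数据:每个用户若干张 64x64 灰度人脸(真实系统中应来自注册照片)
rng = np.random.default_rng(0)
X = np.stack([face_vector(rng.random((64, 64))) for _ in range(20)])
y = np.repeat(np.arange(4), 5)                  # 4个注册用户、每人5张
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("识别为用户:", knn.predict(face_vector(rng.random((64, 64)))[None])[0])
```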

【4】 A system of vision sensor based deep neural networks for complex driving scene analysis in support of crash risk assessment and prevention 标题:基于视觉传感器的深度神经网络复杂驾驶场景分析系统支持碰撞风险评估和预防

作者:Muhammad Monjurul Karim,Yu Li,Ruwen Qin,Zhaozheng Yin 机构:Department of Civil Engineering, Stony Brook University, Stony Brook, NY , USA, Department of Computer Science 备注:11 Pages, 8 Figures, Presented in TRB conference 链接:https://arxiv.org/abs/2106.10319 摘要:为了帮助人类驾驶员和自动驾驶车辆评估碰撞风险,使用车辆上的仪表板摄像头和深度学习算法进行驾驶场景分析至关重要。尽管这些技术越来越普及,但为此目的的驾驶场景分析仍然是一个挑战。这主要是由于缺乏用于分析碰撞风险指标和碰撞可能性的带注释的大型图像数据集,以及缺乏从复杂驾驶场景中提取大量所需信息的有效方法。为了填补这一空白,本文开发了一个场景分析系统。该系统的多任务神经网络由两个多任务神经网络组成,分别进行场景分类,为每个场景提供四个标签。该系统将deeplabv3和yolov3结合起来,检测和定位危险行人和最近的车辆。所有已识别的信息都可以为自动驾驶车辆或人类驾驶员提供态势感知,以识别周围交通的碰撞风险。为了解决交通事故研究中注释图像数据集的不足,本文开发了两个全新的数据集并向公众开放,这两个数据集在训练所提出的深度神经网络方面是有效的。文中进一步对多网的性能和系统的效率进行了评估。通过典型实例进一步说明了综合场景分析。结果表明,所开发的系统和数据集对驾驶场景分析的有效性,以及对碰撞风险评估和碰撞预防的支持性。 摘要:To assist human drivers and autonomous vehicles in assessing crash risks, driving scene analysis using dash cameras on vehicles and deep learning algorithms is of paramount importance. Although these technologies are increasingly available, driving scene analysis for this purpose still remains a challenge. This is mainly due to the lack of annotated large image datasets for analyzing crash risk indicators and crash likelihood, and the lack of an effective method to extract lots of required information from complex driving scenes. To fill the gap, this paper develops a scene analysis system. The Multi-Net of the system includes two multi-task neural networks that perform scene classification to provide four labels for each scene. The DeepLab v3 and YOLO v3 are combined by the system to detect and locate risky pedestrians and the nearest vehicles. All identified information can provide the situational awareness to autonomous vehicles or human drivers for identifying crash risks from the surrounding traffic. To address the scarcity of annotated image datasets for studying traffic crashes, two completely new datasets have been developed by this paper and made available to the public, which were proved to be effective in training the proposed deep neural networks. The paper further evaluates the performance of the Multi-Net and the efficiency of the developed system. Comprehensive scene analysis is further illustrated with representative examples. Results demonstrate the effectiveness of the developed system and datasets for driving scene analysis, and their supportiveness for crash risk assessment and crash prevention.

Attention注意力(2篇)

【1】 FP-Age: Leveraging Face Parsing Attention for Facial Age Estimation in the Wild 标题:FP-Age:利用面部解析注意力进行野外面部年龄估计

作者:Yiming Lin,Jie Shen,Yujiang Wang,Maja Pantic 备注:Code and data will be available on this https URL 链接:https://arxiv.org/abs/2106.11145 摘要:基于图像的年龄估计旨在从人脸图像中预测一个人的年龄,被用于各种实际应用中。尽管端到端的深度模型在基准数据集上取得了令人印象深刻的年龄估计结果,但由于头部姿态、面部表情和遮挡的巨大变化所带来的挑战,它们在真实场景中的性能仍有很大的改进空间。为了解决这一问题,我们提出了一种简单而有效的方法,将人脸语义显式地融入年龄估计中,使模型能够从未对齐的人脸图像中正确地聚焦于信息量最大的人脸部件,而不受头部姿态和非刚性形变的影响。为此,我们设计了一个基于人脸解析的网络来学习不同尺度的语义信息,并设计了一个新的人脸解析注意模块来利用这些语义特征进行年龄估计。为了在真实场景数据上评估我们的方法,我们还引入了一个新的具有挑战性的大规模基准,称为IMDB-Clean。该数据集是通过使用约束聚类方法对有噪声的IMDB-WIKI数据集进行半自动清理而创建的。通过在IMDB-Clean和其他基准数据集上的综合实验,在数据集内和跨数据集评估协议下,我们证明了我们的方法始终优于所有现有的年龄估计方法,并取得了新的最先进性能。据我们所知,我们的工作首次尝试利用人脸解析注意力来实现语义感知的年龄估计,这可能对其他高层次的人脸分析任务有所启发。 摘要:Image-based age estimation aims to predict a person's age from facial images. It is used in a variety of real-world applications. Although end-to-end deep models have achieved impressive results for age estimation on benchmark datasets, their performance in-the-wild still leaves much room for improvement due to the challenges caused by large variations in head pose, facial expressions, and occlusions. To address this issue, we propose a simple yet effective method to explicitly incorporate facial semantics into age estimation, so that the model would learn to correctly focus on the most informative facial components from unaligned facial images regardless of head pose and non-rigid deformation. To this end, we design a face parsing-based network to learn semantic information at different scales and a novel face parsing attention module to leverage these semantic features for age estimation. To evaluate our method on in-the-wild data, we also introduce a new challenging large-scale benchmark called IMDB-Clean. This dataset is created by semi-automatically cleaning the noisy IMDB-WIKI dataset using a constrained clustering method. Through comprehensive experiment on IMDB-Clean and other benchmark datasets, under both intra-dataset and cross-dataset evaluation protocols, we show that our method consistently outperforms all existing age estimation methods and achieves a new state-of-the-art performance. To the best of our knowledge, our work presents the first attempt of leveraging face parsing attention to achieve semantic-aware age estimation, which may be inspiring to other high level facial analysis tasks.

【2】 CenterAtt: Fast 2-stage Center Attention Network 标题:CenterAtt:快速2级中心关注网络

作者:Jianyun Xu,Xin Tang,Jian Dou,Xu Shu,Yushi Zhu 机构:Hikvision Research Institute 链接:https://arxiv.org/abs/2106.10493 摘要:在本技术报告中,我们介绍了HIKVISION_LiDAR_Det在Waymo开放数据集实时三维检测挑战赛中所采用的方法。我们的参赛方案建立在CenterPoint三维检测框架之上。我们探索了CenterPoint的几种变体,包括中心注意力头(center attention head)和特征金字塔网络颈部结构。为了实现实时检测,采用了BatchNorm合并、半精度浮点网络和GPU加速体素化等方法。通过使用这些方法,我们的团队在Waymo开放数据集实时3D检测挑战赛的所有方法中排名第6。 摘要:In this technical report, we introduce the methods of HIKVISION_LiDAR_Det in the challenge of waymo open dataset real-time 3D detection. Our solution for the competition are built upon Centerpoint 3D detection framework. Several variants of CenterPoint are explored, including center attention head and feature pyramid network neck. In order to achieve real time detection, methods like batchnorm merge, half-precision floating point network and GPU-accelerated voxelization process are adopted. By using these methods, our team ranks 6th among all the methods on real-time 3D detection challenge in the waymo open dataset.
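
报告中提到用"BatchNorm合并"来加速推理,下面给出这种常见做法的一个示意:把BatchNorm的仿射变换折叠进前面卷积的权重与偏置中,使推理时少一层计算(与该团队的具体实现无关)。

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """BatchNorm合并进卷积的常见做法示意:
    w' = w * gamma/sqrt(var+eps),b' = (b-mean)*gamma/sqrt(var+eps) + beta。"""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval(); conv.eval()
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))  # True
```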

人脸|人群计数(1篇)

【1】 Prediction of the facial growth direction with Machine Learning methods 标题:基于机器学习方法的人脸生长方向预测

作者:Stanisław Kaźmierczak,Zofia Juszka,Piotr Fudalej,Jacek Mańdziuk 机构:Warsaw University of Technology, Warsaw, Poland, Prof. Loster’s Orthodontics, ul. Bart�lomieja Nowodworskiego , -, Krakow, Poland, Department of Orthodontics, Jagiellonian University in Krakow, Krakow, Poland 链接:https://arxiv.org/abs/2106.10464 摘要:预测面部生长(FG)方向的首次尝试是在半个多世纪前做出的。尽管经过多次尝试和时间的流逝,一个令人满意的方法尚未建立,这个问题仍然对医学专家提出了挑战。据我们所知,这是第一个机器学习方法预测FG方向。数据分析揭示了问题的内在复杂性,解释了基于二维X射线图像的FG方向预测困难的原因。为了进行增长预测,我们采用了各种各样的算法,从logistic回归、树集合到神经网络,并考虑了三种稍微不同的问题公式。分类精度在71%到75%之间。 摘要:First attempts of prediction of the facial growth (FG) direction were made over half of a century ago. Despite numerous attempts and elapsed time, a satisfactory method has not been established yet and the problem still poses a challenge for medical experts. To our knowledge, this paper is the first Machine Learning approach to the prediction of FG direction. Conducted data analysis reveals the inherent complexity of the problem and explains the reasons of difficulty in FG direction prediction based on 2D X-ray images. To perform growth forecasting, we employ a wide range of algorithms, from logistic regression, through tree ensembles to neural networks and consider three, slightly different, problem formulations. The resulting classification accuracy varies between 71% and 75%.

跟踪(2篇)

【1】 Multiple Object Tracking with Mixture Density Networks for Trajectory Estimation 标题:用于轨迹估计的混合密度网络多目标跟踪

作者:Andreu Girbau,Xavier Giró-i-Nieto,Ignasi Rius,Ferran Marqués 机构:Universitat Politecnica de Catalunya, AutomaticTV 备注:Best paper runner up on CVPR 2021 RVSU workshop 链接:https://arxiv.org/abs/2106.10950 摘要:多目标跟踪面临着一些挑战,可以通过轨迹信息来缓解这些挑战。了解物体的后部位置有助于消除歧义和解决诸如遮挡、重新识别和身份转换等情况。在这项工作中,我们证明了轨迹估计可以成为跟踪的一个关键因素,并提出了TrajE,一个基于递归混合密度网络的轨迹估计,作为一个通用的模块,可以添加到现有的目标跟踪。为了提供一些轨迹假设,我们的方法使用波束搜索。同时,基于相同的估计轨迹,我们建议在遮挡发生后重建轨迹。我们将TrajE集成到两种最先进的跟踪算法中,CenterTrack[63]和Tracktor[3]。他们在MOTA Challenge 2017测试集中的表现在MOTA得分上分别提高了6.3分和0.3分,在IDF1得分上分别提高了1.8分和3.1分,开创了CenterTrack+TrajE配置的新水平 摘要:Multiple object tracking faces several challenges that may be alleviated with trajectory information. Knowing the posterior locations of an object helps disambiguating and solving situations such as occlusions, re-identification, and identity switching. In this work, we show that trajectory estimation can become a key factor for tracking, and present TrajE, a trajectory estimator based on recurrent mixture density networks, as a generic module that can be added to existing object trackers. To provide several trajectory hypotheses, our method uses beam search. Also, relying on the same estimated trajectory, we propose to reconstruct a track after an occlusion occurs. We integrate TrajE into two state of the art tracking algorithms, CenterTrack [63] and Tracktor [3]. Their respective performances in the MOTChallenge 2017 test set are boosted 6.3 and 0.3 points in MOTA score, and 1.8 and 3.1 in IDF1, setting a new state of the art for the CenterTrack+TrajE configuration

【2】 Learning to Track Object Position through Occlusion 标题:学习通过遮挡跟踪对象位置

作者:Satyaki Chakraborty,Martial Hebert 机构:Carnegie Mellon University 链接:https://arxiv.org/abs/2106.10766 摘要:遮挡是目标探测器和跟踪器面临的最重要的挑战之一。虽然目标检测和跟踪在过去都受到了广泛的关注,但该领域现有的大多数方法都没有针对遮挡目标进行检测和跟踪。然而,对于不同的自主任务来说,通过遮挡来检测或跟踪感兴趣的物体是一个长期的挑战。传统的方法是使用视觉目标跟踪器和显式遮挡建模经验漂移,并作出一些基本假设的数据。我们建议在基于区域的视频对象检测器成功的基础上,采用“通过检测跟踪”的方法来解决这一问题。我们的视频级目标检测器在其核心使用了一种新的循环计算单元,即使在遮挡情况下也能实现目标特征的长期传播。最后,我们将我们的方法与现有的最先进的视频对象检测器进行了比较,结果表明我们的方法在从互联网上收集的家具装配视频数据集上取得了很好的效果,其中螺钉、螺母和螺栓等小对象经常被摄像机的视点遮挡。 摘要:Occlusion is one of the most significant challenges encountered by object detectors and trackers. While both object detection and tracking has received a lot of attention in the past, most existing methods in this domain do not target detecting or tracking objects when they are occluded. However, being able to detect or track an object of interest through occlusion has been a long standing challenge for different autonomous tasks. Traditional methods that employ visual object trackers with explicit occlusion modeling experience drift and make several fundamental assumptions about the data. We propose to address this with a `tracking-by-detection` approach that builds upon the success of region based video object detectors. Our video level object detector uses a novel recurrent computational unit at its core that enables long term propagation of object features even under occlusion. Finally, we compare our approach with existing state-of-the-art video object detectors and show that our approach achieves superior results on a dataset of furniture assembly videos collected from the internet, where small objects like screws, nuts, and bolts often get occluded from the camera viewpoint.

裁剪|量化|加速|压缩相关(1篇)

【1】 Sparse Training via Boosting Pruning Plasticity with Neuroregeneration 标题:神经再生增强修剪可塑性的稀疏训练

作者:Shiwei Liu,Tianlong Chen,Xiaohan Chen,Zahra Atashgahi,Lu Yin,Huanyu Kou,Li Shen,Mykola Pechenizkiy,Zhangyang Wang,Decebal Constantin Mocanu 机构:Eindhoven University of Technology,University of Texas at Austin, University of Twente,University of Leeds,JD Explore Academy 链接:https://arxiv.org/abs/2106.10404 摘要:彩票假设(LTH)和单次网络剪枝(SNIP)的研究在训练后剪枝(迭代幅度剪枝)和训练前剪枝(初始化剪枝)上引起了广泛的关注。前一种方法的计算量非常大,而后一种方法的性能往往很差。相比之下,在训练剪枝过程中,一类同时具有训练/推理效率和可比性能的剪枝方法暂时较少被探索。为了更好地理解训练剪枝过程,我们从剪枝可塑性(被剪枝网络恢复原始性能的能力)的角度定量研究了整个训练过程中剪枝的效果。剪枝可塑性可以帮助解释文献中关于神经网络剪枝的其他一些经验观察。我们进一步发现,通过注射一种叫做神经再生的脑激励机制,修剪的可塑性可以得到显著改善,即再生与修剪相同数量的连接。基于剪枝可塑性的观点,我们设计了一种新的渐进幅度剪枝(GMP)方法,称为零代价神经再生渐进剪枝(GraNet)及其动态稀疏训练(DST)变体(GraNet-ST)。它们都是先进的。也许最令人印象深刻的是,后者首次使用ImageNet上的ResNet-50,大大提高了稀疏到稀疏的训练性能。我们将发布所有代码。 摘要:Works on lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) have raised a lot of attention currently on post-training pruning (iterative magnitude pruning), and before-training pruning (pruning at initialization). The former method suffers from an extremely large computation cost and the latter category of methods usually struggles with insufficient performance. In comparison, during-training pruning, a class of pruning methods that simultaneously enjoys the training/inference efficiency and the comparable performance, temporarily, has been less explored. To better understand during-training pruning, we quantitatively study the effect of pruning throughout training from the perspective of pruning plasticity (the ability of the pruned networks to recover the original performance). Pruning plasticity can help explain several other empirical observations about neural network pruning in literature. We further find that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., to regenerate the same number of connections as pruned. Based on the insights from pruning plasticity, we design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant (GraNet-ST). Both of them advance state of the art. Perhaps most impressively, the latter for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet. We will release all codes.
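
下面用一个假设性的最小示意说明"剪枝+神经再生"的一步操作:先按幅值剪掉k个现存连接,再按梯度幅值在空位上重新生长k个连接,使连接总数保持不变;调度与判据细节以论文为准。

```python
import torch

def prune_and_regenerate(weight, grad, mask, k):
    """"剪枝-再生"一步示意:按幅值剪掉k个现存连接,再按梯度幅值生长k个新连接。"""
    w = weight.abs() * mask
    drop_thresh = torch.topk(w[mask.bool()], k, largest=False).values.max()
    mask = mask * (w > drop_thresh).float()                 # 剪枝:去掉幅值最小的k个
    g = grad.abs() * (1 - mask)                             # 只在当前未激活的位置考虑生长
    grow_thresh = torch.topk(g.flatten(), k).values.min()
    mask = torch.clamp(mask + (g >= grow_thresh).float(), max=1)  # 再生:按梯度生长k个
    return mask

mask = (torch.rand(64, 64) > 0.8).float()                   # 初始稀疏掩码
new_mask = prune_and_regenerate(torch.randn(64, 64), torch.randn(64, 64), mask, k=50)
print(int(mask.sum().item()), int(new_mask.sum().item()))   # 连接总数大致保持不变
```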

蒸馏|知识提取(1篇)

【1】 Knowledge Distillation via Instance-level Sequence Learning 标题:基于实例级序列学习的知识提炼

作者:Haoran Zhao,Xin Sun,Junyu Dong,Zihe Dong,Qiong Li 机构: Ocean University of China 链接:https://arxiv.org/abs/2106.10885 摘要:近年来,人们提出了从教师网络中提取一般知识来指导学生网络的方法。现有的大多数方法都是通过输入从数据中均匀抽取的随机小批量序列,将知识从教师网络传递给学生。相反,我们认为,紧凑的学生网络应该引导逐步使用样本排序在一个有意义的序列。从而逐步缩小师生网络在特征表示上的差距。在这项工作中,我们提供了一个课程学习知识提炼的框架,通过实例级序列学习。它利用早期的学生网络作为快照,为学生网络的下一个训练阶段创建课程。我们在CIFAR-10、CIFAR-100、SVHN和CINIC-10数据集上进行了广泛的实验。与几种最先进的方法相比,我们的框架以较少的迭代次数获得了最佳的性能。 摘要:Recently, distillation approaches are suggested to extract general knowledge from a teacher network to guide a student network. Most of the existing methods transfer knowledge from the teacher network to the student via feeding the sequence of random mini-batches sampled uniformly from the data. Instead, we argue that the compact student network should be guided gradually using samples ordered in a meaningful sequence. Thus, it can bridge the gap of feature representation between the teacher and student network step by step. In this work, we provide a curriculum learning knowledge distillation framework via instance-level sequence learning. It employs the student network of the early epoch as a snapshot to create a curriculum for the student network's next training phase. We carry out extensive experiments on CIFAR-10, CIFAR-100, SVHN and CINIC-10 datasets. Compared with several state-of-the-art methods, our framework achieves the best performance with fewer iterations.

视觉解释|视频理解VQA|caption等(2篇)

【1】 TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning 标题:TCIC:主题概念、跨语言学习和图像字幕视觉

作者:Zhihao Fan,Zhongyu Wei,Siyuan Wang,Ruize Wang,Zejun Li,Haijun Shan,Xuanjing Huang 机构:Zhejiang Lab, Research Institute of Intelligent and Complex Systems, Fudan University, China 备注:IJCAI2021 链接:https://arxiv.org/abs/2106.10936 摘要:现有的图像字幕的研究通常是用一个具有低层次事实(对象和关系)的场景图来表示图像,而没有捕捉到高层次的语义。在本文中,我们提出了一个主题概念扩展图像字幕(TCIC)框架,其中包含了主题概念来表示高级跨模态语义。在实践中,我们将主题概念建模为记忆向量,并提出了主题节点转换器(Transformer-with-theme-Nodes,TTN)来整合这些向量用于图像字幕。考虑到主题概念可以从图像和字幕中学习,我们提出了两种基于TTN的主题概念表征学习设置。在视觉方面,TTN被配置为将基于场景图的特征和主题概念作为视觉表示学习的输入。在语言方面,TTN被配置为将字幕和主题概念作为文本表示重构的输入。这两种设置都旨在使用相同的基于转换器的解码器生成目标字幕。在训练过程中,我们进一步将从图像中学习到的主题概念的表达与相应的字幕对齐,以加强跨模态学习。在MS-COCO上的实验结果表明,与一些最新的模型相比,我们的方法是有效的。 摘要:Existing research for image captioning usually represents an image using a scene graph with low-level facts (objects and relations) and fails to capture the high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Considering that theme concepts can be learned from both images and captions, we propose two settings for their representations learning based on TTN. On the vision side, TTN is configured to take both scene graph based features and theme concepts as input for visual representation learning. On the language side, TTN is configured to take both captions and theme concepts as input for text representation re-construction. Both settings aim to generate target captions with the same transformer-based decoder. During the training, we further align representations of theme concepts learned from images and corresponding captions to enforce the cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared to some state-of-the-art models.

【2】 VQA-Aid: Visual Question Answering for Post-Disaster Damage Assessment and Analysis 标题:VQA-Aid:灾后损失评估与分析的可视化问答

作者:Argho Sarkar,Maryam Rahnemoonfar 机构:Bina Lab, University of Maryland, Baltimore County, Maryland, USA 备注:4 pages, 2 figures 链接:https://arxiv.org/abs/2106.10548 摘要:与无人机(UAV)集成的可视化问答系统在灾后损失评估方面有很大的发展潜力。向受影响地区提供援助高度依赖于实时数据评估和分析。可视化问答的范围是理解场景并提供与查询相关的答案,这无疑加快了灾难后的恢复过程。在这项工作中,我们通过展示我们最近开发的被称为“迈克尔”飓风期间收集的VQA数据集,并比较基准VQA模型的性能,讨论了可视化问答(VQA)任务在灾后损害评估中的重要性。 摘要:Visual Question Answering system integrated with Unmanned Aerial Vehicle (UAV) has a lot of potentials to advance the post-disaster damage assessment purpose. Providing assistance to affected areas is highly dependent on real-time data assessment and analysis. Scope of the Visual Question Answering is to understand the scene and provide query related answer which certainly faster the recovery process after any disaster. In this work, we address the importance of \textit{visual question answering (VQA)} task for post-disaster damage assessment by presenting our recently developed VQA dataset called \textit{HurMic-VQA} collected during hurricane Michael, and comparing the performances of baseline VQA models.

超分辨率|去噪|去模糊|去雾(1篇)

【1】 One-to-many Approach for Improving Super-Resolution 标题:提高超分辨率的一对多方法

作者:Sieun Park,Eunho Lee 机构:Goldsmiths, University of London, London, SE,NW, Paul Math School, Chung-cheong bukdo 链接:https://arxiv.org/abs/2106.10437 摘要:超分辨率(SR)是一个一对多的任务,有多种可能的解决方案。然而,以往的研究并不关注这一特征。对于一对多管道,生成器应该能够生成重建的多个估计,并且不会因为生成相似且同样真实的图像而受到惩罚。为了实现这一点,我们建议在剩余密集块(RRDB)中的每一个残差之后加入加权像素噪声,以使生成器能够生成各种图像。我们修改了严格的内容丢失,只要内容一致,就不会惩罚重建图像中的随机变化。此外,我们观察到,在DIV2K和DIV8K数据集中,有一些没有焦点的区域提供了毫无帮助的指导方针。我们使用[10]的方法过滤训练数据中的模糊区域。最后,我们修改鉴别器以接收低解析度影像作为参考影像与目标影像,以提供更好的回馈给产生器。使用我们提出的方法,我们能够提高ESRGAN在x4知觉SR中的性能,并在x16知觉极限SR中获得最先进的LPIPS分数。 摘要:Super-resolution (SR) is a one-to-many task with multiple possible solutions. However, previous works were not concerned about this characteristic. For a one-to-many pipeline, the generator should be able to generate multiple estimates of the reconstruction, and not be penalized for generating similar and equally realistic images. To achieve this, we propose adding weighted pixel-wise noise after every Residual-in-Residual Dense Block (RRDB) to enable the generator to generate various images. We modify the strict content loss to not penalize the stochastic variation in reconstructed images as long as it has consistent content. Additionally, we observe that there are out-of-focus regions in the DIV2K, DIV8K datasets that provide unhelpful guidelines. We filter blurry regions in the training data using the method of [10]. Finally, we modify the discriminator to receive the low-resolution image as a reference image along with the target image to provide better feedback to the generator. Using our proposed methods, we were able to improve the performance of ESRGAN in x4 perceptual SR and achieve the state-of-the-art LPIPS score in x16 perceptual extreme SR.
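
下面示意"在残差块之后注入逐像素加权噪声"的做法:噪声强度为可学习参数,同一输入多次前向会得到不同但内容一致的输出,从而支持一对多的重建;块结构与初始化均为占位假设。

```python
import torch
import torch.nn as nn

class NoisyBlock(nn.Module):
    """在残差块输出后注入逐像素加权噪声的示意:block可以是任意RRDB实现(此处用单层卷积占位),
    noise_weight为可学习的逐通道噪声强度,初始值0.1为任意假设。"""
    def __init__(self, block, channels):
        super().__init__()
        self.block = block
        self.noise_weight = nn.Parameter(torch.full((1, channels, 1, 1), 0.1))

    def forward(self, x):
        out = self.block(x)
        return out + self.noise_weight * torch.randn_like(out)

blk = NoisyBlock(nn.Conv2d(64, 64, 3, padding=1), channels=64)
x = torch.randn(1, 64, 32, 32)
print((blk(x) - blk(x)).abs().mean().item())   # 同一输入两次前向输出不同,体现一对多
```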

点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】 DiGS : Divergence guided shape implicit neural representation for unoriented point clouds 标题:DIGS:无方向点云的散度引导形状隐式神经表示

作者:Yizhak Ben-Shabat,Chamin Hewa Koneputugodage,Stephen Gould 机构:The Australian National University, Technion Israel Institute of Technology 链接:https://arxiv.org/abs/2106.10811 摘要:最近,神经形状表示在形状分析和重建任务中被证明是有效的。现有的神经网络方法需要点坐标和相应的法向量来学习形状的隐式水平集。通常不提供法向量作为原始数据,因此需要近似和重定位作为预处理阶段,这两个阶段都会引入噪声。本文提出了一种不需要法向量作为输入的散度引导的形状表示学习方法。我们证明,在距离函数的散度上加入软约束有利于平滑解,该解可靠地确定梯度方向以匹配每个点的未知法线,在某些情况下甚至比直接使用地面真值法线向量的方法更好。此外,我们提出了一种新的正弦形状表示网络的几何初始化方法,进一步提高了收敛到期望解的能力。我们评估了我们的方法在表面重建任务中的有效性,并与其他无定向方法相比显示了最先进的性能,与定向方法相比,显示了相当的性能。 摘要:Neural shape representations have recently shown to be effective in shape analysis and reconstruction tasks. Existing neural network methods require point coordinates and corresponding normal vectors to learn the implicit level sets of the shape. Normal vectors are often not provided as raw data, therefore, approximation and reorientation are required as pre-processing stages, both of which can introduce noise. In this paper, we propose a divergence guided shape representation learning approach that does not require normal vectors as input. We show that incorporating a soft constraint on the divergence of the distance function favours smooth solutions that reliably orients gradients to match the unknown normal at each point, in some cases even better than approaches that use ground truth normal vectors directly. Additionally, we introduce a novel geometric initialization method for sinusoidal shape representation networks that further improves convergence to the desired solution. We evaluate the effectiveness of our approach on the task of surface reconstruction and show state-of-the-art performance compared to other unoriented methods and on-par performance compared to oriented methods.
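
下面示意如何用自动求导对隐式距离场施加"散度软约束":先求网络输出对输入点的梯度场,再逐维求导得到散度(拉普拉斯),取其绝对值均值作为正则项;网络结构为占位,具体约束形式以论文为准。

```python
import torch

def divergence_loss(net, x):
    """散度软约束示意:对梯度场逐维再求导并相加得到散度,取绝对值均值作为正则项。"""
    x = x.requires_grad_(True)
    y = net(x).sum()
    grad = torch.autograd.grad(y, x, create_graph=True)[0]        # (N, 3) 梯度场
    div = 0.0
    for i in range(x.shape[1]):
        div = div + torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)[0][:, i]
    return div.abs().mean()

net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Softplus(), torch.nn.Linear(64, 1))
print(divergence_loss(net, torch.randn(128, 3)).item())
```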

多模态(1篇)

【1】 Contrastive Multi-Modal Clustering 标题:对比多模态聚类

作者:Jie Xu,Huayi Tang,Yazhou Ren,Xiaofeng Zhu,Lifang He 机构:University of Electronic Science and Technology of China, Lehigh University 链接:https://arxiv.org/abs/2106.11193 摘要:多模态聚类是从多个模态或视图中挖掘互补信息的一种方法,越来越受到人们的关注。然而,现有的研究很少集中于提取多模态的高层语义信息进行聚类。本文提出了对比多模态聚类(CMMC),通过对比学习挖掘高层语义信息。具体来说,我们的框架包括三个部分(1) 多个自动编码器进行了优化,以保持每个模态的多样性,学习互补信息(2) 提出了一个特征对比模块,从不同的语态中学习常见的高级语义特征(3) 标签对比模块旨在学习所有模式的一致聚类分配。通过提出的多模态对比学习,在保持低层潜在特征多样性的同时,最大化了高层特征的互信息。此外,为了利用学习到的高级语义特征,我们进一步通过解决最大匹配问题来生成伪标签,以微调聚类分配。大量实验表明,CMMC具有良好的可扩展性,并优于现有的多模式聚类方法。 摘要:Multi-modal clustering, which explores complementary information from multiple modalities or views, has attracted people's increasing attentions. However, existing works rarely focus on extracting high-level semantic information of multiple modalities for clustering. In this paper, we propose Contrastive Multi-Modal Clustering (CMMC) which can mine high-level semantic information via contrastive learning. Concretely, our framework consists of three parts. (1) Multiple autoencoders are optimized to maintain each modality's diversity to learn complementary information. (2) A feature contrastive module is proposed to learn common high-level semantic features from different modalities. (3) A label contrastive module aims to learn consistent cluster assignments for all modalities. By the proposed multi-modal contrastive learning, the mutual information of high-level features is maximized, while the diversity of the low-level latent features is maintained. In addition, to utilize the learned high-level semantic features, we further generate pseudo labels by solving a maximum matching problem to fine-tune the cluster assignments. Extensive experiments demonstrate that CMMC has good scalability and outperforms state-of-the-art multi-modal clustering methods.

3D|3D重建等相关(1篇)

【1】 3D Shape Registration Using Spectral Graph Embedding and Probabilistic Matching 标题:基于谱图嵌入和概率匹配的三维形状配准

作者:Avinash Sharma,Radu Horaud,Diana Mateus 机构:Inria Grenoble Rhˆone-Alpes, avenue de l’Europe, Montbonnot Saint-Martin, France 链接:https://arxiv.org/abs/2106.11166 摘要:提出了一种基于谱图理论和概率匹配的三维形状配准方法。三维形状分析的任务包括跟踪、识别、配准等。考虑到不同采集设备采集到的数据具有很大的可变性,在单个框架中分析三维数据仍然是一项具有挑战性的任务。三维形状配准就是这样一项具有挑战性的形状分析任务。本章的主要贡献是将谱图匹配与拉普拉斯嵌入相结合,将谱图匹配方法推广到超大图。由于图的嵌入表示是通过降维得到的,我们认为现有的基于谱的方法不容易应用。讨论了精确图同构和不精确图同构问题的解,回顾了组合图拉普拉斯算子的主要谱性质;我们提供了一个新的分析通勤时间嵌入,使我们能够解释后者的PCA的一个图形,并选择适当的维数相关的嵌入度量空间;我们推导了一个单位超球标准化的通勤时间嵌入,使我们能够注册两个不同的采样形状;我们提出了一种新的方法来寻找特征值特征向量排序和特征向量符号,该方法利用特征特征签名(直方图)对等距形状变形保持不变,并且很适合于谱图匹配框架,提出了一种基于期望最大化点配准算法的概率形状匹配公式,该算法在对齐特征基和寻找顶点到顶点分配之间交替进行。 摘要:We address the problem of 3D shape registration and we propose a novel technique based on spectral graph theory and probabilistic matching. The task of 3D shape analysis involves tracking, recognition, registration, etc. Analyzing 3D data in a single framework is still a challenging task considering the large variability of the data gathered with different acquisition devices. 3D shape registration is one such challenging shape analysis task. The main contribution of this chapter is to extend the spectral graph matching methods to very large graphs by combining spectral graph matching with Laplacian embedding. Since the embedded representation of a graph is obtained by dimensionality reduction we claim that the existing spectral-based methods are not easily applicable. We discuss solutions for the exact and inexact graph isomorphism problems and recall the main spectral properties of the combinatorial graph Laplacian; We provide a novel analysis of the commute-time embedding that allows us to interpret the latter in terms of the PCA of a graph, and to select the appropriate dimension of the associated embedded metric space; We derive a unit hyper-sphere normalization for the commute-time embedding that allows us to register two shapes with different samplings; We propose a novel method to find the eigenvalue-eigenvector ordering and the eigenvector signs using the eigensignature (histogram) which is invariant to the isometric shape deformations and fits well in the spectral graph matching framework, and we present a probabilistic shape matching formulation using an expectation maximization point registration algorithm which alternates between aligning the eigenbases and finding a vertex-to-vertex assignment.

其他神经网络|深度学习|模型|建模(11篇)

【1】 Multi-VAE: Learning Disentangled View-common and View-peculiar Visual Representations for Multi-view Clustering 标题:Multi-VAE:学习用于多视图聚类的非纠缠视图公共和视图特有的视觉表示

作者:Jie Xu,Yazhou Ren,Huayi Tang,Xiaorong Pu,Xiaofeng Zhu,Ming Zeng,Lifang He 机构:University of Electronic Science and Technology of China, Carnegie Mellon University,Lehigh University 链接:https://arxiv.org/abs/2106.11232 摘要:多视图聚类是一个由来已久的重要研究课题,主要研究从不同的角度挖掘互补信息。然而,现有的研究往往在一个公共特征空间中融合多个视图的表示或处理聚类,这可能导致它们的纠缠,特别是对于视觉表示。为了解决这个问题,我们提出了一个新的基于VAE的多视图聚类框架(multi-VAE)。具体地说,在生成模型中定义了一个视图公共变量和多个视图特殊变量。视图公共变量的先验服从近似离散的Gumbel-Softmax分布,引入该分布提取多视图的公共聚类因子。同时,视图特有变量的先验服从连续高斯分布,用来表示每个视图特有的视觉因素。通过控制互信息容量来分离视图的公共表示和视图的特殊表示,可以分离出多个视图的连续视觉信息,从而有效地挖掘出它们的公共离散聚类信息。实验结果表明,与现有的聚类方法相比,Multi-VAE在获得更好的聚类性能的同时,具有良好的可解释性。 摘要:Multi-view clustering, a long-standing and important research problem, focuses on mining complementary information from diverse views. However, existing works often fuse multiple views' representations or handle clustering in a common feature space, which may result in their entanglement especially for visual representations. To address this issue, we present a novel VAE-based multi-view clustering framework (Multi-VAE) by learning disentangled visual representations. Concretely, we define a view-common variable and multiple view-peculiar variables in the generative model. The prior of view-common variable obeys approximately discrete Gumbel Softmax distribution, which is introduced to extract the common cluster factor of multiple views. Meanwhile, the prior of view-peculiar variable follows continuous Gaussian distribution, which is used to represent each view's peculiar visual factors. By controlling the mutual information capacity to disentangle the view-common and view-peculiar representations, continuous visual information of multiple views can be separated so that their common discrete cluster information can be effectively mined. Experimental results demonstrate that Multi-VAE enjoys the disentangled and explainable visual representations, while obtaining superior clustering performance compared with state-of-the-art methods.

【2】 Automatic Plant Cover Estimation with CNNs Automatic Plant Cover Estimation with Convolutional Neural Networks 标题:基于CNNs的植被覆盖度自动估算卷积神经网络植被覆盖度自动估算

作者:Matthias Körschens,Paul Bodesheim,Christine Römermann,Solveig Franziska Bucher,Mirco Migliavacca,Josephine Ulrich,Joachim Denzler 机构: Friedrich Schiller University Jena 链接:https://arxiv.org/abs/2106.11154 摘要:监测植物对环境变化的响应是植物多样性研究的关键。然而,这项工作目前仍由实地的植物学家手工完成。这项工作是非常艰苦的,获得的数据是,虽然遵循一个标准化的方法来估计植物覆盖率,通常是主观的,有一个粗略的时间分辨率。为了弥补这些不足,我们研究了利用卷积神经网络(CNNs)自动从图像中提取相关数据的方法,重点研究了9种草本植物的群落组成和物种覆盖率。为此,我们研究了几种标准的CNN结构和不同的预训练方法。我们发现,我们优于我们以前的方法在更高的图像分辨率使用自定义CNN的平均绝对误差为5.16%。除了这些调查,我们还进行了误差分析的基础上,时间方面的植物覆盖图像。这一分析使我们深入了解了自动方法的问题所在,比如遮挡和可能由时间变化引起的错误分类。 摘要:Monitoring the responses of plants to environmental changes is essential for plant biodiversity research. This, however, is currently still being done manually by botanists in the field. This work is very laborious, and the data obtained is, though following a standardized method to estimate plant coverage, usually subjective and has a coarse temporal resolution. To remedy these caveats, we investigate approaches using convolutional neural networks (CNNs) to automatically extract the relevant data from images, focusing on plant community composition and species coverages of 9 herbaceous plant species. To this end, we investigate several standard CNN architectures and different pretraining methods. We find that we outperform our previous approach at higher image resolutions using a custom CNN with a mean absolute error of 5.16%. In addition to these investigations, we also conduct an error analysis based on the temporal aspect of the plant cover images. This analysis gives insight into where problems for automatic approaches lie, like occlusion and likely misclassifications caused by temporal changes.

【3】 A Game-Theoretic Taxonomy of Visual Concepts in DNNs 标题:DNNs中视觉概念的博弈论分类

作者:Xu Cheng,Chuntung Chu,Yi Zheng,Jie Ren,Quanshi Zhang 机构:Shanghai Jiao Tong University, University of Toronto, China University of Mining & Technology 备注:12 pages 链接:https://arxiv.org/abs/2106.10938 摘要:本文从一个新的角度,即图像中像素之间的博弈论多阶交互,重新思考了DNN如何编码不同复杂性的视觉概念。除了对象的分类法和纹理和形状的认知分类法之外,我们还提供了一种新的视觉概念分类法,帮助我们从概念复杂性的角度解释形状和纹理的编码。这样,基于多阶相互作用,我们发现了DNNs编码纹理的三种不同的信号处理行为。此外,我们还发现DNN编码形状的灵活性低于编码纹理的灵活性。此外,我们还分析了DNNs如何对异常样本进行编码,并探讨了网络结构对交互的影响。此外,我们还阐明了多阶交互在实际应用中的关键作用。该代码将在论文被接受时发布。 摘要:In this paper, we rethink how a DNN encodes visual concepts of different complexities from a new perspective, i.e. the game-theoretic multi-order interactions between pixels in an image. Beyond the categorical taxonomy of objects and the cognitive taxonomy of textures and shapes, we provide a new taxonomy of visual concepts, which helps us interpret the encoding of shapes and textures, in terms of concept complexities. In this way, based on multi-order interactions, we find three distinctive signal-processing behaviors of DNNs encoding textures. Besides, we also discover the flexibility for a DNN to encode shapes is lower than the flexibility of encoding textures. Furthermore, we analyze how DNNs encode outlier samples, and explore the impacts of network architectures on interactions. Additionally, we clarify the crucial role of the multi-order interactions in real-world applications. The code will be released when the paper is accepted.

【4】 PIANO: A Parametric Hand Bone Model from Magnetic Resonance Imaging 标题:钢琴:一种基于磁共振成像的参数化手骨模型

作者:Yuwei Li,Minye Wu,Yuyao Zhang,Lan Xu,Jingyi Yu 机构:Shanghai Engineering Research Center of Intelligent Vision and Imaging, School of Information, Science and Technology, ShanghaiTech University, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences 备注:Accepted to IJCAI 2021 链接:https://arxiv.org/abs/2106.10893 摘要:手部建模对于身临其境的VR/AR、动作理解或人类保健至关重要。现有的参数化模型只考虑手的形状、姿势或纹理,而没有对骨骼等解剖属性进行建模,而后者对于真实的手部生物力学分析是必不可少的。在本文中,我们提出了PIANO,这是第一个基于磁共振数据的人手参数化骨骼模型。我们的PIANO模型在生物学上是正确的、易于制作动画并且可微分,与传统的只基于外表面的手部模型相比,它以数据驱动的方式对手部内部运动学结构实现了解剖学上更精确的建模。此外,我们的PIANO模型可以作为神经网络层使用,使得训练可以采用细粒度的语义损失,这开辟了从MRI甚至RGB图像出发、以数据驱动方式进行细粒度手部骨骼解剖与语义理解的新任务。我们公开了我们的模型。 摘要:Hand modeling is critical for immersive VR/AR, action understanding, or human healthcare. Existing parametric models account only for hand shape, pose, or texture, without modeling the anatomical attributes like bone, which is essential for realistic hand biomechanics analysis. In this paper, we present PIANO, the first parametric bone model of human hands from MRI data. Our PIANO model is biologically correct, simple to animate, and differentiable, achieving more anatomically precise modeling of the inner hand kinematic structure in a data-driven manner than the traditional hand models based on the outer surface only. Furthermore, our PIANO model can be applied in neural network layers to enable training with a fine-grained semantic loss, which opens up the new task of data-driven fine-grained hand bone anatomic and semantic understanding from MRI or even RGB images. We make our model publicly available.

【5】 Neighborhood Contrastive Learning for Novel Class Discovery 标题:邻域对比学习在新颖类发现中的应用

作者:Zhun Zhong,Enrico Fini,Subhankar Roy,Zhiming Luo,Elisa Ricci,Nicu Sebe 机构:University of Trento, Xiamen University ,Fondazione Bruno Kessler 备注:CVPR 2021 链接:https://arxiv.org/abs/2106.10731 摘要:在本文中,我们讨论了新类发现(NCD),即在给定包含已知类的标记数据集的情况下,从一组未标记样本中发现新类的任务。我们利用NCD的特点建立了一个新的框架,称为邻域对比学习(NCL),用来学习对聚类性能有重要影响的区分性表示。我们的贡献是双重的。首先,我们发现在标记集上训练的特征提取器所生成的表示中,一个查询样本与其近邻很可能属于同一类。我们利用这一观察结果,通过对比学习来检索和聚合伪正对,从而鼓励模型学习更多的区分性表征。其次,我们注意到大多数实例很容易被网络所识别,对对比损失的贡献较小。为了克服这个问题,我们提出在特征空间中混合标记样本和未标记样本来产生难负样本(hard negatives)。我们通过实验证明了这两个因素对聚类性能的显著贡献,并使我们的模型大大优于最先进的方法(例如,在CIFAR-100上聚类准确率+13%,在ImageNet上聚类准确率+8%)。 摘要:In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes. We exploit the peculiarities of NCD to build a new framework, named Neighborhood Contrastive Learning (NCL), to learn discriminative representations that are important to clustering performance. Our contribution is twofold. First, we find that a feature extractor trained on the labeled set generates representations in which a generic query sample and its neighbors are likely to share the same class. We exploit this observation to retrieve and aggregate pseudo-positive pairs with contrastive learning, thus encouraging the model to learn more discriminative representations. Second, we notice that most of the instances are easily discriminated by the network, contributing less to the contrastive loss. To overcome this issue, we propose to generate hard negatives by mixing labeled and unlabeled samples in the feature space. We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin (e.g., clustering accuracy +13% on CIFAR-100 and +8% on ImageNet).
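下面用一个简短的 PyTorch 草图示意“在特征空间中混合有标签与无标签样本来产生难负样本”这一思路;混合系数、特征维度与归一化方式均为演示性假设,论文中的具体混合与选取策略请以原文为准。

import torch

def mix_hard_negatives(z_labeled, z_unlabeled, alpha=0.6):
    """在特征空间中线性插值,构造难负样本(示意实现)。"""
    idx = torch.randperm(z_unlabeled.size(0))[: z_labeled.size(0)]
    mixed = alpha * z_labeled + (1.0 - alpha) * z_unlabeled[idx]
    return torch.nn.functional.normalize(mixed, dim=1)  # 对比学习中常对特征做 L2 归一化

z_l = torch.nn.functional.normalize(torch.randn(16, 128), dim=1)
z_u = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
hard_negs = mix_hard_negatives(z_l, z_u)
print(hard_negs.shape)  # torch.Size([16, 128])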

【6】 NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction 标题:Neus:用于多视点重建的体绘制学习神经隐式曲面

作者:Peng Wang,Lingjie Liu,Yuan Liu,Christian Theobalt,Taku Komura,Wenping Wang 机构:†The University of Hong Kong ‡Max Planck Institute for Informatics, ⋄Texas A&M University 备注:22 pages, 17 figures 链接:https://arxiv.org/abs/2106.10689 摘要:我们提出了一种新的神经表面重建方法,称为NeuS,用于从二维图像输入重建具有高保真度的物体和场景。现有的神经表面重建方法,如DVR和IDR,需要前景掩模作为监督,容易陷入局部极小值,难以重建自遮挡严重或结构较薄的物体。同时,最近用于新视图合成的神经方法,例如NeRF及其变体,使用体绘制来产生具有优化鲁棒性的神经场景表示,即使对于高度复杂的对象也是如此。然而,从这种学习的隐式表示中提取高质量的曲面是困难的,因为表示中没有足够的曲面约束。在NeuS中,我们提出将曲面表示为有符号距离函数(SDF)的零水平集,并提出一种新的体绘制方法来训练神经SDF表示。我们观察到传统的体绘制方法在表面重建时会产生固有的几何误差(即偏差),因此提出了一种在一阶近似中没有偏差的新公式,从而在没有掩模监督的情况下也能得到更精确的表面重建。在DTU数据集和BlendedMVS数据集上的实验表明,NeuS在高质量的曲面重建方面优于现有技术,特别是对于具有复杂结构和自遮挡的物体和场景。 摘要:We present a novel neural surface reconstruction method, called NeuS, for reconstructing objects and scenes with high fidelity from 2D image inputs. Existing neural surface reconstruction approaches, such as DVR and IDR, require foreground mask as supervision, easily get trapped in local minima, and therefore struggle with the reconstruction of objects with severe self-occlusion or thin structures. Meanwhile, recent neural methods for novel view synthesis, such as NeRF and its variants, use volume rendering to produce a neural scene representation with robustness of optimization, even for highly complex objects. However, extracting high-quality surfaces from this learned implicit representation is difficult because there are not sufficient surface constraints in the representation. In NeuS, we propose to represent a surface as the zero-level set of a signed distance function (SDF) and develop a new volume rendering method to train a neural SDF representation. We observe that the conventional volume rendering method causes inherent geometric errors (i.e. bias) for surface reconstruction, and therefore propose a new formulation that is free of bias in the first order of approximation, thus leading to more accurate surface reconstruction even without the mask supervision. Experiments on the DTU dataset and the BlendedMVS dataset show that NeuS outperforms the state-of-the-arts in high-quality surface reconstruction, especially for objects and scenes with complex structures and self-occlusion.
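下面按上述思路整理了一个极简示意:沿光线采样 SDF 值,用带尺度的 Sigmoid 把 SDF 转为不透明度,再按标准体绘制公式求权重。其中不透明度的具体形式只是对论文做法的近似复述,无偏权重的精确公式与数值细节请以原文为准。

import torch

def alpha_from_sdf(sdf, s=64.0):
    """由相邻采样点的 SDF 值估计不透明度 alpha(示意,参考 NeuS 的思路)。"""
    phi = torch.sigmoid(s * sdf)
    alpha = ((phi[:, :-1] - phi[:, 1:]) / (phi[:, :-1] + 1e-6)).clamp(min=0.0)
    return alpha

def render_weights(alpha):
    """标准体绘制权重:w_i = alpha_i * prod_{j<i}(1 - alpha_j)。"""
    ones = torch.ones_like(alpha[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-7], dim=1), dim=1)[:, :-1]
    return alpha * trans

sdf = torch.linspace(0.5, -0.5, 65).repeat(4, 1)   # 4 条光线、每条 65 个采样点(玩具数据)
w = render_weights(alpha_from_sdf(sdf))
print(w.shape, w.sum(dim=1))                        # 权重在零等值面(表面)附近集中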

【7】 Practical Assessment of Generalization Performance Robustness for Deep Networks via Contrastive Examples 标题:基于对比实例的深度网络泛化性能健壮性实用评估

作者:Xuanyu Wu,Xuhong Li,Haoyi Xiong,Xiao Zhang,Siyu Huang,Dejing Dou 机构:University of Pennsylvania, Philadelphia, USA, Baidu Research, Baidu Inc., China, Tsinghua University, Beijing, China, Nanyang Technological University, Singapore 链接:https://arxiv.org/abs/2106.10653 摘要:有研究提出用经过数据变换的训练图像作为对比示例,来补充用于评估深度神经网络(DNNs)泛化性能的测试集。在这项工作中,我们提出了一个实用的框架ContRE(“contre”一词在法语中意为“反对”或“对抗”),它使用对比示例来进行DNN泛化性能评估。具体而言,ContRE遵循对比学习的假设,即具有良好泛化性能的稳健DNN模型能够在不同的数据变换下从同一幅图像中提取一致的特征集并做出一致的预测。结合一组在训练集上设计良好的随机数据变换策略,ContRE在生成的对比示例上采用分类错误和Fisher比率来评估和分析深度模型的泛化性能,作为测试集的补充。为了证明ContRE的有效性和效率,我们在三个开源基准数据集上使用各种DNN模型进行了广泛的实验,并进行了深入的消融研究和适用性分析。我们的实验结果证实:(1)深度模型在对比示例上的行为与其在测试集上的行为有很强的相关性;(2)在多种设置下,ContRE都是一种可作为测试集补充的稳健泛化性能度量。 摘要:Training images with data transformations have been suggested as contrastive examples to complement the testing set for generalization performance evaluation of deep neural networks (DNNs). In this work, we propose a practical framework ContRE (The word "contre" means "against" or "versus" in French.) that uses Contrastive examples for DNN geneRalization performance Estimation. Specifically, ContRE follows the assumption in contrastive learning that robust DNN models with good generalization performance are capable of extracting a consistent set of features and making consistent predictions from the same image under varying data transformations. Incorporating with a set of randomized strategies for well-designed data transformations over the training set, ContRE adopts classification errors and Fisher ratios on the generated contrastive examples to assess and analyze the generalization performance of deep models in complement with a testing set. To show the effectiveness and the efficiency of ContRE, extensive experiments have been done using various DNN models on three open source benchmark datasets with thorough ablation studies and applicability analyses. Our experiment results confirm that (1) behaviors of deep models on contrastive examples are strongly correlated to what on the testing set, and (2) ContRE is a robust measure of generalization performance complementing to the testing set in various settings.

【8】 CompConv: A Compact Convolution Module for Efficient Feature Learning 标题:CompConv:一种用于高效特征学习的紧凑型卷积模块

作者:Chen Zhang,Yinghao Xu,Yujun Shen 机构:Zhejiang University, The Chinese University of Hong Kong 链接:https://arxiv.org/abs/2106.10486 摘要:卷积神经网络(CNNs)在各种计算机视觉任务中取得了显著的成功,但其计算量巨大。为了解决这个问题,现有的方法要么压缩经过良好训练的大规模模型,要么通过精心设计的网络结构学习轻量级模型。在这项工作中,我们对卷积算子进行了深入的研究,卷积算子是CNNs中使用的基本单元,以减少其计算量。特别是,我们提出了一个紧凑的卷积模块,称为CompConv,以促进有效的特征学习。通过分治策略,CompConv可以节省大量的计算量和参数来生成特定的维度特征图。此外,CompConv谨慎地将输入特征集成到输出中,以有效地继承输入信息。更重要的是,新的CompConv是一个即插即用模块,可以直接应用于现代CNN结构,以取代香草卷积层,而无需进一步的努力。大量的实验结果表明,CompConv可以充分压缩CNN的基准结构,但几乎没有牺牲性能,超过了其他竞争对手。 摘要:Convolutional Neural Networks (CNNs) have achieved remarkable success in various computer vision tasks but rely on tremendous computational cost. To solve this problem, existing approaches either compress well-trained large-scale models or learn lightweight models with carefully designed network structures. In this work, we make a close study of the convolution operator, which is the basic unit used in CNNs, to reduce its computing load. In particular, we propose a compact convolution module, called CompConv, to facilitate efficient feature learning. With the divide-and-conquer strategy, CompConv is able to save a great many computations as well as parameters to produce a certain dimensional feature map. Furthermore, CompConv discreetly integrates the input features into the outputs to efficiently inherit the input information. More importantly, the novel CompConv is a plug-and-play module that can be directly applied to modern CNN structures to replace the vanilla convolution layers without further effort. Extensive experimental results suggest that CompConv can adequately compress the benchmark CNN structures yet barely sacrifice the performance, surpassing other competitors.

【9】 Multi-Contextual Design of Convolutional Neural Network for Steganalysis 标题:用于隐写分析的卷积神经网络的多上下文设计

作者:Brijesh Singh,Arijit Sur,Pinaki Mitra 机构: and Pinaki Mitra are with Department ofComputer Science & Engineering, Indian Institute of Technology Guwa-hati 备注:This work has been submitted to the IEEE Transactions on Information Forensics and Security (IEEE-TIFS) for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 链接:https://arxiv.org/abs/2106.10430 摘要:近年来,基于深度学习的隐写分析分类器由于其先进的性能而受到广泛的关注。大多数深度隐写分析分类器通常采用高通滤波器作为预处理步骤提取噪声残差,并将其输入深度模型进行分类。研究发现,最近的隐写嵌入并不总是限制其在高频区的嵌入;相反,他们按照嵌入策略分发它。因此,除了噪声残差,学习嵌入区域是另一项具有挑战性的任务。与传统方法不同的是,本文提出的模型首先利用学习的去噪核提取噪声残差,以提高信噪比。经过预处理后,将稀疏的噪声残差反馈给一种新的多上下文卷积神经网络(M-CNET),该网络利用不同的上下文大小来学习噪声残差的稀疏和低幅度表示。通过引入自注意模块,进一步提高了模型的性能,使模型聚焦于易于隐写嵌入的区域。通过一组综合实验,证明了该方案的有效性。此外,还进行了烧蚀研究,验证了所提出的体系结构中各个模块的贡献。 摘要:In recent times, deep learning-based steganalysis classifiers became popular due to their state-of-the-art performance. Most deep steganalysis classifiers usually extract noise residuals using high-pass filters as preprocessing steps and feed them to their deep model for classification. It is observed that recent steganographic embedding does not always restrict their embedding in the high-frequency zone; instead, they distribute it as per embedding policy. Therefore, besides noise residual, learning the embedding zone is another challenging task. In this work, unlike the conventional approaches, the proposed model first extracts the noise residual using learned denoising kernels to boost the signal-to-noise ratio. After preprocessing, the sparse noise residuals are fed to a novel Multi-Contextual Convolutional Neural Network (M-CNET) that uses heterogeneous context size to learn the sparse and low-amplitude representation of noise residuals. The model performance is further improved by incorporating the Self-Attention module to focus on the areas prone to steganalytic embedding. A set of comprehensive experiments is performed to show the proposed scheme's efficacy over the prior arts. Besides, an ablation study is given to justify the contribution of various modules of the proposed architecture.

【10】 Underwater Image Restoration via Contrastive Learning and a Real-world Dataset 标题:基于对比学习和真实数据集的水下图像复原

作者:Junlin Han,Mehrdad Shoeiby,Tim Malthus,Elizabeth Botha,Janet Anstee,Saeed Anwar,Ran Wei,Mohammad Ali Armin,Hongdong Li,Lars Petersson 机构: Aus-tralian National University (ANU) 备注:In submission, code/dataset are at this https URL arXiv admin note: text overlap with arXiv:2103.09697 链接:https://arxiv.org/abs/2106.10718 摘要:水下图像复原对于揭示水下世界具有重要意义。在过去的几十年里,已经发展了许多技术和算法。然而,由于成像/传感、光照和折射几何畸变的基本困难,在获取清晰的水下图像时,还没有对水下图像恢复进行全面的评估。为了弥补这一差距,我们构建了一个大规模的真实水下图像数据集,称为“HICRD”(鹭岛珊瑚礁数据集),目的是对现有方法进行基准测试,并支持开发新的基于深度学习的方法。我们采用精确的水参数(漫射衰减系数)来生成参考图像。在未配对训练集中,有2000幅参考复原图像和6003幅原始水下图像。在此基础上,提出了一种基于无监督图像到图像转换框架的水下图像复原方法。我们提出的方法利用对比学习和生成对抗网络来最大化原始图像和恢复图像之间的互信息。通过大量的实验和与现有方法的比较,进一步证明了该方法的优越性。我们的代码和数据集可以在GitHub上公开获得。 摘要:Underwater image restoration is of significant importance in unveiling the underwater world. Numerous techniques and algorithms have been developed in the past decades. However, due to fundamental difficulties associated with imaging/sensing, lighting, and refractive geometric distortions, in capturing clear underwater images, no comprehensive evaluations have been conducted of underwater image restoration. To address this gap, we have constructed a large-scale real underwater image dataset, dubbed `HICRD' (Heron Island Coral Reef Dataset), for the purpose of benchmarking existing methods and supporting the development of new deep-learning based methods. We employ accurate water parameter (diffuse attenuation coefficient) in generating reference images. There are 2000 reference restored images and 6003 original underwater images in the unpaired training set. Further, we present a novel method for underwater image restoration based on unsupervised image-to-image translation framework. Our proposed method leveraged contrastive learning and generative adversarial networks to maximize the mutual information between raw and restored images. Extensive experiments with comparisons to recent approaches further demonstrate the superiority of our proposed method. Our code and dataset are publicly available at GitHub.

【11】 Nuclei Grading of Clear Cell Renal Cell Carcinoma in Histopathological Image by Composite High-Resolution Network 标题:基于复合高分辨率网络的肾透明细胞癌组织病理学图像细胞核分级

作者:Zeyu Gao,Jiangbo Shi,Xianli Zhang,Yang Li,Haichuan Zhang,Jialun Wu,Chunbao Wang,Deyu Meng,Chen Li 机构: School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi , China, National Engineering Lab for Big Data Analytics, Xi’an Jiaotong University, Xi’an 备注:Accepted by MICCAI 2021 链接:https://arxiv.org/abs/2106.10641 摘要:肾透明细胞癌(ccRCC)的分级是影响预后的重要因素,因此ccRCC细胞核分级是肾透明细胞癌病理分析的重要内容。计算机辅助细胞核分级的目的是通过自动识别组织病理学图像中肿瘤细胞核的级别,提高病理学家的工作效率,同时降低误诊率。这样的任务需要精确地分割和分类细胞核。然而,现有的核分割和分类方法大多不能处理核分级的类间相似性,不能直接应用于ccRCC的分级任务。本文提出了一种用于ccRCC核分级的复合高分辨率网络。具体地说,我们提出了一个称为W-Net的分割网络,它可以分离聚集的核。然后,基于两个高分辨率特征抽取器(HRFEs)将核的细粒度分类重构为两个跨类别分类任务。两个HRFEs通过一个复合连接与W-Net共享同一个主干编码器,这样分割任务的有意义的特征就可以被继承用于分类任务。最后,应用头部融合块生成每个核的预测标记。此外,我们还介绍了一个用于ccRCC细胞核分级的数据集,其中包含1000个图像块和70945个带注释的细胞核。我们证明,我们提出的方法达到国家的最先进的性能相比,现有的方法在这个大型ccRCC分级数据集。 摘要:The grade of clear cell renal cell carcinoma (ccRCC) is a critical prognostic factor, making ccRCC nuclei grading a crucial task in RCC pathology analysis. Computer-aided nuclei grading aims to improve pathologists' work efficiency while reducing their misdiagnosis rate by automatically identifying the grades of tumor nuclei within histopathological images. Such a task requires precisely segment and accurately classify the nuclei. However, most of the existing nuclei segmentation and classification methods can not handle the inter-class similarity property of nuclei grading, thus can not be directly applied to the ccRCC grading task. In this paper, we propose a Composite High-Resolution Network for ccRCC nuclei grading. Specifically, we propose a segmentation network called W-Net that can separate the clustered nuclei. Then, we recast the fine-grained classification of nuclei to two cross-category classification tasks, based on two high-resolution feature extractors (HRFEs) which are proposed for learning these two tasks. The two HRFEs share the same backbone encoder with W-Net by a composite connection so that meaningful features for the segmentation task can be inherited for the classification task. Last, a head-fusion block is applied to generate the predicted label of each nucleus. Furthermore, we introduce a dataset for ccRCC nuclei grading, containing 1000 image patches with 70945 annotated nuclei. We demonstrate that our proposed method achieves state-of-the-art performance compared to existing methods on this large ccRCC grading dataset.

其他(20篇)

【1】 How Do Adam and Training Strategies Help BNNs Optimization? 标题:ADAM和训练策略如何帮助BNN优化?

作者:Zechun Liu,Zhiqiang Shen,Shichao Li,Koen Helwegen,Dong Huang,Kwang-Ting Cheng 机构: IntroductionBinary Neural Networks (BNNs) have gained increasingattention in recent years due to the high compression ra-Equal contribution 1Hong Kong University of Science andTechnology 2Carnegie Mellon University 3Plumerai 备注:ICML 2021. Code and models are available at this https URL 链接:https://arxiv.org/abs/2106.11309 摘要:最佳性能的二元神经网络(BNNs)通常是通过Adam优化及其多步训练来实现的。然而,据我们所知,很少有研究探讨Adam在BNN优化方面优于SGD等其他优化器的根本原因,或者提供支持特定训练策略的分析性解释。为了解决这个问题,在本文中,我们首先研究了在训练过程中的梯度和权重的轨迹。我们证明了Adam中二阶动量的正则化效应对于恢复BNNs中由于激活饱和而死亡的权值是至关重要的。我们发现Adam通过其自适应学习速率策略,能够更好地处理BNNs的粗糙损失面,并获得更好的最优解,具有更高的泛化能力。此外,我们还考察了实值权值在二元网络中的有趣作用,揭示了权值衰减对BNN优化稳定性和迟滞性的影响。通过大量的实验和分析,我们得到了一个简单的训练方案,建立在现有的基于Adam的优化基础上,使用与最先进的ReActNet相同的体系结构,在ImageNet数据集上实现了70.5%的top-1精度,同时实现了1.1%的更高精度。代码和型号可在https://github.com/liuzechun/AdamBNN. 摘要:The best performing Binary Neural Networks (BNNs) are usually attained using Adam optimization and its multi-step training variants. However, to the best of our knowledge, few studies explore the fundamental reasons why Adam is superior to other optimizers like SGD for BNN optimization or provide analytical explanations that support specific training strategies. To address this, in this paper we first investigate the trajectories of gradients and weights in BNNs during the training process. We show the regularization effect of second-order momentum in Adam is crucial to revitalize the weights that are dead due to the activation saturation in BNNs. We find that Adam, through its adaptive learning rate strategy, is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability. Furthermore, we inspect the intriguing role of the real-valued weights in binary networks, and reveal the effect of weight decay on the stability and sluggishness of BNN optimization. Through extensive experiments and analysis, we derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset using the same architecture as the state-of-the-art ReActNet while achieving 1.1% higher accuracy. Code and models are available at https://github.com/liuzechun/AdamBNN.
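下面给出一个非常简化的二值网络训练草图(并非 ReActNet 的官方实现):前向用 sign 对权重二值化、反向用直通估计器(STE)把梯度传回潜在的实值权重,并用文中讨论的 Adam 来更新;网络结构与超参数均为演示性假设。

import torch
import torch.nn as nn

class BinaryLinear(nn.Module):
    """示意:前向用二值化权重,反向用直通估计器(STE)把梯度传回实值权重。"""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.01)

    def forward(self, x):
        w = self.weight
        w_bin = torch.sign(w)
        w_ste = w_bin.detach() + w - w.detach()   # 前向取 sign(w),反向梯度近似为恒等
        return x @ w_ste.t()

model = nn.Sequential(BinaryLinear(784, 256), nn.ReLU(), BinaryLinear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)  # 文中分析的即此类基于 Adam 的优化

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))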

【2】 Fast Simultaneous Gravitational Alignment of Multiple Point Sets 标题:多点集快速同时重力对齐

作者:Vladislav Golyanik,Soshi Shimada,Christian Theobalt 机构:Max Planck Institute for Informatics, SIC 备注:None 链接:https://arxiv.org/abs/2106.11308 摘要:对任何输入都不带偏向的多个无序点集同时刚性配准问题,近年来引起了越来越多的关注,并有一些可靠的新方法被提出。虽然这些方法对噪声和聚集的异常值具有显著的鲁棒性,但它们需要复杂的初始化方案,并且不能很好地扩展到大型点集。本文提出了一种新的弹性多点集同时配准技术,将各点集解释为在相互诱导的力场中做刚性运动的粒子群。由于改进了模拟、修改了物理定律,并用2^D树(D是空间维数)加速了全局多重连接的点间相互作用,我们的多体引力方法(MBGA)对噪声和缺失数据具有鲁棒性,同时支持比以前方法更大的点集(10^5个点甚至更多)。在不同的实验设置下,MBGA在精度和运行时间上都优于几种基线点集配准方法。我们为社区提供源代码,以促进结果的可重复性。 摘要:The problem of simultaneous rigid alignment of multiple unordered point sets which is unbiased towards any of the inputs has recently attracted increasing interest, and several reliable methods have been newly proposed. While being remarkably robust towards noise and clustered outliers, current approaches require sophisticated initialisation schemes and do not scale well to large point sets. This paper proposes a new resilient technique for simultaneous registration of multiple point sets by interpreting the latter as particle swarms rigidly moving in the mutually induced force fields. Thanks to the improved simulation with altered physical laws and acceleration of globally multiply-linked point interactions with a 2^D-tree (D is the space dimensionality), our Multi-Body Gravitational Approach (MBGA) is robust to noise and missing data while supporting more massive point sets than previous methods (with 10^5 points and more). In various experimental settings, MBGA is shown to outperform several baseline point set alignment approaches in terms of accuracy and runtime. We make our source code available for the community to facilitate the reproducibility of the results.

【3】 Neural Marching Cubes 标题:神经行进立方体

作者:Zhiqin Chen,Hao Zhang 机构: Simon Fraser University 链接:https://arxiv.org/abs/2106.11272 摘要:介绍了一种数据驱动的从离散隐式场中提取三角形网格的方法——神经行进立方体(NMC)。经典的MC是由分离到单个立方体的粗糙镶嵌模板定义的。虽然提出了更精细的细分,但在确定每个立方体中的顶点位置和局部网格拓扑时,它们都会进行启发式假设,例如三线性。原则上,这些方法都不能重建几何特征,这些特征揭示了相邻立方体(例如,尖锐边缘)之间的一致性或依赖性,因为这些信息无法解释,从而导致对真实潜在隐式场的错误估计。为了应对这些挑战,我们从深度学习的角度重新铸造MC,通过设计更易于保留几何特征的镶嵌模板,并从训练网格中学习顶点位置和网格拓扑,从附近的立方体中获取上下文信息。我们开发了一个紧凑的每立方参数化来表示输出三角形网格,同时与神经处理兼容,因此可以使用一个简单的三维卷积网络来进行训练。我们证明了每个立方体中所有适用于我们设计的拓扑情况都可以通过我们的表示很容易地导出,并且通过遵循一些设计准则也可以自然而有效地获得所得到的细分。此外,我们的网络在有限的感受野中学习局部特征,因此它可以很好地推广到新的形状和新的数据集。我们通过对所有已知的MC变体进行定量和定性的比较来评估我们的神经MC方法。特别是,我们展示了我们的网络恢复尖锐特征的能力,例如边缘和角落,这是MC及其变体的一个长期问题。我们的网络也比以前的方法更精确地重建局部网格拓扑。 摘要:We introduce Neural Marching Cubes (NMC), a data-driven approach for extracting a triangle mesh from a discretized implicit field. Classical MC is defined by coarse tessellation templates isolated to individual cubes. While more refined tessellations have been proposed, they all make heuristic assumptions, such as trilinearity, when determining the vertex positions and local mesh topologies in each cube. In principle, none of these approaches can reconstruct geometric features that reveal coherence or dependencies between nearby cubes (e.g., a sharp edge), as such information is unaccounted for, resulting in poor estimates of the true underlying implicit field. To tackle these challenges, we re-cast MC from a deep learning perspective, by designing tessellation templates more apt at preserving geometric features, and learning the vertex positions and mesh topologies from training meshes, to account for contextual information from nearby cubes. We develop a compact per-cube parameterization to represent the output triangle mesh, while being compatible with neural processing, so that a simple 3D convolutional network can be employed for the training. We show that all topological cases in each cube that are applicable to our design can be easily derived using our representation, and the resulting tessellations can also be obtained naturally and efficiently by following a few design guidelines. In addition, our network learns local features with limited receptive fields, hence it generalizes well to new shapes and new datasets. We evaluate our neural MC approach by quantitative and qualitative comparisons to all well-known MC variants. In particular, we demonstrate the ability of our network to recover sharp features such as edges and corners, a long-standing issue of MC and its variants. Our network also reconstructs local mesh topologies more accurately than previous approaches.
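作为背景参照,下面用 scikit-image(假设安装了提供 measure.marching_cubes 的较新版本)对一个球形 SDF 网格运行经典 Marching Cubes,以说明 NMC 想要改进的基线流程;这只是经典算法的演示,并非论文提出的神经网络版本。

import numpy as np
from skimage import measure

# 构造一个 64^3 的隐式场:以原点为中心、半径 0.6 的球的符号距离
grid = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.6

# 经典 Marching Cubes:在 0 等值面上抽取三角网格
verts, faces, normals, values = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)   # 顶点与三角面片数量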

【4】 Reliability and Validity of Image-Based and Self-Reported Skin Phenotype Metrics 标题:基于图像和自我报告的皮肤表型指标的信度和效度

作者:John J. Howard,Yevgeniy B. Sirotin,Jerry L. Tipton,Arun R. Vemury 机构: Vemury works at the United States Department of Homeland Security 备注:11 pages, 5 figures 链接:https://arxiv.org/abs/2106.11240 摘要:随着越来越多的人脸识别系统被采用,确保这些技术在不同人口群体中都有足够好的表现变得十分重要。最近,在探索性能差异时,肤色等表型被认为是优于传统种族类别的替代指标。然而,对于如何在生物特征性能评估或更广泛的人工智能中恰当地测量肤色,人们几乎没有共识。在这项研究中,我们探讨了从图像估计的面部区域亮度测量值(FALMs)与使用专门测量人体皮肤的设备采集的真实皮肤读数之间的关系。从同一个体的不同图像估计的FALM相对于真实FALM存在显著差异,只有通过更好地控制采集条件(相机、背景和环境)才能减少这种差异。接下来,我们将真实FALM与通过标准的面对面医学调查获得的Fitzpatrick皮肤类型(FST)分类进行比较,结果表明FST对肤色的预测能力很差。最后,我们展示了对FALM的噪声估计如何导致在为人口统计学差异选择解释因素时出错。这些结果表明,用于生物特征性能评估的肤色测量必须来自客观的、经过表征的和受控的来源。此外,尽管这是目前常见的做法,但从不受控制的图像中估计FST类别和FALM并不能提供恰当的肤色度量。 摘要:With increasing adoption of face recognition systems, it is important to ensure adequate performance of these technologies across demographic groups. Recently, phenotypes such as skin-tone, have been proposed as superior alternatives to traditional race categories when exploring performance differentials. However, there is little consensus regarding how to appropriately measure skin-tone in evaluations of biometric performance or in AI more broadly. In this study, we explore the relationship between face-area-lightness-measures (FALMs) estimated from images and ground-truth skin readings collected using a device designed to measure human skin. FALMs estimated from different images of the same individual varied significantly relative to ground-truth FALM. This variation was only reduced by greater control of acquisition (camera, background, and environment). Next, we compare ground-truth FALM to Fitzpatrick Skin Types (FST) categories obtained using the standard, in-person, medical survey and show FST is poorly predictive of skin-tone. Finally, we show how noisy estimation of FALM leads to errors selecting explanatory factors for demographic differentials. These results demonstrate that measures of skin-tone for biometric performance evaluations must come from objective, characterized, and controlled sources. Further, despite this being a currently practiced approach, estimating FST categories and FALMs from uncontrolled imagery does not provide an appropriate measure of skin-tone.
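为便于理解“面部区域亮度测量(FALM)”的含义,下面给出一个示意性计算:把人脸框内的像素转换到 CIELAB 颜色空间后取 L* 通道均值作为亮度度量。人脸框坐标、颜色空间选择等均为假设,论文实际采用的测量定义可能不同。

import numpy as np
from skimage import color

def face_area_lightness(rgb_image, face_box):
    """对给定人脸框内的像素,计算 CIELAB 空间 L* 的平均值(示意)。"""
    top, left, height, width = face_box
    patch = rgb_image[top:top + height, left:left + width]
    lab = color.rgb2lab(patch)          # 输入应为 [0,1] 范围的浮点 RGB
    return float(lab[..., 0].mean())    # L* 取值范围约为 0~100

img = np.random.rand(256, 256, 3)       # 以随机图像代替真实人脸照片,仅作演示
print(face_area_lightness(img, face_box=(80, 80, 96, 96)))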

【5】 Can poachers find animals from public camera trap images? 标题:偷猎者能从公共相机捕捉到的图像中找到动物吗?

作者:Sara Beery,Elizabeth Bondi 机构:California Institute of Technology†, Harvard University‡, Equal contribution 备注:CV4Animals Workshop at CVPR 2021 链接:https://arxiv.org/abs/2106.11236 摘要:为了保护包含敏感、高目标物种的相机陷阱数据的位置,许多生态学家在发布数据时随机混淆相机的经纬度。例如,他们可以为网络中的每个摄像机发布一个随机位置,该位置位于真实摄像机位置的1公里半径范围内。在本文中,我们研究了地理模糊处理对维护相机陷阱位置隐私的鲁棒性,并通过一个案例研究表明,一些简单、直观的启发式算法和公开可用的卫星光栅可以用来减少可能包含相机的面积87%(假设在1km内随机模糊),证明地理模糊可能不如以前认为的有效。 摘要:To protect the location of camera trap data containing sensitive, high-target species, many ecologists randomly obfuscate the latitude and longitude of the camera when publishing their data. For example, they may publish a random location within a 1km radius of the true camera location for each camera in their network. In this paper, we investigate the robustness of geo-obfuscation for maintaining camera trap location privacy, and show via a case study that a few simple, intuitive heuristics and publicly available satellite rasters can be used to reduce the area likely to contain the camera by 87% (assuming random obfuscation within 1km), demonstrating that geo-obfuscation may be less effective than previously believed.
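下面用几行 Python 示意文中所述的地理模糊做法:在真实相机位置 1 公里半径内均匀随机发布一个位置(这里采用局部平面近似来换算经纬度,坐标为任意示例)。

import math
import random

def obfuscate(lat, lon, radius_m=1000.0):
    """在以 (lat, lon) 为圆心、radius_m 为半径的圆内均匀采样一个新位置(局部平面近似)。"""
    r = radius_m * math.sqrt(random.random())     # 开方保证在圆内均匀分布
    theta = random.uniform(0.0, 2.0 * math.pi)
    dlat = (r * math.cos(theta)) / 111320.0       # 每度纬度约 111.32 km
    dlon = (r * math.sin(theta)) / (111320.0 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

print(obfuscate(-23.44, 151.91))  # 示例坐标(任意选取)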

【6】 Does Optimal Source Task Performance Imply Optimal Pre-training for a Target Task? 标题:优化源任务性能是否意味着目标任务的最佳预训练?

作者:Steven Gutstein,Brent Lance,Sanjay Shakkottai 机构: of Electrical and Computer Engineeringat the University of Texas at Austin 链接:https://arxiv.org/abs/2106.11174 摘要:预训练的深度网络通常被用来提高神经网络的精度并缩短训练时间。人们通常认为,把网络预训练到源任务的最佳性能,最有利于它学习任意的目标任务。但事实往往并非如此。在达到最佳性能之前就停止源任务的训练,可以得到一个更适合学习新任务的预训练网络。我们通过多组实验展示了这种效应,以及训练量和学习率的影响。此外,我们还发现,这反映了一种普遍的学习能力丧失,甚至会影响到对源任务本身的重新学习。 摘要:Pre-trained deep nets are commonly used to improve accuracies and training times for neural nets. It is generally assumed that pre-training a net for optimal source task performance best prepares it to learn an arbitrary target task. This is generally not true. Stopping source task training, prior to optimal performance, can create a pre-trained net better suited for learning a new task. We performed several experiments demonstrating this effect, as well as the influence of amount of training and of learning rate. Additionally, we show that this reflects a general loss of learning ability that even extends to relearning the source task

【7】 Graceful Degradation and Related Fields 标题:优美退化及其相关领域

作者:Jack Dymond 机构:School of Electronics and Computer Science, University of Southampton 链接:https://arxiv.org/abs/2106.11119 摘要:当机器学习模型遇到的数据超出了训练所依据的分布范围时,它们往往表现得很差,最突出的是对错误预测的过度自信。这种行为将对现实世界的机器学习系统产生灾难性的影响。在这个领域中,优雅退化指的是在遇到这种分布外数据时对模型性能的优化。这项工作提出了一个优雅退化的定义和讨论,以及它可以应用于部署的视觉系统。在此之后,对相关领域进行了调查,新颖地将优雅退化问题分为主动和被动两种方法。在被动方法中,优雅退化由模型以自包含的方式处理和实现,在主动方法中,当遇到认知不确定性时,模型被更新。这项工作传达了问题的重要性,旨在促进机器学习策略的发展,意识到优雅退化。 摘要:When machine learning models encounter data which is out of the distribution on which they were trained they have a tendency to behave poorly, most prominently over-confidence in erroneous predictions. Such behaviours will have disastrous effects on real-world machine learning systems. In this field graceful degradation refers to the optimisation of model performance as it encounters this out-of-distribution data. This work presents a definition and discussion of graceful degradation and where it can be applied in deployed visual systems. Following this a survey of relevant areas is undertaken, novelly splitting the graceful degradation problem into active and passive approaches. In passive approaches, graceful degradation is handled and achieved by the model in a self-contained manner, in active approaches the model is updated upon encountering epistemic uncertainties. This work communicates the importance of the problem and aims to prompt the development of machine learning strategies that are aware of graceful degradation.

【8】 Pre-training also Transfers Non-Robustness 标题:预训练也传递了非健壮性

作者:Jiaming Zhang,Jitao Sang,Qi Yi,Huiwen Dong,Jian Yu 机构:Beijing Jiaotong University, Beijing Normal University 链接:https://arxiv.org/abs/2106.10989 摘要:预训练使许多任务取得了最新成果。尽管预训练对泛化有显著的贡献,但我们在本研究中观察到,预训练也会将非鲁棒性从预训练模型转移到微调模型中。以图像分类为例,我们首先在各种数据集和网络主干上进行实验,探讨影响鲁棒性的因素。随后进一步分析了微调模型与标准模型的差异,揭示了导致非鲁棒性转移的原因。最后,通过正则化目标任务和源任务之间的差异,提出了一种简单的鲁棒预训练方法。结果验证了该方法在抑制非鲁棒性和保持泛化方面的有效性。 摘要:Pre-training has enabled many state-of-the-art results on many tasks. In spite of its recognized contribution to generalization, we observed in this study that pre-training also transfers the non-robustness from pre-trained model into the fine-tuned model. Using image classification as an example, we first conducted experiments on various datasets and network backbones to explore the factors influencing robustness. Further analysis is conducted on examining the difference between the fine-tuned model and standard model to uncover the reason leading to the non-robustness transfer. Finally, we introduce a simple robust pre-training solution by regularizing the difference between target and source tasks. Results validate the effectiveness in alleviating non-robustness and preserving generalization.

【9】 Moving in a 360 World: Synthesizing Panoramic Parallaxes from a Single Panorama 标题:在360世界中移动:从单个全景图合成全景视差

作者:Ching-Yu Hsu,Cheng Sun,Hwann-Tzong Chen 机构:National Tsing Hua University 链接:https://arxiv.org/abs/2106.10859 摘要:我们提出了全向神经辐射场(OmniNeRF),这是第一种利用视差合成全景图的方法。近年来,新的视点合成研究主要集中在有限视野的透视图像上,需要在特定条件下获得足够的图像。相反,OmniNeRF可以在给定一幅等矩形图像作为训练数据的情况下生成未知视点的全景图像。为此,我们提出在不同的虚拟摄像机位置,通过在三维世界和不同的二维全景坐标之间来回投影来增强单个RGB-D全景图。通过这样做,我们能够优化一个全向神经辐射场,从固定中心的全向视角收集可见像素,从不同的摄像机位置估计新的视角。结果,提出的OmniNeRF实现了具有视差效果的新颖全景图的令人信服的渲染。我们在合成数据集和真实数据集上展示了我们的每个建议的有效性。 摘要:We present Omnidirectional Neural Radiance Fields (OmniNeRF), the first method to the application of parallax-enabled novel panoramic view synthesis. Recent works for novel view synthesis focus on perspective images with limited field-of-view and require sufficient pictures captured in a specific condition. Conversely, OmniNeRF can generate panorama images for unknown viewpoints given a single equirectangular image as training data. To this end, we propose to augment the single RGB-D panorama by projecting back and forth between a 3D world and different 2D panoramic coordinates at different virtual camera positions. By doing so, we are able to optimize an Omnidirectional Neural Radiance Field with visible pixels collecting from omnidirectional viewing angles at a fixed center for the estimation of new viewing angles from varying camera positions. As a result, the proposed OmniNeRF achieves convincing renderings of novel panoramic views that exhibit the parallax effect. We showcase the effectiveness of each of our proposals on both synthetic and real-world datasets.
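下面给出等距矩形(equirectangular)全景图像素坐标与三维光线方向之间的常见换算草图,这类方法在不同虚拟相机位置之间来回投影时依赖的正是这类基本几何关系;坐标轴约定为常见假设,具体实现以原文为准。

import numpy as np

def pixel_to_direction(u, v, width, height):
    """等距矩形像素 (u, v) -> 单位球面方向向量(经度横向、纬度纵向的常见约定)。"""
    lon = (u / width) * 2.0 * np.pi - np.pi          # [-pi, pi)
    lat = np.pi / 2.0 - (v / height) * np.pi         # [pi/2, -pi/2]
    return np.array([np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)])

def direction_to_pixel(d, width, height):
    """单位方向向量 -> 等距矩形像素坐标(与上面的映射互逆)。"""
    lon, lat = np.arctan2(d[0], d[2]), np.arcsin(np.clip(d[1], -1.0, 1.0))
    return ((lon + np.pi) / (2.0 * np.pi)) * width, ((np.pi / 2.0 - lat) / np.pi) * height

d = pixel_to_direction(512, 256, 2048, 1024)
print(direction_to_pixel(d, 2048, 1024))   # 应当近似回到 (512, 256)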

【10】 Robust Pooling through the Data Mode 标题:通过数据模式实现强健的池化

作者:Ayman Mukhaimar,Ruwan Tennakoon,Chow Yin Lai,Reza Hoseinnezhad,AlirezaBab-Hadiashar 机构:Bab-Hadiashar∗, ∗School of Engineering, RMIT University, Melbourne, Australia, †School of Science, RMIT University, Melbourne, Australia, ‡ Department of Electronic & Electrical Engineering, University College London, UK 备注:under consideration at Computer Vision and Image Understanding 链接:https://arxiv.org/abs/2106.10850 摘要:由于数据中经常出现噪声和异常点,从点云数据中进行学习一直是一项具有挑战性的任务。这些数据的不精确性会显著影响最先进的深度学习网络的性能以及它们对对象进行分类或分割的能力。虽然已有一些鲁棒的深度学习方法,但对于实时应用来说,它们的计算成本太高。本文提出了一种深度学习解决方案,其中包括一个新的鲁棒池化层,该层大大增强了网络的鲁棒性,并且执行速度明显快于现有的方法。所提出的池化层使用RANSAC和直方图这两种方法来寻找数据的众数/簇,因为簇往往对应于模型。我们将该池化层嵌入到基于点和基于图的神经网络等框架中进行测试,结果表明,与最先进的鲁棒方法相比,它具有更强的鲁棒性。 摘要:The task of learning from point cloud data is always challenging due to the often occurrence of noise and outliers in the data. Such data inaccuracies can significantly influence the performance of state-of-the-art deep learning networks and their ability to classify or segment objects. While there are some robust deep learning approaches, they are computationally too expensive for real-time applications. This paper proposes a deep learning solution that includes a novel robust pooling layer which greatly enhances network robustness and performs significantly faster than state-of-the-art approaches. The proposed pooling layer looks for data a mode/cluster using two methods, RANSAC, and histogram, as clusters are indicative of models. We tested the pooling layer into frameworks such as Point-based and graph-based neural networks, and the tests showed enhanced robustness as compared to robust state-of-the-art methods.
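下面用 NumPy 勾勒“直方图求众数”这一种鲁棒池化思路:对每个特征通道做直方图,用计数最多的分箱中心代替均值或最大值,从而削弱离群点的影响;分箱数为假设取值,基于 RANSAC 的版本此处未展示。

import numpy as np

def histogram_mode_pool(features, bins=32):
    """features: (N, C) 的逐点特征;返回每个通道直方图众数所在分箱的中心 (C,)。"""
    pooled = np.empty(features.shape[1])
    for c in range(features.shape[1]):
        counts, edges = np.histogram(features[:, c], bins=bins)
        k = int(np.argmax(counts))
        pooled[c] = 0.5 * (edges[k] + edges[k + 1])
    return pooled

feats = np.random.randn(1024, 8)
feats[:64] += 50.0            # 人为注入离群点
print(np.round(histogram_mode_pool(feats), 2))   # 众数池化基本不受离群点影响
print(np.round(feats.max(axis=0), 2))            # 而最大池化会被离群点主导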

【11】 Mobile Sensing for Multipurpose Applications in Transportation 标题:移动传感在交通运输中的多用途应用

作者:Armstrong Aboah,Michael Boeding,Yaw Adu-Gyamfi 机构:PhD Student, Civil and Environmental Engineering Department, University of Missouri-Columbia, Computer Science Department, Assistant Professor, This paper is being submitted to the Transportation Research Board (TRB) for presentation at the 链接:https://arxiv.org/abs/2106.10733 摘要:为了解决当前的交通问题,需要进行常规且一致的数据收集。当使用复杂的机器收集数据时,数据收集的成本会显著增加。由于这一限制,各州交通部门难以收集到一致的数据来及时分析和解决交通问题。智能手机集成传感器的最新进展带来了一种更经济实惠的数据采集方法。本研究的主要目标是开发和实现一个智能手机数据采集应用程序。目前设计的应用程序包括三个主要模块:前端图形用户界面(GUI)、传感器模块和后端模块。前端用户界面负责与用户交互,而传感器模块在应用使用期间收集视频和加速度计读数等相关数据。后端则由firebase存储组成,用于存储收集到的数据。与其他用于收集路面信息的应用程序相比,当前的应用程序并不过度依赖互联网,因此可以在互联网访问受限的地区使用。我们通过在连接密苏里州哥伦比亚市和密苏里州堪萨斯城的i70W高速公路上收集数据,对开发的应用程序进行了评估,并将数据用于多种用途的分析,包括计算国际平整度指数(IRI)、识别路面病害,以及了解驾驶员的行为和环境。结果表明,该应用程序采集的数据质量较高。 摘要:Routine and consistent data collection is required to address contemporary transportation issues. The cost of data collection increases significantly when sophisticated machines are used to collect data. Due to this constraint, State Departments of Transportation struggles to collect consistent data for analyzing and resolving transportation problems in a timely manner. Recent advancements in the sensors integrated into smartphones have resulted in a more affordable method of data collection. The primary objective of this study is to develop and implement a smartphone application for data collection. The currently designed app consists of three major modules: a frontend graphical user interface (GUI), a sensor module, and a backend module. While the frontend user interface enables interaction with the app, the sensor modules collect relevant data such as video and accelerometer readings while the app is in use. The backend, on the other hand, is made up of firebase storage, which is used to store the gathered data. In comparison to other developed apps for collecting pavement information, this current app is not overly reliant on the internet enabling the app to be used in areas of restricted internet access. The developed application was evaluated by collecting data on the i70W highway connecting Columbia, Missouri, and Kansas City, Missouri. The data was analyzed for a variety of purposes, including calculating the International Roughness Index (IRI), identifying pavement distresses, and understanding driver's behaviour and environment. The results of the application indicate that the data collected by the app is of high quality.

【12】 CAMERAS: Enhanced Resolution And Sanity preserving Class Activation Mapping for image saliency 标题:摄像机:增强分辨率和保持理智的图像显著类激活映射

作者:Mohammad A. A. K. Jalwana,Naveed Akhtar,Mohammed Bennamoun,Ajmal Mian 机构:Computer Science and Software Engineering, The University of Western Australia., Code: 备注:IEEE CVPR 2021 paper 链接:https://arxiv.org/abs/2106.10649 摘要:反向传播图像显著性旨在通过估计输入中单个像素对模型的重要性来解释模型预测。然而,网络早期层对类别不敏感,使得显著性只能基于较深层的低分辨率激活图来计算,从而损害了图像显著性的质量;而对此进行补救又可能导致健全性检验(sanity check)失败。我们提出CAMERAS,一种无需任何外部先验即可计算高保真反向传播显著性图、同时保持其健全性的技术。该方法系统地对激活图和反向传播梯度进行多尺度累积和融合,计算出精确的显著性图。我们的高分辨率映射为黑盒深度视觉模型带来了本文所阐述的多种新颖见解:从精确的图像显著性,到刻画不同输入特征对模型的相对重要性,再到精确区分模型对视觉相似对象的感知。我们还通过将攻击信号集中在显著性图所确定的精确区域、从而大幅降低攻击信号的范数,证明了我们的显著性图在对抗设置中的实用性。我们的方法也为这一发展中的研究方向带来了新的评价指标和健全性检验。代码见 https://github.com/VisMIL/CAMERAS 摘要:Backpropagation image saliency aims at explaining model predictions by estimating model-centric importance of individual pixels in the input. However, class-insensitivity of the earlier layers in a network only allows saliency computation with low resolution activation maps of the deeper layers, resulting in compromised image saliency. Remedifying this can lead to sanity failures. We propose CAMERAS, a technique to compute high-fidelity backpropagation saliency maps without requiring any external priors and preserving the map sanity. Our method systematically performs multi-scale accumulation and fusion of the activation maps and backpropagated gradients to compute precise saliency maps. From accurate image saliency to articulation of relative importance of input features for different models, and precise discrimination between model perception of visually similar objects, our high-resolution mapping offers multiple novel insights into the black-box deep visual models, which are presented in the paper. We also demonstrate the utility of our saliency maps in adversarial setup by drastically reducing the norm of attack signals by focusing them on the precise regions identified by our maps. Our method also inspires new evaluation metrics and a sanity check for this developing research direction. Code is available here https://github.com/VisMIL/CAMERAS

【13】 FloorPP-Net: Reconstructing Floor Plans using Point Pillars for Scan-to-BIM 标题:FloorPP-NET:使用点柱进行扫描到BIM重建平面图

作者:Yijie Wu,Fan Xue 机构:The University of Hong Kong, Pokfulam, Hong Kong, China 链接:https://arxiv.org/abs/2106.10635 摘要:针对Scan-to-BIM(扫描到建筑信息模型)任务,本文提出了一种基于深度学习的点云处理方法FloorPP-Net。FloorPP-Net首先将输入的楼层点云转换为点柱(point pillars),然后预测角点和边,输出楼层平面图。总之,FloorPP-Net为Scan-to-Floor-Plan(Scan2FP)任务建立了一个端到端的有监督学习框架。在与CVPR 2021联合举办的第一届国际Scan-to-BIM挑战赛中,FloorPP-Net获得了平面图重建赛道的季军(second runner-up)。未来的工作包括通用的边建议、二维平面图正则化和三维BIM重建。 摘要:This paper presents a deep learning-based point cloud processing method named FloorPP-Net for the task of Scan-to-BIM (building information model). FloorPP-Net first converts the input point cloud of a building story into point pillars (PP), then predicts the corners and edges to output the floor plan. Altogether, FloorPP-Net establishes an end-to-end supervised learning framework for the Scan-to-Floor-Plan (Scan2FP) task. In the 1st International Scan-to-BIM Challenge held in conjunction with CVPR 2021, FloorPP-Net was ranked the second runner-up in the floor plan reconstruction track. Future work includes general edge proposals, 2D plan regularization, and 3D BIM reconstruction.

【14】 ReGO: Reference-Guided Outpainting for Scenery Image 标题:REGO:参考导引的风景图像外绘

作者:Yaxiong Wang,Yunchao Wei,Xueming Qian,Li Zhu,Yi Yang 机构: Xi’an Jiaotong University 备注:Image outpainting, 13 pages 链接:https://arxiv.org/abs/2106.10601 摘要:在这项工作中,我们旨在解决具有挑战性但非常实用的风景图像外绘(outpainting)任务。近年来,生成对抗学习通过为给定图像生成语义一致的内容,极大地推动了图像外绘的发展。然而,现有方法往往存在纹理模糊和生成部分出现伪影的问题,使得整体的外绘结果缺乏真实感。为了克服这一缺点,本文研究了一种从近邻(即参考图像)中借用像素来合成纹理丰富结果的方法,称为参考引导外绘(ReGO)。具体而言,ReGO设计了一个自适应内容选择(ACS)模块来迁移参考图像的像素,对目标图像进行纹理补偿。为了防止生成部分的风格受到参考图像的影响,进一步提出了风格排序损失来增强ReGO,以合成风格一致的结果。在NS6K和NS8K这两个流行的基准上进行的大量实验充分证明了ReGO的有效性。 摘要:We aim to tackle the challenging yet practical scenery image outpainting task in this work. Recently, generative adversarial learning has significantly advanced the image outpainting by producing semantic consistent content for the given image. However, the existing methods always suffer from the blurry texture and the artifacts of the generative part, making the overall outpainting results lack authenticity. To overcome the weakness, this work investigates a principle way to synthesize texture-rich results by borrowing pixels from its neighbors (\ie, reference images), named \textbf{Re}ference-\textbf{G}uided \textbf{O}utpainting (ReGO). Particularly, the ReGO designs an Adaptive Content Selection (ACS) module to transfer the pixel of reference images for texture compensating of the target one. To prevent the style of the generated part from being affected by the reference images, a style ranking loss is further proposed to augment the ReGO to synthesize style-consistent results. Extensive experiments over two popular benchmarks, NS6K~\cite{yangzx} and NS8K~\cite{wang}, well demonstrate the effectiveness of our ReGO.

【15】 GLIB: Towards Automated Test Oracle for Graphically-Rich Applications 标题:GLib:面向图形丰富的应用程序的自动化测试Oracle

作者:Ke Chen,Yufei Li,Yingfeng Chen,Changjie Fan,Zhipeng Hu,Wei Yang 机构:Fuxi AI Lab in Netease, Hangzhou, China, The university of Texas at Dallas, Dallas, USA 备注:Accepted by ESEC/FSE 2021 链接:https://arxiv.org/abs/2106.10507 摘要:图形丰富的应用程序(如游戏)无处不在,图形用户界面(GUI)的视觉效果非常吸引人,它在软件应用程序和最终用户之间架起了一座桥梁。然而,各种类型的图形故障可能会产生这样的GUI复杂性,并已成为软件兼容性问题的主要组成部分之一。我们对网易公司(NetEase Inc.)游戏开发团队的bug报告进行的研究表明,图形故障经常发生在GUI呈现过程中,并严重降低了图形丰富的应用程序(如视频游戏)的质量。针对此类应用程序的现有自动化测试技术主要集中于生成各种GUI测试序列,并检查测试序列是否会导致崩溃。这些技术需要人类不断地注意捕获非崩溃性的bug,比如导致图形故障的bug。在本文中,我们将介绍自动化测试oracle的第一步,以便在图形丰富的应用程序中检测非崩溃错误。具体来说,我们提出了基于代码的数据增强技术的\texttt{GLIB}来检测游戏GUI故障。我们在20个真实世界的游戏应用程序(有bug报告)上对\texttt{GLIB}进行了评估,结果表明\texttt{GLIB}可以达到100%的准确率和99.5%的召回率。在另外14款真实游戏(没有bug报告)上的实际应用进一步证明了\textt{GLIB}可以有效地发现GUI的bug,目前为止\textt{GLIB}报告的53个bug中有48个已经得到确认和修复。 摘要:Graphically-rich applications such as games are ubiquitous with attractive visual effects of Graphical User Interface (GUI) that offers a bridge between software applications and end-users. However, various types of graphical glitches may arise from such GUI complexity and have become one of the main component of software compatibility issues. Our study on bug reports from game development teams in NetEase Inc. indicates that graphical glitches frequently occur during the GUI rendering and severely degrade the quality of graphically-rich applications such as video games. Existing automated testing techniques for such applications focus mainly on generating various GUI test sequences and check whether the test sequences can cause crashes. These techniques require constant human attention to captures non-crashing bugs such as bugs causing graphical glitches. In this paper, we present the first step in automating the test oracle for detecting non-crashing bugs in graphically-rich applications. Specifically, we propose \texttt{GLIB} based on a code-based data augmentation technique to detect game GUI glitches. We perform an evaluation of \texttt{GLIB} on 20 real-world game apps (with bug reports available) and the result shows that \texttt{GLIB} can achieve 100\% precision and 99.5\% recall in detecting non-crashing bugs such as game GUI glitches. Practical application of \texttt{GLIB} on another 14 real-world games (without bug reports) further demonstrates that \texttt{GLIB} can effectively uncover GUI glitches, with 48 of 53 bugs reported by \texttt{GLIB} having been confirmed and fixed so far.

【16】 Unbalanced Feature Transport for Exemplar-based Image Translation 标题:基于样本的图像翻译中的不平衡特征传输

作者:Fangneng Zhan,Yingchen Yu,Kaiwen Cui,Gongjie Zhang,Shijian Lu,Jianxiong Pan,Changgong Zhang,Feiying Ma,Xuansong Xie,Chunyan Miao 机构: Nanyang Technological University, DAMO Academy, Alibaba Group 备注:Accepted to CVPR 2021 链接:https://arxiv.org/abs/2106.10482 摘要:尽管GANs在不同条件输入(如语义分割和边缘映射)的图像翻译中取得了巨大的成功,但生成具有参考风格的高保真真实图像仍然是条件图像到图像翻译的一个巨大挑战。本文提出了一个通用的图像翻译框架,该框架结合了图像翻译中条件输入和风格样本之间的特征对齐的最佳传输。最优传输的引入显著降低了多对一特征匹配的约束,同时在条件输入和样本之间建立了精确的语义对应。我们设计了一种新的非平衡最优传输来解决在条件输入和样本之间广泛存在的具有偏差分布的特征之间的传输问题。此外,我们设计了一个语义激活规范化方案,成功地将样本的风格特征注入到图像翻译过程中。对多个图像翻译任务的大量实验表明,该方法在定性和定量上均优于现有的图像翻译方法。 摘要:Despite the great success of GANs in images translation with different conditioned inputs such as semantic segmentation and edge maps, generating high-fidelity realistic images with reference styles remains a grand challenge in conditional image-to-image translation. This paper presents a general image translation framework that incorporates optimal transport for feature alignment between conditional inputs and style exemplars in image translation. The introduction of optimal transport mitigates the constraint of many-to-one feature matching significantly while building up accurate semantic correspondences between conditional inputs and exemplars. We design a novel unbalanced optimal transport to address the transport between features with deviational distributions which exists widely between conditional inputs and exemplars. In addition, we design a semantic-activation normalization scheme that injects style features of exemplars into the image translation process successfully. Extensive experiments over multiple image translation tasks show that our method achieves superior image translation qualitatively and quantitatively as compared with the state-of-the-art.
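作为示意,下面用 POT 库(假设已通过 pip install pot 安装)演示非平衡 Sinkhorn 最优传输的一种常见调用方式,对应“在具有偏差分布的特征之间做传输”这一思想;正则项与边缘松弛系数为示例取值,论文中的具体公式与求解器可能不同。

import numpy as np
import ot  # POT: Python Optimal Transport

# 两组特征(如条件输入特征与样本风格特征),质量总和故意不相等以体现“非平衡”
x = np.random.randn(50, 16)
y = np.random.randn(80, 16)
a = np.full(50, 1.0 / 50)
b = np.full(80, 1.5 / 80)

M = ot.dist(x, y)                      # 成对平方欧氏距离代价矩阵
M /= M.max()                           # 数值稳定性:归一化代价
T = ot.sinkhorn_unbalanced(a, b, M, reg=0.05, reg_m=1.0)   # 熵正则 + 边缘 KL 松弛
print(T.shape, T.sum())                # (50, 80);传输计划的总质量不再被强制等于任一侧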

【17】 Informative Class Activation Maps 标题:信息性类激活图

作者:Zhenyue Qin,Dongwoo Kim,Tom Gedeon 机构:Equal contribution 1School of Computing, Australian Na-tional University 2GSAI 备注:arXiv admin note: substantial text overlap with arXiv:1911.10688 链接:https://arxiv.org/abs/2106.10472 摘要:我们研究如何评估一个特定标签图像中区域的定量信息含量。为此,我们将类激活映射与信息论联系起来。我们开发了一个信息类激活图(infoCAM)。在给定的分类任务中,infoCAM描述了如何将局部区域的信息累积到整个图像的信息中,从而得到一个标签。因此,我们可以利用infoCAM定位标签的最具信息性的特征。当应用于图像分类任务时,infoCAM在弱监督目标定位任务中的性能优于传统的分类地图。我们在微型ImageNet上实现了最先进的结果。 摘要:We study how to evaluate the quantitative information content of a region within an image for a particular label. To this end, we bridge class activation maps with information theory. We develop an informative class activation map (infoCAM). Given a classification task, infoCAM depict how to accumulate information of partial regions to that of the entire image toward a label. Thus, we can utilise infoCAM to locate the most informative features for a label. When applied to an image classification task, infoCAM performs better than the traditional classification map in the weakly supervised object localisation task. We achieve state-of-the-art results on Tiny-ImageNet.
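infoCAM 建立在类激活图(CAM)的基础上。下面给出标准 CAM 的一个最小计算草图(用分类层权重对最后一层卷积特征图加权求和),信息论部分此处不展开,信息累积的具体方式请以原文为准;示例中的通道数与类别数均为假设。

import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (C, H, W) 最后卷积层输出;fc_weights: (num_classes, C) 全连接权重。"""
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)
    return cam / (cam.max() + 1e-8)     # 归一化到 [0, 1] 便于可视化

fmap = np.random.rand(512, 7, 7)
w_fc = np.random.rand(200, 512)         # 以 200 类(如 Tiny-ImageNet)为例
print(class_activation_map(fmap, w_fc, class_idx=3).shape)   # (7, 7)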

【18】 The Animal ID Problem: Continual Curation 标题:动物身份问题:可持续发展

作者:Charles V. Stewart,Jason R. Parham,Jason Holmberg,Tanya Y. Berger-Wolf 机构: StewartRensselaer Polytechnic Institutestewart, Berger-WolfOhio State Universityberger-wolf 备注:4 pages, 2 figures, non-archival in 2021 CVPR workshop 链接:https://arxiv.org/abs/2106.10377 摘要:为了激发从图像中识别个体动物的新研究,我们建议将此问题表述为对图像和动物身份的人机持续策展(Continual Curation)。这是一个开放世界的识别问题:大多数新的动物是在算法完成最初的训练和部署之后才进入系统的。如本文所定义的,持续策展需要(1)提高当前识别方法的有效性,(2)允许给出“不作判定”这一选项的成对验证算法,以及(3)能够寻求人类输入以指导策展过程的算法决策机制。错误度量必须评估识别算法不仅能识别仅被见过一两次的动物,还能识别数据库中不存在的新动物的能力。衡量整个系统性能的一个重要指标,是随所需人力投入量变化的准确率。 摘要:Hoping to stimulate new research in individual animal identification from images, we propose to formulate the problem as the human-machine Continual Curation of images and animal identities. This is an open world recognition problem, where most new animals enter the system after its algorithms are initially trained and deployed. Continual Curation, as defined here, requires (1) an improvement in the effectiveness of current recognition methods, (2) a pairwise verification algorithm that allows the possibility of no decision, and (3) an algorithmic decision mechanism that seeks human input to guide the curation process. Error metrics must evaluate the ability of recognition algorithms to identify not only animals that have been seen just once or twice but also recognize new animals not in the database. An important measure of overall system performance is accuracy as a function of the amount of human input required.

【19】 Reversible Colour Density Compression of Images using cGANs 标题:利用cGANs实现图像的可逆色密度压缩

作者:Arun Jose,Abraham Francis 备注:7 pages, 2 figures 链接:https://arxiv.org/abs/2106.10542 摘要:历史上,基于颜色密度的图像压缩难以实现无损解压。我们研究了利用条件生成对抗网络来学习图像之间的映射及其训练损失函数,从而使这种转换更加可行。我们表明,该方法能够生成视觉上无损的结果,说明高效的颜色压缩是可行的。 摘要:Image compression using colour densities is historically impractical to decompress losslessly. We examine the use of conditional generative adversarial networks in making this transformation more feasible, through learning a mapping between the images and a loss function to train on. We show that this method is effective at producing visually lossless generations, indicating that efficient colour compression is viable.

【20】 Direct Reconstruction of Linear Parametric Images from Dynamic PET Using Nonlocal Deep Image Prior 标题:利用非局部深度图像先验直接从动态PET重建线性参数图像

作者:Kuang Gong,Ciprian Catana,Jinyi Qi,Quanzheng Li 机构: Mas-sachusetts General Hospital and Harvard Medical School 备注:10 pages, 10 figures 链接:https://arxiv.org/abs/2106.10359 摘要:将PET成像模型与示踪剂动力学相结合,提出了直接从PET正弦图中估计参数图像的直接重建方法。由于接收计数有限,直接重建框架产生的参数图像的信噪比和分辨率仍然有限。近年来,有监督的深度学习方法已经成功地应用于医学图像去噪/重建中。对于静态PET成像,通过延长扫描时间可以获得高质量的训练标签。然而,这对于动态PET成像是不可行的,因为扫描时间已经足够长了。在这项工作中,我们提出了一个无监督的深度学习框架,直接从动态PET参数重建,并在Patlak模型和相对平衡Logan模型上进行了测试。患者的解剖先验图像(可从PET/CT或PET/MR扫描中获得)作为网络输入提供流形约束,还用于构造核层以执行非局部特征去噪。线性动力学模型作为1x1卷积层嵌入到网络结构中。训练目标函数基于PET统计模型。基于18F-FDG和11C-PiB示踪剂动态数据集的评估表明,该框架优于传统的基于核方法的直接重建方法。 摘要:Direct reconstruction methods have been developed to estimate parametric images directly from the measured PET sinograms by combining the PET imaging model and tracer kinetics in an integrated framework. Due to limited counts received, signal-to-noise-ratio (SNR) and resolution of parametric images produced by direct reconstruction frameworks are still limited. Recently supervised deep learning methods have been successfully applied to medical imaging denoising/reconstruction when large number of high-quality training labels are available. For static PET imaging, high-quality training labels can be acquired by extending the scanning time. However, this is not feasible for dynamic PET imaging, where the scanning time is already long enough. In this work, we proposed an unsupervised deep learning framework for direct parametric reconstruction from dynamic PET, which was tested on the Patlak model and the relative equilibrium Logan model. The patient's anatomical prior image, which is readily available from PET/CT or PET/MR scans, was supplied as the network input to provide a manifold constraint, and also utilized to construct a kernel layer to perform non-local feature denoising. The linear kinetic model was embedded in the network structure as a 1x1 convolution layer. The training objective function was based on the PET statistical model. Evaluations based on dynamic datasets of 18F-FDG and 11C-PiB tracers show that the proposed framework can outperform the traditional and the kernel method-based direct reconstruction methods.
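文中嵌入网络的线性动力学模型之一是 Patlak 模型,其标准线性形式为 C_T(t) ≈ Ki * ∫(0→t) Cp(τ)dτ + Vb * Cp(t)。下面用 NumPy 给出一个玩具拟合示例,说明该模型可以写成对两个已知基函数的线性组合(输入函数与参数均为虚构数据,并非论文中嵌入 1x1 卷积层的实现)。

import numpy as np

t = np.linspace(0.1, 60.0, 24)                 # 帧中点时间(分钟,虚构)
Cp = 10.0 * np.exp(-0.1 * t) + 1.0             # 虚构的血浆输入函数
int_Cp = np.cumsum(Cp) * (t[1] - t[0])         # 数值积分 ∫0^t Cp dτ

Ki_true, Vb_true = 0.05, 0.3
Ct = Ki_true * int_Cp + Vb_true * Cp + 0.05 * np.random.randn(t.size)   # 模拟组织时间活度曲线

# Patlak 模型是两个已知基函数(int_Cp 与 Cp)的线性组合,可用最小二乘直接求解
A = np.stack([int_Cp, Cp], axis=1)
(Ki_hat, Vb_hat), *_ = np.linalg.lstsq(A, Ct, rcond=None)
print(round(Ki_hat, 3), round(Vb_hat, 3))      # 应接近 0.05 与 0.3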

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2021-06-22,如有侵权请联系 cloudcommunity@tencent.com 删除
