
Computer Vision arXiv Daily Digest [7.28]

Author: 公众号-arXiv每日学术速递 (WeChat official account "arXiv Daily Academic Digest")
Published 2021-07-29 14:29:34

cs.CV: 65 papers today

Transformer (2 papers)

【1】 Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers

Authors: Wen Wang, Yang Cao, Jing Zhang, Fengxiang He, Zheng-Jun Zha, Yonggang Wen, Dacheng Tao
Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Note: Accepted by ACM MM2021. Source code is available at: this https URL
Link: https://arxiv.org/abs/2107.12636
Abstract: Detection transformers have recently shown promising object detection results and attracted increasing attention. However, how to develop effective domain adaptation techniques to improve their cross-domain performance remains unexplored. In this paper, we delve into this topic and empirically find that direct feature distribution alignment on the CNN backbone brings only limited improvements, as it does not guarantee domain-invariant sequence features in the transformer for prediction. To address this issue, we propose a novel Sequence Feature Alignment (SFA) method specially designed for the adaptation of detection transformers. Technically, SFA consists of a domain query-based feature alignment (DQFA) module and a token-wise feature alignment (TDA) module. In DQFA, a novel domain query is used to aggregate and align global context from the token sequences of both domains. DQFA reduces the domain discrepancy in global feature representations and object relations when deployed in the transformer encoder and decoder, respectively. Meanwhile, TDA aligns token features in the sequences from both domains, which reduces the domain gaps in local and instance-level feature representations in the transformer encoder and decoder, respectively. Besides, a novel bipartite matching consistency loss is proposed to enhance feature discriminability for robust object detection. Experiments on three challenging benchmarks show that SFA outperforms state-of-the-art domain adaptive object detection methods. Code is available at: https://github.com/encounter1997/SFA.
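
The DQFA idea, a learnable domain query that aggregates global context from the token sequence and is pushed to be domain-invariant through adversarial training, can be sketched in a few lines of PyTorch. This is an illustrative sketch under stated assumptions (module names, dimensions, and the single-query attention are ours), not the released SFA code:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, negated gradient backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class DomainQueryAlignment(nn.Module):
    """Aggregate global context into a learnable domain query, then classify its
    domain through gradient reversal so the sequence features become domain-invariant."""
    def __init__(self, dim=256):
        super().__init__()
        self.domain_query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.domain_clf = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, tokens, lam=1.0):
        # tokens: (B, N, dim) sequence features from the transformer encoder/decoder
        q = self.domain_query.expand(tokens.size(0), -1, -1)
        ctx_vec, _ = self.attn(q, tokens, tokens)        # (B, 1, dim) global context
        return self.domain_clf(GradReverse.apply(ctx_vec.squeeze(1), lam))

# Adversarial training step (source label 0, target label 1):
# loss = F.binary_cross_entropy_with_logits(logits_src, torch.zeros_like(logits_src)) + \
#        F.binary_cross_entropy_with_logits(logits_tgt, torch.ones_like(logits_tgt))
```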

【2】 PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution

Authors: Pan Xie, Mengyi Zhao, Xiaohui Hu
Affiliations: School of Automation Science and Electrical Engineering, Beihang University
Link: https://arxiv.org/abs/2107.12600
Abstract: Owing to the superiority of the Transformer in learning long-term dependencies, sign language Transformer models have achieved remarkable progress in Sign Language Recognition (SLR) and Translation (SLT). However, several issues with the Transformer prevent better sign language understanding. First, the self-attention mechanism learns sign video representations in a frame-wise manner, neglecting the temporal semantic structure of sign gestures. Second, attention with absolute position encoding is direction- and distance-unaware, which limits its ability. To address these issues, we propose a new model architecture, namely PiSLTRc, with two distinctive characteristics: (i) content-aware and position-aware convolution layers. Specifically, we explicitly select relevant features using a novel content-aware neighborhood gathering method, and then aggregate these features with position-informed temporal convolution layers, generating robust neighborhood-enhanced sign representations. (ii) Injecting relative position information into the attention mechanism in the encoder, the decoder, and even the encoder-decoder cross-attention. Compared with the vanilla Transformer model, our model performs consistently better on three large-scale sign language benchmarks: PHOENIX-2014, PHOENIX-2014-T and CSL. Furthermore, extensive experiments demonstrate that the proposed method achieves state-of-the-art translation quality with a +1.6 BLEU improvement.
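
A minimal sketch of the relative-position idea the paper injects into attention: scores get a learned bias per signed frame offset, making attention direction- and distance-aware. Illustrative only, not the PiSLTRc implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosSelfAttention(nn.Module):
    """Self-attention with a learned relative-position bias, so attention scores
    depend on the signed distance between frames (simplified, single-head)."""
    def __init__(self, dim, max_len=512):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        # one learnable bias per signed offset in [-(max_len-1), max_len-1]
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):                                    # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(1, 2) * self.scale          # (B, T, T)
        idx = torch.arange(T, device=x.device)
        offset = idx[:, None] - idx[None, :]                 # signed frame distance
        scores = scores + self.rel_bias[offset + self.max_len - 1]
        return F.softmax(scores, dim=-1) @ v
```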

Detection (11 papers)

【1】 Real-time Keypoints Detection for Autonomous Recovery of the Unmanned Ground Vehicle

Authors: Jie Li, Sheng Zhang, Kai Han, Xia Yuan, Chunxia Zhao, Yu Liu
Note: IET Image Processing, code: this https URL
Link: https://arxiv.org/abs/2107.12852
Abstract: The combination of a small unmanned ground vehicle (UGV) and a large unmanned carrier vehicle allows more flexibility in real applications such as rescue in dangerous scenarios. The autonomous recovery system, which guides the small UGV back to the carrier vehicle, is an essential component for a seamless combination of the two vehicles. This paper proposes a novel autonomous recovery framework with a low-cost monocular vision system to provide accurate positioning and attitude estimation of the UGV during navigation. First, we introduce a lightweight convolutional neural network called UGV-KPNet to detect the keypoints of the small UGV from images captured by a monocular camera. UGV-KPNet is computationally efficient, has a small number of parameters, and provides pixel-level accurate keypoint detection results in real time. Then, the six-degrees-of-freedom pose is estimated from the detected keypoints to obtain the positioning and attitude of the UGV. Besides, we are the first to create a large-scale real-world keypoints dataset of the UGV. The experimental results demonstrate that the proposed system achieves state-of-the-art performance in terms of both accuracy and speed on UGV keypoint detection, and can further boost 6-DoF pose estimation for the UGV.
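
Recovering a 6-DoF pose from detected 2D keypoints is a standard Perspective-n-Point problem. A hedged sketch with OpenCV follows; the 3D keypoint layout and camera intrinsics below are placeholder values, not those used by UGV-KPNet:

```python
import cv2
import numpy as np

# 3D keypoint coordinates in the UGV body frame (metres) -- illustrative values
object_pts = np.array([[0.0, 0.0, 0.0],
                       [0.4, 0.0, 0.0],
                       [0.4, 0.3, 0.0],
                       [0.0, 0.3, 0.0],
                       [0.2, 0.15, 0.12]], dtype=np.float32)

K = np.array([[800., 0., 320.],        # camera intrinsics (fx, fy, cx, cy) -- placeholders
              [0., 800., 240.],
              [0., 0., 1.]], dtype=np.float32)

def estimate_pose(image_pts):
    """image_pts: (N, 2) keypoints detected by the network, same order as
    object_pts. Returns the rotation matrix and translation vector."""
    ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts.astype(np.float32), K, None)
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)         # axis-angle -> 3x3 rotation matrix
    return R, tvec
```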

【2】 Clickbait Detection in YouTube Videos

Authors: Ruchira Gothankar, Fabio Di Troia, Mark Stamp
Link: https://arxiv.org/abs/2107.12791
Abstract: YouTube videos often include captivating descriptions and intriguing thumbnails designed to increase the number of views, and thereby increase the revenue for the person who posted the video. This creates an incentive for people to post clickbait videos, in which the content might deviate significantly from the title, description, or thumbnail. In effect, users are tricked into clicking on clickbait videos. In this research, we consider the challenging problem of detecting clickbait YouTube videos. We experiment with multiple state-of-the-art machine learning techniques using a variety of textual features.
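
The abstract does not name the exact features or classifiers; as an illustration of the general recipe (textual features plus a standard classifier), here is a minimal scikit-learn pipeline on toy placeholder data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# video titles/descriptions with binary clickbait labels -- toy placeholder data
texts  = ["You WON'T BELIEVE what happens next!!!",
          "CVPR 2021 keynote: vision transformers"]
labels = [1, 0]

# TF-IDF word/bigram features feeding a linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["10 tricks doctors don't want you to know"]))
```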

【3】 Discriminative-Generative Representation Learning for One-Class Anomaly Detection

Authors: Xuan Xia, Xizhou Pan, Xing He, Jingfei Zhang, Ning Ding, Lin Ma
Affiliations: Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China; Institute of Robotics and Intelligent Manufacturing, The Chinese University of Hong Kong, Shenzhen, China
Note: 13 pages, 6 figures
Link: https://arxiv.org/abs/2107.12753
Abstract: As a kind of generative self-supervised learning method, generative adversarial nets have been widely studied in the field of anomaly detection. However, the representation learning ability of the generator is limited since it pays too much attention to pixel-level details, and the generator has difficulty learning abstract semantic representations from label-prediction pretext tasks as effectively as the discriminator. To improve the representation learning ability of the generator, we propose a self-supervised learning framework combining generative and discriminative methods. The generator no longer learns representations through reconstruction error but under the guidance of the discriminator, and can thus benefit from pretext tasks designed for discriminative methods. Our discriminative-generative representation learning method achieves performance close to discriminative methods with a great advantage in speed. Applied to the one-class anomaly detection task, our method significantly outperforms several state-of-the-art methods on multiple benchmark datasets, increasing the performance of the top-performing GAN-based baseline by 6% on CIFAR-10 and 2% on MVTAD.

【4】 DV-Det: Efficient 3D Point Cloud Object Detection with Dynamic Voxelization

Authors: Zhaoyu Su, Pin Siang Tan, Yu-Hsing Wang
Affiliations: DESR Laboratory, Department of Civil and Environmental Engineering, Hong Kong University of Science and Technology
Link: https://arxiv.org/abs/2107.12707
Abstract: In this work, we propose a novel two-stage framework for efficient 3D point cloud object detection. Instead of transforming point clouds into 2D bird's-eye-view projections, we parse the raw point cloud data directly in 3D space, yet achieve impressive efficiency and accuracy. To achieve this goal, we propose dynamic voxelization, a method that voxelizes points at local scale on the fly. By doing so, we preserve the point cloud geometry with 3D voxels, and therefore waive the dependence on expensive MLPs to learn from point coordinates. On the other hand, we inherently still follow the same processing pattern as point-wise methods (e.g., PointNet) and no longer suffer from the quantization issue of conventional convolutions. For further speed optimization, we propose a grid-based downsampling and voxelization method, and provide different CUDA implementations to accommodate the differing requirements of the training and inference phases. We highlight our efficiency with 75 FPS on the KITTI 3D object detection dataset and 25 FPS inference speed on the Waymo Open dataset, with satisfactory accuracy.
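
Dynamic voxelization, hashing points to voxels on the fly and pooling features per occupied voxel, can be sketched with pure PyTorch scatter operations. The paper uses custom CUDA kernels; this sketch only conveys the idea:

```python
import torch

def dynamic_voxelize(points, feats, voxel_size):
    """On-the-fly voxelization: map each point to its voxel and mean-pool
    features per occupied voxel. points: (N, 3), feats: (N, C)."""
    coords = torch.floor(points / voxel_size).long()           # (N, 3) voxel indices
    # unique occupied voxels and an inverse map: point -> voxel slot
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)
    voxel_feats = torch.zeros(uniq.size(0), feats.size(1), device=feats.device)
    counts = torch.zeros(uniq.size(0), 1, device=feats.device)
    voxel_feats.index_add_(0, inverse, feats)                  # sum features per voxel
    counts.index_add_(0, inverse, torch.ones(feats.size(0), 1, device=feats.device))
    return uniq, voxel_feats / counts.clamp(min=1)             # mean-pooled voxel features
```

Only occupied voxels are materialized, which is what avoids the memory cost of a dense grid.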

【5】 Dynamic and Static Object Detection Considering Fusion Regions and Point-wise Features

Authors: Andrés Gómez, Thomas Genevois, Jerome Lussereau, Christian Laugier
Note: 6 pages, 7 figures
Link: https://arxiv.org/abs/2107.12692
Abstract: Object detection is a critical problem for the safe interaction between autonomous vehicles and road users. Deep-learning methodologies have allowed the development of object detection approaches with better performance. However, there remains the challenge of obtaining more characteristics from the objects detected in real time. The main reason is that more information about the environment's objects can improve the autonomous vehicle's capacity to face different urban situations. This paper proposes a new approach to detect static and dynamic objects in front of an autonomous vehicle. Our approach can also obtain other characteristics from the objects detected, like their position, velocity, and heading. We develop our proposal by fusing the environment interpretations produced by YoloV3 with a Bayesian filter. To demonstrate our proposal's performance, we assess it on a benchmark dataset and on real-world data obtained from an autonomous platform, and we compare the results achieved with another approach.
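
The abstract does not specify the exact Bayesian filter; a common concrete choice for fusing per-frame detections into position, velocity, and heading estimates is a constant-velocity Kalman filter, sketched here as an assumption:

```python
import numpy as np

class ConstantVelocityKF:
    """2D constant-velocity Kalman filter; state = [x, y, vx, vy].
    Per-frame detections are fused as position measurements, yielding
    velocity and heading as by-products."""
    def __init__(self, dt=0.1):
        self.x = np.zeros(4)
        self.P = np.eye(4) * 10.0
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt   # motion model
        self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * 0.01                              # process noise
        self.R = np.eye(2) * 0.5                               # measurement noise

    def step(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with the detected object centre z = [x, y]
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        heading = np.arctan2(self.x[3], self.x[2])             # from the velocity estimate
        return self.x[:2], self.x[2:], heading                 # position, velocity, heading
```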

【6】 Adaptive Boundary Proposal Network for Arbitrary Shape Text Detection

Authors: Shi-Xue Zhang, Xiaobin Zhu, Chun Yang, Hongfa Wang, Xu-Cheng Yin
Affiliations: School of Computer and Communication Engineering, University of Science and Technology Beijing; USTB-EEasyTech Joint Lab of Artificial Intelligence; Tencent Technology (Shenzhen) Co., Ltd
Link: https://arxiv.org/abs/2107.12664
Abstract: Arbitrary shape text detection is a challenging task due to the high complexity and variety of scene texts. In this work, we propose a novel adaptive boundary proposal network for arbitrary shape text detection, which can learn to directly produce accurate boundaries for arbitrary shape text without any post-processing. Our method mainly consists of a boundary proposal model and an innovative adaptive boundary deformation model. The boundary proposal model, constructed from multi-layer dilated convolutions, produces prior information (including a classification map, a distance field, and a direction field) and coarse boundary proposals. The adaptive boundary deformation model is an encoder-decoder network, in which the encoder mainly consists of a Graph Convolutional Network (GCN) and a Recurrent Neural Network (RNN). It performs boundary deformation in an iterative way to obtain the text instance shape, guided by prior information from the boundary proposal model. In this way, our method can directly and efficiently generate accurate text boundaries without complex post-processing. Extensive experiments on publicly available datasets demonstrate the state-of-the-art performance of our method.

【7】 Unsupervised Outlier Detection using Memory and Contrastive Learning

Authors: Ning Huyan, Dou Quan, Xiangrong Zhang, Xuefeng Liang, Jocelyn Chanussot, Licheng Jiao
Affiliations: School of Artificial Intelligence, Xidian University, Shaanxi, China; Inria Grenoble Rhône-Alpes research centre, France
Link: https://arxiv.org/abs/2107.12642
Abstract: Outlier detection is one of the most important processes for creating good, reliable data in machine learning. Most outlier detection methods leverage an auxiliary reconstruction task by assuming that outliers are more difficult to recover than normal samples (inliers). However, this is not always true, especially for auto-encoder (AE) based models, which may recover certain outliers even when outliers are absent from the training data, because they do not constrain the feature learning. Instead, we argue that outlier detection can be done in the feature space by measuring the feature distance between outliers and inliers. We then propose a framework, MCOD, using a memory module and a contrastive learning module. The memory module constrains the consistency of the features that represent the normal data. The contrastive learning module learns more discriminative features, which boosts the distinction between outliers and inliers. Extensive experiments on four benchmark datasets show that our proposed MCOD achieves considerable performance and outperforms nine state-of-the-art methods.
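
The feature-space view of outlier detection, scoring a sample by its distance to prototypes of normal data held in a memory, reduces to a few lines. A hedged sketch (not the MCOD code, and omitting the contrastive training stage):

```python
import torch
import torch.nn.functional as F

def outlier_score(feat, memory):
    """Score a test feature by its distance to the closest memory slot.
    feat: (B, D) encoder features; memory: (M, D) learned prototypes of
    normal data. Larger score => more likely an outlier."""
    feat = F.normalize(feat, dim=1)
    memory = F.normalize(memory, dim=1)
    sim = feat @ memory.t()                   # cosine similarity (B, M)
    return 1.0 - sim.max(dim=1).values        # distance to nearest inlier prototype
```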

【8】 CFLOW-AD: Real-Time Unsupervised Anomaly Detection with Localization via Conditional Normalizing Flows

Authors: Denis Gudovskiy, Shun Ishizaka, Kazuki Kozuka
Affiliations: Panasonic AI Lab, USA; Panasonic Technology Division, Japan
Note: Accepted to WACV 2022. Preprint
Link: https://arxiv.org/abs/2107.12571
Abstract: Unsupervised anomaly detection with localization has many practical applications when labeling is infeasible and, moreover, when anomaly examples are completely missing in the training data. While recently proposed models for such data setups achieve high accuracy metrics, their complexity is a limiting factor for real-time processing. In this paper, we propose a real-time model and analytically derive its relationship to prior methods. Our CFLOW-AD model is based on a conditional normalizing flow framework adopted for anomaly detection with localization. In particular, CFLOW-AD consists of a discriminatively pretrained encoder followed by multi-scale generative decoders that explicitly estimate the likelihood of the encoded features. Our approach results in a computationally and memory-efficient model: CFLOW-AD is 10x faster and smaller than the prior state of the art with the same input setting. Our experiments on the MVTec dataset show that CFLOW-AD outperforms previous methods by 0.36% AUROC in the detection task, and by 1.12% AUROC and 2.5% AUPRO in the localization task. We open-source our code with fully reproducible experiments.

【9】 Parallel Detection for Efficient Video Analytics at the Edge

Authors: Yanzhao Wu, Ling Liu, Ramana Kompella
Affiliations: School of Computer Science, Georgia Institute of Technology, Atlanta, Georgia; Cisco Systems, Inc., San Jose, California
Link: https://arxiv.org/abs/2107.12563
Abstract: Deep Neural Network (DNN) trained object detectors are widely deployed in many mission-critical systems for real-time video analytics at the edge, such as autonomous driving and video surveillance. A common performance requirement in these mission-critical edge services is near real-time latency of online object detection on edge devices. However, even with well-trained DNN object detectors, online detection quality at the edge may deteriorate for a number of reasons, such as limited capacity to run DNN object detection models on heterogeneous edge devices, and detection quality degradation due to random frame dropping when the detection processing rate is significantly slower than the incoming video frame rate. This paper addresses these problems by exploiting multi-model multi-device detection parallelism for fast object detection in edge systems with heterogeneous edge devices. First, we analyze the performance bottleneck of running a well-trained DNN model at the edge for real-time online object detection. We use offline detection as a reference model, and examine the root cause by analyzing the mismatch among the incoming video streaming rate, the video processing rate for object detection, and the output rate for real-time detection visualization of video streaming. Second, we study performance optimizations that exploit multi-model detection parallelism. We show that the model-parallel detection approach can effectively speed up the FPS detection processing rate, minimizing the FPS disparity with the incoming video frame rate on heterogeneous edge devices. We evaluate the proposed approach using SSD300 and YOLOv3 on benchmark videos of different video stream rates. The results show that exploiting multi-model detection parallelism can speed up the online object detection processing rate and deliver near real-time object detection performance for efficient video analytics at the edge.

【10】 Perception-and-Regulation Network for Salient Object Detection

Authors: Jinchao Zhu, Xiaoyu Zhang, Xian Fang, Junnan Liu
Affiliations: Nankai University; Harbin Engineering University
Link: https://arxiv.org/abs/2107.12560
Abstract: Effective fusion of different types of features is the key to salient object detection. Most existing network structure designs are based on the subjective experience of researchers, and the feature fusion process does not consider the relationship between the fused features and the highest-level features. In this paper, we focus on the relationships between features and propose a novel global attention unit, which we term the "perception-and-regulation" (PR) block, that adaptively regulates the feature fusion process by explicitly modeling interdependencies between features. The perception part uses the structure of fully-connected layers in classification networks to learn the size and shape of objects. The regulation part selectively strengthens and weakens the features to be fused. An imitating eye observation module (IEO) is further employed to improve the global perception ability of the network. The imitation of foveal vision and peripheral vision enables the IEO to scrutinize highly detailed objects and to organize the broad spatial scene to better segment objects. Extensive experiments conducted on SOD datasets demonstrate that the proposed method performs favorably against 22 state-of-the-art methods.
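
The perceive-then-regulate pattern (global context through fully-connected layers, then channel-wise strengthening and weakening of the features to be fused) resembles squeeze-and-excitation gating. The following is an SE-style sketch of that pattern under our own naming, not the paper's PR block:

```python
import torch
import torch.nn as nn

class PRGate(nn.Module):
    """'Perceive' global context with fully-connected layers, then 'regulate'
    fusion by re-weighting each channel of the features to be fused."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.perceive = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, fused, top_level):
        # gates conditioned on the highest-level features (same channel count
        # as `fused` is assumed) regulate the fusion result
        g = self.perceive(top_level).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return fused * g
```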

【11】 Optimizing Operating Points for High Performance Lesion Detection and Segmentation Using Lesion Size Reweighting

Authors: Brennan Nichyporuk, Justin Szeto, Douglas L. Arnold, Tal Arbel
Affiliations: Centre for Intelligent Machines, McGill University, Montreal, Canada; MILA (Quebec Artificial Intelligence Institute), McGill University, Montreal, Canada; Montreal Neurological Institute, McGill University, Montreal, Canada
Note: Accepted at MIDL 2021
Link: https://arxiv.org/abs/2107.12978
Abstract: There are many clinical contexts which require accurate detection and segmentation of all focal pathologies (e.g. lesions, tumours) in patient images. In cases where there is a mix of small and large lesions, the standard binary cross-entropy loss will result in better segmentation of large lesions at the expense of missing small ones. Adjusting the operating point to accurately detect all lesions generally leads to oversegmentation of large lesions. In this work, we propose a novel reweighting strategy to eliminate this performance gap, increasing small pathology detection performance while maintaining segmentation accuracy. We show that our reweighting strategy vastly outperforms competing strategies based on experiments on a large-scale, multi-scanner, multi-center dataset of Multiple Sclerosis patient images.
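
One concrete way to realize size-based reweighting (assumed here, since the paper's exact scheme is not given in the abstract) is to weight each ground-truth lesion's pixels inversely to its connected-component size, so small lesions contribute as much to the loss as large ones:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy import ndimage

def size_reweighted_bce(logits, target):
    """Weighted BCE where every lesion contributes roughly equally.
    logits/target: (H, W) tensors, target binary. Illustrative sketch only."""
    labels, n = ndimage.label(target.cpu().numpy())        # connected lesions
    weights = np.ones_like(labels, dtype=np.float32)       # background keeps weight 1
    for i in range(1, n + 1):
        mask = labels == i
        weights[mask] = 1.0 / mask.sum()                   # inverse lesion size
    weights = torch.from_numpy(weights).to(logits.device)
    return F.binary_cross_entropy_with_logits(logits, target.float(),
                                              weight=weights)
```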

Classification | Recognition (5 papers)

【1】 Multi-Scale Local-Temporal Similarity Fusion for Continuous Sign Language Recognition

Authors: Pan Xie, Zhi Cui, Yao Du, Mengyi Zhao, Jianwei Cui, Bin Wang, Xiaohui Hu
Affiliations: Beihang University, Beijing, China; Xiaomi AI Lab, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China
Link: https://arxiv.org/abs/2107.12762
Abstract: Continuous sign language recognition (cSLR) is a publicly significant task that transcribes a sign language video into an ordered gloss sequence. It is important to capture fine-grained gloss-level details, since there is no explicit alignment between sign video frames and the corresponding glosses. Among past works, one promising way is to adopt a one-dimensional convolutional network (1D-CNN) to temporally fuse the sequential frames. However, CNNs are agnostic to similarity or dissimilarity, and thus are unable to capture locally consistent semantics within temporally neighboring frames. To address the issue, we propose to adaptively fuse local features via temporal similarity. Specifically, we devise a Multi-scale Local-Temporal Similarity Fusion Network (mLTSF-Net) as follows: 1) For a specific video frame, we first select its similar neighbours with multi-scale receptive regions to accommodate different lengths of glosses. 2) To ensure temporal consistency, we then use position-aware convolution to temporally convolve each scale of selected frames. 3) To obtain a locally-temporally enhanced frame-wise representation, we finally fuse the results of different scales using a content-dependent aggregator. We train our model in an end-to-end fashion, and experimental results on the RWTH-PHOENIX-Weather 2014 dataset (RWTH) demonstrate that our model achieves competitive performance compared with several state-of-the-art models.

【2】 Real-Time Activity Recognition and Intention Recognition Using a Vision-based Embedded System

Authors: Sahar Darafsh, Saeed Shiry Ghidary, Morteza Saheb Zamani
Affiliations: Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran; Mathematics and Computer Science Department
Link: https://arxiv.org/abs/2107.12744
Abstract: With the rapid increase in digital technologies, recognition of human activity and intention has become an important research topic in smart environments. In this research, we introduce a real-time activity recognition system to recognize people's intention to pass or not pass a door. This system, if applied in elevators and automatic doors, will save energy and increase efficiency. For this study, data preparation combines spatial and temporal features with the help of digital image processing principles. Nevertheless, unlike previous studies, only one AlexNet neural network is used instead of two-stream convolutional neural networks. Our embedded system was implemented with an accuracy of 98.78% on our Intention Recognition dataset. We also examined our data representation approach on other datasets, including HMDB-51, KTH, and Weizmann, and obtained accuracies of 78.48%, 97.95%, and 100%, respectively. The image recognition and neural network models were simulated and implemented using Xilinx simulators for the ZCU102 board. The operating frequency of this embedded system is 333 MHz, and it works in real time at 120 frames per second (fps).

【3】 ENHANCE (ENriching Health data by ANnotations of Crowd and Experts): A case study for skin lesion classification

Authors: Ralf Raumanns, Gerard Schouten, Max Joosten, Josien P. W. Pluim, Veronika Cheplygina
Affiliations: Fontys University of Applied Science, Eindhoven, The Netherlands; Eindhoven University of Technology, Eindhoven, The Netherlands; IT University of Copenhagen, Denmark
Link: https://arxiv.org/abs/2107.12734
Abstract: We present ENHANCE, an open dataset with multiple annotations to complement the existing ISIC and PH2 skin lesion classification datasets. This dataset contains annotations of visual ABC (asymmetry, border, colour) features from non-expert annotation sources: undergraduate students, crowd workers from Amazon MTurk, and classic image processing algorithms. In this paper we first analyse the correlations between the annotations and the diagnostic label of the lesion, as well as study the agreement between different annotation sources. Overall we find weak correlations of non-expert annotations with the diagnostic label, and low agreement between different annotation sources. We then study multi-task learning (MTL) with the annotations as additional labels, and show that non-expert annotations can improve (ensembles of) state-of-the-art convolutional neural networks via MTL. We hope that our dataset can be used in further research into multiple annotations and/or MTL. All data and models are available on GitHub: https://github.com/raumannsr/ENHANCE.

【4】 Feature Fusion Methods for Indexing and Retrieval of Biometric Data: Application to Face Recognition with Privacy Protection

Authors: Pawel Drozdowski, Fabian Stockhardt, Christian Rathgeb, Dailé Osorio-Roig, Christoph Busch
Affiliations: dasec - Biometrics and Internet Security Research Group, Hochschule Darmstadt, Germany
Link: https://arxiv.org/abs/2107.12675
Abstract: Computationally efficient, accurate, and privacy-preserving data storage and retrieval are among the key challenges faced by practical deployments of biometric identification systems worldwide. In this work, a method of protected indexing of biometric data is presented. By utilising feature-level fusion of intelligently paired templates, a multi-stage search structure is created. During retrieval, the list of potential candidate identities is successively pre-filtered, thereby reducing the number of template comparisons necessary for a biometric identification transaction. Protection of the biometric probe templates, as well as the stored reference templates and the created index, is carried out using homomorphic encryption. The proposed method is extensively evaluated in closed-set and open-set identification scenarios on publicly available databases using two state-of-the-art open-source face recognition systems. With respect to a typical baseline algorithm utilising an exhaustive search-based retrieval algorithm, the proposed method reduces the computational workload associated with a biometric identification transaction by 90%, while simultaneously suffering no degradation of the biometric performance. Furthermore, by facilitating a seamless integration of template protection with open-source homomorphic encryption libraries, the proposed method guarantees unlinkability, irreversibility, and renewability of the protected biometric data.

【5】 Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification

Authors: Zefeng Ding, Changxing Ding, Zhiyin Shao, Dacheng Tao
Affiliations: South China University of Technology; Pazhou Lab, Guangzhou; JD Explore Academy
Note: A new database is provided. Code will be released.
Link: https://arxiv.org/abs/2107.12666
Abstract: Text-to-image person re-identification (ReID) aims to search for images containing a person of interest using textual descriptions. However, due to the significant modality gap and the large intra-class variance in textual descriptions, text-to-image ReID remains a challenging problem. Accordingly, in this paper, we propose a Semantically Self-Aligned Network (SSAN) to handle the above problems. First, we propose a novel method that automatically extracts semantically aligned part-level features from the two modalities. Second, we design a multi-view non-local network that captures the relationships between body parts, thereby establishing better correspondences between body parts and noun phrases. Third, we introduce a Compound Ranking (CR) loss that makes use of textual descriptions for other images of the same identity to provide extra supervision, thereby effectively reducing the intra-class variance in textual features. Finally, to expedite future research in text-to-image ReID, we build a new database named ICFG-PEDES. Extensive experiments demonstrate that SSAN outperforms state-of-the-art approaches by significant margins. Both the new ICFG-PEDES database and the SSAN code are available at https://github.com/zifyloo/SSAN.
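
The abstract describes the Compound Ranking loss only at a high level; below is a hedged triplet-style sketch of the underlying idea, ranking an image against captions of the same identity (including captions of other images of that identity) versus captions of different identities. This is illustrative, not SSAN's exact formulation:

```python
import torch
import torch.nn.functional as F

def compound_ranking_loss(img, txt, ids, margin=0.2):
    """img/txt: (B, D) image/caption embeddings; ids: (B,) identity labels.
    Pull an image toward every caption of its identity, push it away from
    captions of other identities (hardest-pair hinge)."""
    img = F.normalize(img, dim=1)
    txt = F.normalize(txt, dim=1)
    sim = img @ txt.t()                                        # (B, B) similarities
    pos_mask = ids[:, None] == ids[None, :]                    # same-identity pairs
    pos = sim.masked_fill(~pos_mask, 1.0).min(dim=1).values    # hardest positive
    neg = sim.masked_fill(pos_mask, -1.0).max(dim=1).values    # hardest negative
    return F.relu(margin + neg - pos).mean()
```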

Segmentation | Semantics (7 papers)

【1】 Remember What You have drawn: Semantic Image Manipulation with Memory

Authors: Xiangxi Shi, Zhonghua Wu, Guosheng Lin, Jianfei Cai, Shafiq Joty
Affiliations: School of Electrical Engineering and Computer Science, Oregon State University; School of Computer Science and Engineering, Nanyang Technological University; Department of Data Science & AI, Monash University
Link: https://arxiv.org/abs/2107.12579
Abstract: Image manipulation with natural language, which aims to manipulate images under the guidance of language descriptions, has been a challenging problem in computer vision and natural language processing (NLP). A number of efforts have been made on this task, but their performance is still far from generating realistic and text-conformed manipulated images. Therefore, in this paper, we propose a memory-based Image Manipulation Network (MIM-Net), where a set of memories learned from images is introduced to synthesize texture information under the guidance of the textual description. We propose a two-stage network with an additional reconstruction stage to learn the latent memories efficiently. To avoid unnecessary background changes, we propose a Target Localization Unit (TLU) to focus the manipulation on the region mentioned by the text. Moreover, to learn a robust memory, we further propose a novel randomized memory training loss. Experiments on four popular datasets show the better performance of our method compared to existing ones.

【2】 Self-Supervised Video Object Segmentation by Motion-Aware Mask Propagation

Authors: Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian
Affiliations: The University of Western Australia; Griffith University
Note: 10 pages, 5 figures
Link: https://arxiv.org/abs/2107.12569
Abstract: We propose a self-supervised spatio-temporal matching method, coined Motion-Aware Mask Propagation (MAMP), for semi-supervised video object segmentation. During training, MAMP leverages the frame reconstruction task to train the model without the need for annotations. During inference, MAMP extracts high-resolution features from each frame to build a memory bank from the features as well as the predicted masks of selected past frames. MAMP then propagates the masks from the memory bank to subsequent frames according to our motion-aware spatio-temporal matching module, also proposed in this paper. Evaluation on the DAVIS-2017 and YouTube-VOS datasets shows that MAMP achieves state-of-the-art performance with stronger generalization ability compared to existing self-supervised methods, i.e. 4.9% higher mean J&F on DAVIS-2017 and 4.85% higher mean J&F on the unseen categories of YouTube-VOS than the nearest competitor. Moreover, MAMP performs on par with many supervised video object segmentation methods. Our code is available at: https://github.com/bo-miao/MAMP.
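
Mask propagation from a memory bank is typically implemented as attention-weighted label copying: each query pixel takes a soft combination of stored mask labels, weighted by feature affinity. A minimal sketch of that mechanism (without the paper's motion-aware matching; tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def propagate_masks(query_feat, mem_feats, mem_masks, temperature=0.07):
    """Label propagation from a memory bank.
    query_feat: (D, N) current-frame pixel features,
    mem_feats:  (D, M) features of stored past-frame pixels,
    mem_masks:  (K, M) one-hot masks of the stored pixels (K classes).
    Returns (K, N) soft labels for the current frame."""
    q = F.normalize(query_feat, dim=0)
    m = F.normalize(mem_feats, dim=0)
    affinity = torch.softmax((m.t() @ q) / temperature, dim=0)   # (M, N), sums to 1 over memory
    return mem_masks @ affinity                                  # propagated labels
```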

【3】 Segmentation in Style: Unsupervised Semantic Image Segmentation with StyleGAN and CLIP

Authors: Daniil Pakhomov, Sanchit Hira, Narayani Wagle, Kemar E. Green, Nassir Navab
Affiliations: Johns Hopkins University
Link: https://arxiv.org/abs/2107.12518
Abstract: We introduce a method that allows images to be automatically segmented into semantically meaningful regions without human supervision. Derived regions are consistent across different images and coincide with human-defined semantic classes on some datasets. In cases where semantic regions might be hard for humans to define and consistently label, our method is still able to find meaningful and consistent semantic classes. In our work, we use a pretrained StyleGAN2 (Karras et al., 2020) generative model: clustering in the feature space of the generative model allows semantic classes to be discovered. Once classes are discovered, a synthetic dataset with generated images and corresponding segmentation masks can be created. After that, a segmentation model is trained on the synthetic dataset and is able to generalize to real images. Additionally, by using CLIP (Radford et al., 2021) we are able to use prompts defined in natural language to discover some desired semantic classes. We test our method on publicly available datasets and show state-of-the-art results.

【4】 A Comprehensive Study on Colorectal Polyp Segmentation with ResUNet++, Conditional Random Field and Test-Time Augmentation

Authors: Debesh Jha, Pia H. Smedsrud, Dag Johansen, Thomas de Lange, Håvard D. Johansen, Pål Halvorsen, Michael A. Riegler
Affiliations: UiT The Arctic University of Norway; University of Oslo; Oslo Metropolitan University; Sahlgrenska University Hospital
Note: Accepted at IEEE Journal of Biomedical and Health Informatics
Link: https://arxiv.org/abs/2107.12435
Abstract: Colonoscopy is considered the gold standard for detection of colorectal cancer and its precursors. Existing examination methods are, however, hampered by a high overall miss-rate, and many abnormalities are left undetected. Computer-Aided Diagnosis systems based on advanced machine learning algorithms are touted as a game-changer that can identify regions in the colon overlooked by physicians during endoscopic examinations, and help detect and characterize lesions. In previous work, we proposed the ResUNet++ architecture and demonstrated that it produces more efficient results compared with its counterparts U-Net and ResUNet. In this paper, we demonstrate that further improvements to the overall prediction performance of the ResUNet++ architecture can be achieved by using conditional random fields and test-time augmentation. We have performed extensive evaluations and validated the improvements using six publicly available datasets: Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS-Larib Polyp DB, ASU-Mayo Clinic Colonoscopy Video Database, and CVC-VideoClinicDB. Moreover, we compare our proposed architecture and resulting model with other state-of-the-art methods. To explore the generalization capability of ResUNet++ on different publicly available polyp datasets, so that it could be used in a real-world setting, we performed an extensive cross-dataset evaluation. The experimental results show that applying CRF and TTA improves the performance on various polyp segmentation datasets in both same-dataset and cross-dataset settings.
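
Test-time augmentation for segmentation is simple to state in code: run the model over flipped copies of the input, invert each flip on the prediction, and average. A minimal sketch (the flip set is an assumption; the CRF post-processing step is omitted):

```python
import torch

@torch.no_grad()
def tta_predict(model, image):
    """Average sigmoid outputs over flip augmentations, un-flipping each
    prediction before averaging. image: (B, C, H, W)."""
    flips = [[], [2], [3], [2, 3]]                 # dims to flip: none, V, H, both
    preds = []
    for dims in flips:
        x = torch.flip(image, dims) if dims else image
        p = torch.sigmoid(model(x))
        preds.append(torch.flip(p, dims) if dims else p)
    return torch.stack(preds).mean(dim=0)          # averaged segmentation map
```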

【5】 Improved-Mask R-CNN: Towards an Accurate Generic MSK MRI instance segmentation platform (Data from the Osteoarthritis Initiative)

Authors: Banafshe Felfeliyan, Abhilash Hareendranathan, Gregor Kuntze, Jacob L. Jaremko, Janet L. Ronsky
Affiliations: Schulich School of Engineering, University of Calgary; McCaig Institute for Bone and Joint Health, University of Calgary, Calgary; Department of Radiology & Diagnostic Imaging, University of Alberta
Link: https://arxiv.org/abs/2107.12889
Abstract: Objective assessment of Magnetic Resonance Imaging (MRI) scans of osteoarthritis (OA) can address the limitations of current OA assessment. Segmentation of bone, cartilage, and joint fluid is necessary for objective OA assessment. Most of the proposed segmentation methods do not perform instance segmentation and suffer from class imbalance problems. This study deployed Mask R-CNN instance segmentation and improved it (improved-Mask R-CNN, iMaskRCNN) to obtain a more accurate generalized segmentation of OA-associated tissues. Training and validation were performed using 500 MRI knees from the Osteoarthritis Initiative (OAI) dataset and 97 MRI scans of patients with symptomatic hip OA. Three modifications to Mask R-CNN yielded the iMaskRCNN: adding a second ROIAligned block, adding an extra decoder layer to the mask header, and connecting them by a skip connection. The results were assessed using Hausdorff distance, Dice score, and coefficients of variation (CoV). The iMaskRCNN led to improved bone and cartilage segmentation compared to Mask R-CNN, as indicated by the increase in Dice score from 95% to 98% for the femur, 95% to 97% for the tibia, 71% to 80% for femoral cartilage, and 81% to 82% for tibial cartilage. For effusion detection, Dice improved with iMaskRCNN (72%) versus Mask R-CNN (71%). The CoV values for effusion detection between Reader1 and Mask R-CNN (0.33), Reader1 and iMaskRCNN (0.34), Reader2 and Mask R-CNN (0.22), and Reader2 and iMaskRCNN (0.29) are close to the CoV between the two readers (0.21), indicating high agreement between the human readers and both Mask R-CNN and iMaskRCNN. Mask R-CNN and iMaskRCNN can reliably and simultaneously extract different-scale articular tissues involved in OA, forming the foundation for automated OA assessment. The iMaskRCNN results show that the modifications improved the network performance around the edges.

【6】 A persistent homology-based topological loss for CNN-based multi-class segmentation of CMR

Authors: Nick Byrne, James R. Clough, Isra Valverde, Giovanni Montana, Andrew P. King
Affiliations: Medical Physics, Guy's and St Thomas' NHS Foundation Trust, London, UK; School of Biomedical Engineering & Imaging Sciences, King's College London, UK; Paediatric Cardiology, Evelina London Children's Hospital, UK
Link: https://arxiv.org/abs/2107.12689
Abstract: Multi-class segmentation of cardiac magnetic resonance (CMR) images seeks a separation of data into anatomical components with known structure and configuration. The most popular CNN-based methods are optimised using pixel-wise loss functions, ignorant of the spatially extended features that characterise anatomy. Therefore, whilst sharing a high spatial overlap with the ground truth, inferred CNN-based segmentations can lack coherence, including spurious connected components, holes and voids. Such results are implausible, violating anticipated anatomical topology. In response, (single-class) persistent homology-based loss functions have been proposed to capture global anatomical features. Our work extends these approaches to the task of multi-class segmentation. Building an enriched topological description of all class labels and class label pairs, our loss functions make predictable and statistically significant improvements in segmentation topology using a CNN-based post-processing framework. We also present (and make available) a highly efficient implementation based on cubical complexes and parallel execution, enabling practical application within high-resolution 3D data for the first time. We demonstrate our approach on 2D short-axis and 3D whole-heart CMR segmentation, advancing a detailed and faithful analysis of performance on two publicly available datasets.

【7】 Sharp U-Net: Depthwise Convolutional Network for Biomedical Image Segmentation

Authors: Hasib Zunair, A. Ben Hamza
Affiliations: Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada
Link: https://arxiv.org/abs/2107.12461
Abstract: The U-Net architecture, built upon the fully convolutional network, has proven to be effective in biomedical image segmentation. However, U-Net applies skip connections to merge semantically different low- and high-level convolutional features, resulting not only in blurred feature maps, but also in over- and under-segmented target regions. To address these limitations, we propose a simple yet effective end-to-end depthwise encoder-decoder fully convolutional network architecture, called Sharp U-Net, for binary and multi-class biomedical image segmentation. The key rationale of Sharp U-Net is that instead of applying a plain skip connection, a depthwise convolution of the encoder feature map with a sharpening kernel filter is employed prior to merging the encoder and decoder features, thereby producing a sharpened intermediate feature map of the same size as the encoder map. Using this sharpening filter layer, we are able not only to fuse semantically less dissimilar features, but also to smooth out artifacts throughout the network layers during the early stages of training. Our extensive experiments on six datasets show that the proposed Sharp U-Net model consistently outperforms or matches the recent state-of-the-art baselines in both binary and multi-class segmentation tasks, while adding no extra learnable parameters. Furthermore, Sharp U-Net outperforms baselines that have more than three times the number of learnable parameters.
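
The sharpening skip connection is easy to sketch: a fixed 3x3 sharpening kernel applied depthwise (one kernel per channel, no learnable parameters) to the encoder map before merging. An illustrative PyTorch version, assuming the classic unsharp kernel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharpenSkip(nn.Module):
    """Depthwise convolution of the encoder map with a fixed 3x3 sharpening
    kernel before it is merged with the decoder features."""
    def __init__(self, channels):
        super().__init__()
        k = torch.tensor([[0., -1., 0.],
                          [-1., 5., -1.],
                          [0., -1., 0.]])                 # classic sharpening filter
        # one copy of the kernel per channel; registered as a buffer (not learnable)
        self.register_buffer("weight", k.expand(channels, 1, 3, 3).clone())
        self.channels = channels

    def forward(self, enc_feat):
        # groups=C => depthwise: each channel is filtered independently
        return F.conv2d(enc_feat, self.weight, padding=1, groups=self.channels)
```

Since the kernel is fixed, the layer adds no learnable parameters, consistent with the paper's claim.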

Zero/Few-Shot | Transfer | Domain Adaptation (5 papers)

【1】 A Physiologically-adapted Gold Standard for Arousal During a Stress Induced Scenario

Authors: Alice Baird, Lukas Stappen, Lukas Christ, Lea Schumann, Eva-Maria Meßner, Björn W. Schuller
Affiliations: Chair EIHW, University of Augsburg, Augsburg, Germany; KPP, University of Ulm, Ulm, Germany; GLAM, Imperial College London, London, United Kingdom
Link: https://arxiv.org/abs/2107.12964
Abstract: Emotion is an inherently subjective psychophysiological human state, and producing an agreed-upon representation (gold standard) for continuous emotion requires a time-consuming and costly training procedure involving multiple human annotators. There is strong evidence in the literature that physiological signals are sufficient objective markers for states of emotion, particularly arousal. In this contribution, we utilise a dataset which includes continuous emotion and physiological signals - heartbeats per minute (BPM), electrodermal activity (EDA), and respiration rate - captured during a stress-induced scenario (Trier Social Stress Test). We utilise a Long Short-Term Memory recurrent neural network to explore the benefit of fusing these physiological signals with arousal as the target, learning from various audio, video, and text-based features. We utilise the state-of-the-art MuSe-Toolbox to consider both annotation delay and inter-rater agreement weighting when fusing the target signals. An improvement in Concordance Correlation Coefficient (CCC) is seen across feature sets when fusing EDA with arousal, compared to the arousal-only gold standard results. Additionally, BERT-based textual features' results improved for arousal plus all physiological signals, obtaining up to .3344 CCC compared to .2118 CCC for arousal only. Multimodal fusion also improves overall CCC, with audio plus video features obtaining up to .6157 CCC when recognizing arousal plus EDA and BPM.

【2】 Coarse to Fine: Domain Adaptive Crowd Counting via Adversarial Scoring Network

Authors: Zhikang Zou, Xiaoye Qu, Pan Zhou, Shuangjie Xu, Xiaoqing Ye, Wenhao Wu, Jin Ye
Affiliations: Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology; Department of Computer Vision Technology (VIS), Baidu Inc., China
Note: Accepted by ACMMM2021
Link: https://arxiv.org/abs/2107.12858
Abstract: Recent deep networks have convincingly demonstrated high capability in crowd counting, a critical task attracting widespread attention due to its various industrial applications. Despite such progress, trained data-dependent models usually do not generalize well to unseen scenarios because of the inherent domain shift. To alleviate this issue, this paper proposes a novel adversarial scoring network (ASNet) to gradually bridge the gap across domains from coarse to fine granularity. Specifically, at the coarse-grained stage, we design a dual-discriminator strategy to adapt the source domain to be close to the target from the perspectives of both global and local feature space via adversarial learning. The distributions between the two domains can thus be aligned roughly. At the fine-grained stage, we explore the transferability of source characteristics by scoring how similar the source samples are to target ones at multiple levels, based on generative probabilities derived from the coarse stage. Guided by these hierarchical scores, the transferable source features are properly selected to enhance knowledge transfer during the adaptation process. With the coarse-to-fine design, the generalization bottleneck induced by the domain discrepancy can be effectively alleviated. Three sets of migration experiments show that the proposed method achieves state-of-the-art counting performance compared with major unsupervised methods.

【3】 Adaptive Denoising via GainTuning

Authors: Sreyas Mohan, Joshua L. Vincent, Ramon Manzorro, Peter A. Crozier, Eero P. Simoncelli, Carlos Fernandez-Granda
Affiliations: Center for Data Science, NYU; School for Engineering of Matter, Transport and Energy, ASU; Center for Neural Science, NYU and Flatiron Institute, Simons Foundation; Courant Institute of Mathematical Sciences, NYU
Link: https://arxiv.org/abs/2107.12815
Abstract: Deep convolutional neural networks (CNNs) for image denoising are usually trained on large datasets. These models achieve the current state of the art, but they have difficulties generalizing when applied to data that deviate from the training distribution. Recent work has shown that it is possible to train denoisers on a single noisy image. These models adapt to the features of the test image, but their performance is limited by the small amount of information used to train them. Here we propose "GainTuning", in which CNN models pre-trained on large datasets are adaptively and selectively adjusted for individual test images. To avoid overfitting, GainTuning optimizes a single multiplicative scaling parameter (the "Gain") of each channel in the convolutional layers of the CNN. We show that GainTuning improves state-of-the-art CNNs on standard image-denoising benchmarks, boosting their denoising performance on nearly every image in a held-out test set. These adaptive improvements are even more substantial for test images differing systematically from the training data, either in noise level or image type. We illustrate the potential of adaptive denoising in a scientific application, in which a CNN is trained on synthetic data and tested on real transmission-electron-microscope images. In contrast to the existing methodology, GainTuning is able to faithfully reconstruct the structure of catalytic nanoparticles from these data at extremely low signal-to-noise ratios.
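
GainTuning's core recipe, freezing all pretrained weights and optimizing one multiplicative gain per convolutional channel on the single test image, can be sketched as below. The self-supervised loss is left abstract, and the hook-based gain injection is our implementation assumption, not the authors' code:

```python
import torch

def gain_tune(model, noisy, loss_fn, steps=50, lr=1e-3):
    """Adapt a pretrained denoiser to one test image by optimizing only a
    per-channel gain after each conv layer. `loss_fn(output, noisy)` stands
    in for a self-supervised objective (e.g. a blind-spot loss)."""
    for p in model.parameters():
        p.requires_grad_(False)                    # freeze all pretrained weights
    gains = []
    for mod in model.modules():
        if isinstance(mod, torch.nn.Conv2d):
            g = torch.ones(mod.out_channels, 1, 1,
                           device=noisy.device, requires_grad=True)
            gains.append(g)
            # forward hook multiplies each output channel by its gain
            mod.register_forward_hook(lambda m, i, o, g=g: o * g)
    opt = torch.optim.Adam(gains, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(noisy), noisy)
        loss.backward()
        opt.step()
    return model
```

Because only the gains (a few parameters per layer) are optimized, overfitting to the single image is limited by construction.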

【4】 Nearest Neighborhood-Based Deep Clustering for Source Data-absent Unsupervised Domain Adaptation

Authors: Song Tang, Yan Yang, Zhiyuan Ma, Norman Hendrich, Fanyu Zeng, Shuzhi Sam Ge, Changshui Zhang, Jianwei Zhang
Affiliations: University of Shanghai for Science and Technology, China; Institute of Machine Intelligence
Link: https://arxiv.org/abs/2107.12585
Abstract: In the classic setting of unsupervised domain adaptation (UDA), the labeled source data are available in the training phase. However, in many real-world scenarios, for reasons such as privacy protection and information security, the source data is inaccessible and only a model trained on the source domain is available. This paper proposes a novel deep clustering method for this challenging task. Aiming at dynamical clustering at the feature level, we introduce extra constraints hidden in the geometric structure between data to assist the process. Concretely, we propose a geometry-based constraint, named semantic consistency on the nearest neighborhood (SCNNH), and use it to encourage robust clustering. To reach this goal, we construct the nearest neighborhood for every target data point and take it as the fundamental clustering unit by building our objective on the geometry. Also, we develop a more SCNNH-compliant structure with an additional semantic credibility constraint, named semantic hyper-nearest neighborhood (SHNNH). After that, we extend our method to this new geometry. Extensive experiments on three challenging UDA datasets indicate that our method achieves state-of-the-art results. The proposed method shows significant improvement on all datasets (as we adopt SHNNH, the average accuracy increases by over 3.0% on the large-scale dataset). Code is available at https://github.com/tntek/N2DCX.

【5】 H3D-Net: Few-Shot High-Fidelity 3D Head Reconstruction

Authors: Eduard Ramon, Gil Triginer, Janna Escur, Albert Pumarola, Jaime Garcia, Xavier Giro-i-Nieto, Francesc Moreno-Noguer
Affiliations: Crisalix SA; Universitat Politecnica de Catalunya; Institut de Robotica i Informatica Industrial, CSIC-UPC; project page: crisalixsa.github.io/h3d-net
Link: https://arxiv.org/abs/2107.12512
Abstract: Recent learning approaches that implicitly represent surface geometry using coordinate-based neural representations have shown impressive results in the problem of multi-view 3D reconstruction. The effectiveness of these techniques is, however, subject to the availability of a large number (several tens) of input views of the scene, and to computationally demanding optimizations. In this paper, we tackle these limitations for the specific problem of few-shot full 3D head reconstruction, by endowing coordinate-based representations with a probabilistic shape prior that enables faster convergence and better generalization when using few input images (down to three). First, we learn a shape model of 3D heads from thousands of incomplete raw scans using implicit representations. At test time, we jointly overfit two coordinate-based neural networks to the scene, one modeling the geometry and another estimating the surface radiance, using implicit differentiable rendering. We devise a two-stage optimization strategy in which the learned prior is used to initialize and constrain the geometry during an initial optimization phase. Then, the prior is unfrozen and fine-tuned to the scene. By doing this, we achieve high-fidelity head reconstructions, including hair and shoulders, with a high level of detail that consistently outperforms both state-of-the-art 3D Morphable Model methods in the few-shot scenario and non-parametric methods when large sets of views are available.

Semi-/Weakly-/Unsupervised | Active Learning | Uncertainty (3 papers)

【1】 Energy-Based Open-World Uncertainty Modeling for Confidence Calibration

Authors: Yezhen Wang, Bo Li, Tong Che, Kaiyang Zhou, Dongsheng Li, Ziwei Liu
Affiliations: Microsoft Research Asia; MILA; S-Lab, Nanyang Technological University
Note: ICCV 2021 (Poster)
Link: https://arxiv.org/abs/2107.12628
Abstract: Confidence calibration is of great importance to the reliability of decisions made by machine learning systems. However, discriminative classifiers based on deep neural networks are often criticized for producing overconfident predictions that fail to reflect the true correctness likelihood of classification accuracy. We argue that such an inability to model uncertainty is mainly caused by the closed-world nature of softmax: a model trained by the cross-entropy loss will be forced to classify input into one of $K$ pre-defined categories with high probability. To address this problem, we propose, for the first time, a novel $K$+1-way softmax formulation, which incorporates the modeling of open-world uncertainty as the extra dimension. To unify the learning of the original $K$-way classification task and the extra dimension that models uncertainty, we propose a novel energy-based objective function, and moreover, theoretically prove that optimizing such an objective essentially forces the extra dimension to capture the marginal data distribution. Extensive experiments show that our approach, Energy-based Open-World Softmax (EOW-Softmax), is superior to existing state-of-the-art methods in improving confidence calibration.
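A hedged sketch of how a K+1-way softmax with an energy term might look. The coupling between the extra logit and the energy below is an illustrative assumption, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def eow_softmax_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 0.1):
    """
    logits: (B, K+1), where the last column is the open-world uncertainty dimension.
    targets: (B,) with labels in [0, K-1].
    Cross-entropy over the K+1-way softmax, plus a term tying the extra logit
    to the free energy of the K in-distribution logits.
    """
    ce = F.cross_entropy(logits, targets)                  # K+1-way classification
    energy = -torch.logsumexp(logits[:, :-1], dim=1)       # energy of the K classes
    reg = F.mse_loss(logits[:, -1], energy.detach())       # illustrative coupling
    return ce + alpha * reg
```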

【2】 Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Authors: Fa-Ting Hong, Jia-Chang Feng, Dan Xu, Ying Shan, Wei-Shi Zheng
Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; Peng Cheng Laboratory, Shenzhen, China; Applied Research Center (ARC), Tencent PCG, Shenzhen, China
Note: ACM International Conference on Multimedia, 2021
Link: https://arxiv.org/abs/2107.12589
Abstract: Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video with video-level categorical supervision. Previous works use both appearance and motion features, but do not utilize them properly, applying only simple concatenation or score-level fusion. In this work, we argue that the features extracted from a pretrained extractor, e.g., I3D, are not WS-TAL task-specific features, thus feature re-calibration is needed to reduce the task-irrelevant information redundancy. Therefore, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we mainly introduce two identical cross-modal consensus modules (CCM), which design a cross-modal attention mechanism to filter out the task-irrelevant information redundancy using the global information from the main modality and the cross-modal local information of the auxiliary modality. Moreover, we treat the attention weights derived from each CCM as the pseudo targets of the attention weights derived from the other CCM to maintain consistency between the predictions derived from the two CCMs, forming a mutual learning manner. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, to verify our method, and achieve state-of-the-art results. The experimental results show that our proposed cross-modal consensus module can produce more representative features for temporal action localization.
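A rough sketch of a cross-modal consensus module as described: the global context of the main modality (e.g., RGB) gates the channels of the auxiliary modality's (e.g., flow) snippet features. The layer sizes and the sigmoid gating form are assumptions:

```python
import torch
import torch.nn as nn

class CCM(nn.Module):
    """Global context of the main modality gates the auxiliary modality's channels."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, main_feat: torch.Tensor, aux_feat: torch.Tensor):
        # main_feat, aux_feat: (B, T, D) snippet-level features (e.g., RGB and flow)
        g = main_feat.mean(dim=1, keepdim=True)            # (B, 1, D) global context
        attn = self.gate(torch.cat([g.expand_as(aux_feat), aux_feat], dim=-1))
        return aux_feat * attn                             # re-calibrated features
```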

【3】 MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Authors: Pan Ji, Runze Li, Bir Bhanu, Yi Xu
Affiliations: OPPO US Research Center, InnoPeak Technology, Inc.; University of California, Riverside
Note: Accepted to ICCV 2021
Link: https://arxiv.org/abs/2107.12429
Abstract: Self-supervised depth estimation for indoor environments is more challenging than its outdoor counterpart in at least the following two aspects: (i) the depth range of indoor sequences varies a lot across different frames, making it difficult for the depth network to induce consistent depth cues, whereas the maximum distance in outdoor scenes mostly stays the same as the camera usually sees the sky; (ii) indoor sequences contain many more rotational motions, which cause difficulties for the pose network, while the motions of outdoor sequences are predominantly translational, especially for driving datasets such as KITTI. In this paper, special considerations are given to those challenges and a set of good practices is consolidated for improving the performance of self-supervised monocular depth estimation in indoor environments. The proposed method mainly consists of two novel modules, i.e., a depth factorization module and a residual pose estimation module, each of which is designed to respectively tackle the aforementioned challenges. The effectiveness of each module is shown through a carefully conducted ablation study and the demonstration of state-of-the-art performance on two indoor datasets, i.e., EuRoC and NYUv2.

Temporal | Action Recognition | Pose | Video | Motion Estimation (3 papers)

【1】 Enriching Local and Global Contexts for Temporal Action Localization

Authors: Zixin Zhu, Wei Tang, Le Wang, Nanning Zheng, Gang Hua
Affiliations: Xi'an Jiaotong University; University of Illinois at Chicago; Wormpex AI Research
Note: Accepted by ICCV 2021
Link: https://arxiv.org/abs/2107.12960
Abstract: Effectively tackling the problem of temporal action localization (TAL) necessitates a visual representation that jointly pursues two confounding goals, i.e., fine-grained discrimination for temporal localization and sufficient visual invariance for action classification. We address this challenge by enriching both the local and global contexts in the popular two-stage temporal localization framework, where action proposals are first generated, followed by action classification and temporal boundary regression. Our proposed model, dubbed ContextLoc, can be divided into three sub-networks: L-Net, G-Net and P-Net. L-Net enriches the local context via fine-grained modeling of snippet-level features, formulated as a query-and-retrieval process. G-Net enriches the global context via higher-level modeling of the video-level representation. In addition, we introduce a novel context adaptation module to adapt the global context to different proposals. P-Net further models the context-aware inter-proposal relations. We explore two existing models to serve as the P-Net in our experiments. The efficacy of our proposed method is validated by experimental results on the THUMOS14 (54.3% at IoU@0.5) and ActivityNet v1.3 (51.24% at IoU@0.5) datasets, outperforming recent state-of-the-art methods.

【2】 Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021

Authors: Haisheng Su, Peiqin Zhuang, Yukun Li, Dongliang Wang, Weihao Gan, Wei Wu, Yu Qiao
Affiliations: SenseTime Research; SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shanghai AI Laboratory, Shanghai, China
Note: Winner of HACS21 Challenge Weakly Supervised Learning Track with extra data. arXiv admin note: text overlap with arXiv:2103.13141
Link: https://arxiv.org/abs/2107.12618
Abstract: This technical report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge on both the Supervised Learning Track and the Weakly-Supervised Learning Track. Temporal Action Localization (TAL) requires not only precisely locating the temporal boundaries of action instances, but also accurately classifying the untrimmed videos into specific categories. Weakly-Supervised TAL, however, means locating the action instances using only video-level class labels. In this paper, to train a supervised temporal action localizer, we adopt the Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context aggregation and complementary as well as progressive boundary refinement. For WSTAL, a novel framework is proposed to handle the poor quality of CAS generated by a simple classification network, which can only focus on local discriminative parts rather than locate the entire interval of target actions. Further inspired by transfer learning, we also adopt an additional module to transfer knowledge from trimmed videos (the HACS Clips dataset) to untrimmed videos (the HACS Segments dataset), aiming at promoting the classification performance on untrimmed videos. Finally, we employ a boundary regression module embedded with the Outer-Inner-Contrastive (OIC) loss to automatically predict the boundaries based on the enhanced CAS. Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing sets of the supervised and weakly-supervised temporal action localization tracks, respectively.

【3】 Disentangled Implicit Shape and Pose Learning for Scalable 6D Pose Estimation

Authors: Yilin Wen, Xiangyu Li, Hao Pan, Lei Yang, Zheng Wang, Taku Komura, Wenping Wang
Affiliations: The University of Hong Kong; Brown University; Microsoft Research Asia; SUSTech
Link: https://arxiv.org/abs/2107.12549
Abstract: 6D pose estimation of rigid objects from a single RGB image has seen tremendous improvements recently by using deep learning to combat complex real-world variations, but a majority of methods build models on the per-object level, failing to scale to multiple objects simultaneously. In this paper, we present a novel approach for scalable 6D pose estimation, by self-supervised learning on synthetic data of multiple objects using a single autoencoder. To handle multiple objects and generalize to unseen objects, we disentangle the latent object shape and pose representations, so that the latent shape space models shape similarities, and the latent pose code is used for rotation retrieval by comparison with canonical rotations. To encourage shape space construction, we apply contrastive metric learning and enable the processing of unseen objects by referring to similar training objects. The different symmetries across objects induce inconsistent latent pose spaces, which we capture with a conditioned block producing shape-dependent pose codebooks by re-entangling shape and pose representations. We test our method on two multi-object benchmarks with real data, T-LESS and NOCS REAL275, and show that it outperforms existing RGB-based methods in terms of pose estimation accuracy and generalization.

Medical (1 paper)

【1】 Technical Report: Quality Assessment Tool for Machine Learning with Clinical CT

Authors: Riqiang Gao, Mirza S. Khan, Yucheng Tang, Kaiwen Xu, Steve Deppen, Yuankai Huo, Kim L. Sandler, Pierre P. Massion, Bennett A. Landman
Affiliations: Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA; Vanderbilt University Medical Center, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
Note: 18 pages, 13 figures, technical report
Link: https://arxiv.org/abs/2107.12842
Abstract: Image Quality Assessment (IQA) is important for scientific inquiry, especially in medical imaging and machine learning. Potential data quality issues can be exacerbated when human-based workflows use limited views of the data that may obscure digital artifacts. In practice, multiple factors such as network issues, accelerated acquisitions, motion artifacts, and imaging protocol design can impede the interpretation of image collections. The medical image processing community has developed a wide variety of tools for the inspection and validation of imaging data. Yet, IQA of computed tomography (CT) remains an under-recognized challenge, and no user-friendly tool is commonly available to address these potential issues. Here, we create and illustrate a pipeline specifically designed to identify and resolve issues encountered with large-scale data mining of clinically acquired CT data. Using the widely studied National Lung Screening Trial (NLST), we have identified approximately 4% of image volumes with quality concerns out of 17,392 scans. To assess robustness, we applied the proposed pipeline to our internal datasets, where we find the tool generalizes to clinically acquired medical images. In conclusion, the tool has been useful and time-saving for research studies of clinical data, and the code and tutorials are publicly available at https://github.com/MASILab/QA_tool.

GAN | Adversarial | Attacks | Generation (2 papers)

【1】 Image Scene Graph Generation (SGG) Benchmark

Authors: Xiaotian Han, Jianwei Yang, Houdong Hu, Lei Zhang, Jianfeng Gao, Pengchuan Zhang
Affiliations: Microsoft Cloud + AI; Microsoft Research at Redmond
Link: https://arxiv.org/abs/2107.12604
Abstract: There is a surge of interest in image scene graph generation (object, attribute and relationship detection) due to the need to build fine-grained image understanding models that go beyond object detection. Due to the lack of a good benchmark, the reported results of different scene graph generation models are not directly comparable, impeding research progress. We have developed a much-needed scene graph generation benchmark based on the maskrcnn-benchmark and several popular models. This paper presents the main features of our benchmark and a comprehensive ablation study of scene graph generation models using the Visual Genome and OpenImages Visual Relationship Detection datasets. Our codebase is made publicly available at https://github.com/microsoft/scene_graph_benchmark.

【2】 Adversarial Attacks with Time-Scale Representations

Authors: Alberto Santamaria-Pang, Jianwei Qiu, Aritra Chowdhury, James Kubricht, Peter Tu, Naresh Iyer, Nurali Virani
Affiliations: GE Research, Research Circle, Niskayuna, NY
Link: https://arxiv.org/abs/2107.12473
Abstract: We propose a novel framework for real-time black-box universal attacks which disrupts activations of early convolutional layers in deep learning models. Our hypothesis is that perturbations produced in the wavelet space disrupt early convolutional layers more effectively than perturbations performed in the time domain. The main challenge in adversarial attacks is to preserve low-frequency image content while minimally changing the most meaningful high-frequency content. To address this, we formulate an optimization problem using time-scale (wavelet) representations as a dual space in three steps. First, we project original images into orthonormal sub-spaces for low and high scales via wavelet coefficients. Second, we perturb the wavelet coefficients of the high-scale projection using a generator network. Third, we generate new adversarial images by projecting back the original coefficients from the low-scale and the perturbed coefficients from the high-scale sub-space. We provide a theoretical framework that guarantees a dual mapping between time and time-scale domain representations. We compare our results with state-of-the-art black-box attacks from generative-based and gradient-based models. We also verify efficacy against multiple defense methods such as JPEG compression, Guided Denoiser and ComDefend. Our results show that wavelet-based perturbations consistently outperform time-based attacks, thus providing new insights into the vulnerabilities of deep learning models, and could potentially lead to robust architectures or new defense and attack mechanisms by leveraging time-scale representations.
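A small sketch of the three-step procedure using PyWavelets: split the image into low/high-scale sub-bands, perturb only the detail coefficients, and reconstruct. The paper uses a generator network for the perturbation; the additive-noise `perturb_fn` here is a toy stand-in:

```python
import numpy as np
import pywt

def wavelet_perturb(img: np.ndarray, perturb_fn, wavelet: str = "db2") -> np.ndarray:
    """img: 2D grayscale array in [0, 1]; perturb_fn stands in for the generator."""
    cA, (cH, cV, cD) = pywt.dwt2(img, wavelet)             # low/high-scale sub-spaces
    cH, cV, cD = perturb_fn(cH), perturb_fn(cV), perturb_fn(cD)  # perturb details only
    adv = pywt.idwt2((cA, (cH, cV, cD)), wavelet)          # project back to image space
    return np.clip(adv, 0.0, 1.0)

rng = np.random.default_rng(0)
noise = lambda c: c + 0.05 * rng.standard_normal(c.shape)  # toy stand-in generator
# adv = wavelet_perturb(img, noise)
```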

Autonomous Driving | Vehicles | Lane Detection (2 papers)

【1】 Predicting Take-over Time for Autonomous Driving with Real-World Data: Robust Data Augmentation, Models, and Evaluation

Authors: Akshay Rangesh, Nachiket Deo, Ross Greer, Pujitha Gunaratne, Mohan M. Trivedi
Affiliations: Laboratory for Intelligent & Safe Automobiles, UC San Diego; Toyota Collaborative Safety Research Center
Note: Journal version of arXiv:2104.11489
Link: https://arxiv.org/abs/2107.12932
Abstract: Understanding occupant-vehicle interactions by modeling control transitions is important to ensure safe approaches to passenger vehicle automation. Models which contain contextual, semantically meaningful representations of driver states can be used to determine the appropriate timing and conditions for transfer of control between driver and vehicle. However, such models rely on real-world control take-over data from drivers engaged in distracting activities, which is costly to collect. Here, we introduce a scheme for data augmentation for such a dataset. Using the augmented dataset, we develop and train take-over time (TOT) models that operate sequentially on mid- and high-level features produced by computer vision algorithms operating on different driver-facing camera views, showing that models trained on the augmented dataset outperform those trained on the initial dataset. The demonstrated model features encode different aspects of the driver state, pertaining to the face, hands, foot and upper body of the driver. We perform ablative experiments on feature combinations as well as model architectures, showing that a TOT model supported by augmented data can produce continuous estimates of take-over times without delay, suitable for complex real-world scenarios.

【2】 Analyzing vehicle pedestrian interactions combining data cube structure and predictive collision risk estimation model

Authors: Byeongjoon Noh, Hansaem Park, Hwasoo Yeo
Affiliations: Applied Science Research Institute, Korea Advanced Institute of Science and Technology; Department of Civil and Environmental Engineering, Korea Advanced Institute of Science and Technology, Daehak-ro, Yuseung-gu, Daejeon, Republic of Korea
Note: 33 pages, 19 figures
Link: https://arxiv.org/abs/2107.12507
Abstract: Traffic accidents are a threat to human lives, with pedestrians in particular suffering premature deaths. Therefore, it is necessary to devise systems that prevent accidents in advance and respond proactively, using potential risky situations as one of the surrogate safety measures. This study introduces a new concept of a pedestrian safety system that combines field and centralized processes. The system can warn of upcoming risks immediately in the field and improve the safety of risk-frequent areas by assessing the safety levels of roads without actual collisions. In particular, this study focuses on the latter by introducing a new analytical framework for crosswalk safety assessment with vehicle/pedestrian behaviors and environmental features. We obtain these behavioral features from actual traffic video footage in the city with fully automatic processing. The proposed framework mainly analyzes these behaviors from multidimensional perspectives by constructing a data cube structure, which combines an LSTM-based predictive collision risk (PCR) estimation model and online analytical processing (OLAP) operations. From the PCR estimation model, we categorize the severity of risks into four levels and apply the proposed framework to assess crosswalk safety with behavioral features. Our analytic experiments are based on two scenarios, and various descriptive results are harvested: the movement patterns of vehicles and pedestrians by road environment, and the relationships between risk levels and car speeds. Thus, the proposed framework can support decision makers by providing valuable information to improve pedestrian safety against future accidents, and it can help us better understand pedestrian behaviors near crosswalks proactively. In order to confirm the feasibility and applicability of the proposed framework, we implement and apply it to actually operating CCTVs in Osan City, Korea.

Faces | Crowd Counting (3 papers)

【1】 Learning Local Recurrent Models for Human Mesh Recovery

Authors: Runze Li, Srikrishna Karanam, Ren Li, Terrence Chen, Bir Bhanu, Ziyan Wu
Affiliations: United Imaging Intelligence, Cambridge, MA, USA; University of California, Riverside, Riverside, CA, USA
Note: 10 pages, 6 figures, 2 tables
Link: https://arxiv.org/abs/2107.12847
Abstract: We consider the problem of estimating frame-level full human body meshes given a video of a person with natural motion dynamics. While much progress in this field has been made in single image-based mesh estimation, there has been a recent uptick in efforts to infer mesh dynamics from video, given its role in alleviating issues such as depth ambiguity and occlusions. However, a key limitation of existing work is the assumption that all the observed motion dynamics can be modeled using one dynamical/recurrent model. While this may work well in cases with relatively simplistic dynamics, inference on in-the-wild videos presents many challenges. In particular, it is typically the case that different body parts of a person undergo different dynamics in the video, e.g., legs may move in a way that is dynamically different from hands (e.g., a person dancing). To address these issues, we present a new method for video mesh recovery that divides the human mesh into several local parts following the standard skeletal model. We then model the dynamics of each local part with separate recurrent models, with each model conditioned appropriately based on the known kinematic structure of the human body. This results in a structure-informed local recurrent learning architecture that can be trained in an end-to-end fashion with available annotations. We conduct a variety of experiments on standard video mesh recovery benchmark datasets such as Human3.6M, MPI-INF-3DHP, and 3DPW, demonstrating the efficacy of our design of modeling local dynamics as well as establishing state-of-the-art results based on standard evaluation metrics.

【2】 Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework

Authors: Qingyu Song, Changan Wang, Zhengkai Jiang, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Yang Wu
Affiliations: Tencent Youtu Lab; Applied Research Center (ARC), Tencent PCG
Note: To appear in ICCV 2021 (Oral)
Link: https://arxiv.org/abs/2107.12746
Abstract: Localizing individuals in crowds is more in accordance with the practical demands of subsequent high-level crowd analysis tasks than simply counting. However, existing localization-based methods relying on intermediate representations (i.e., density maps or pseudo boxes) serving as learning targets are counter-intuitive and error-prone. In this paper, we propose a purely point-based framework for joint crowd counting and individual localization. For this framework, instead of merely reporting the absolute counting error at image level, we propose a new metric, called density Normalized Average Precision (nAP), to provide more comprehensive and more precise performance evaluation. Moreover, we design an intuitive solution under this framework, called the Point to Point Network (P2PNet). P2PNet discards superfluous steps and directly predicts a set of point proposals to represent heads in an image, consistent with the human annotation results. Through thorough analysis, we reveal that the key step towards implementing such a novel idea is to assign optimal learning targets to these proposals. Therefore, we propose to conduct this crucial association in a one-to-one matching manner using the Hungarian algorithm. P2PNet not only significantly surpasses state-of-the-art methods on popular counting benchmarks, but also achieves promising localization accuracy. The code will be available at: https://github.com/TencentYoutuResearch/CrowdCounting-P2PNet.
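The crucial one-to-one association step can be sketched with SciPy's Hungarian solver; the cost weighting between point distance and proposal confidence is an assumption, not the paper's exact formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_proposals(pred_pts, pred_scores, gt_pts, dist_weight=0.05):
    """
    pred_pts: (N, 2) proposal coordinates, pred_scores: (N,) confidences,
    gt_pts: (M, 2) annotated head points, typically N >= M.
    Returns (row_idx, col_idx) pairing each matched proposal with one ground truth.
    """
    dist = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    cost = dist_weight * dist - pred_scores[:, None]       # near and confident = cheap
    return linear_sum_assignment(cost)                     # one-to-one assignment
```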

【3】 Uniformity in Heterogeneity: Diving Deep into Count Interval Partition for Crowd Counting

Authors: Changan Wang, Qingyu Song, Boshen Zhang, Yabiao Wang, Ying Tai, Xuyi Hu, Chengjie Wang, Jilin Li, Jiayi Ma, Yang Wu
Affiliations: Tencent Youtu Lab; Applied Research Center (ARC), Tencent PCG; Department of Electronic & Electrical Engineering, University College London, United Kingdom; Electronic Information School, Wuhan University, Wuhan, China
Note: To appear in ICCV 2021
Link: https://arxiv.org/abs/2107.12619
Abstract: Recently, the problem of inaccurate learning targets in crowd counting has drawn increasing attention. Inspired by a few pioneering works, we solve this problem by trying to predict the indices of pre-defined interval bins of counts instead of the count values themselves. However, an inappropriate interval setting might make the count error contributions from different intervals extremely imbalanced, leading to inferior counting performance. Therefore, we propose a novel count interval partition criterion called Uniform Error Partition (UEP), which always keeps the expected counting error contributions equal for all intervals to minimize the prediction risk. Then, to mitigate the discretization errors inevitably introduced in the count quantization process, we propose another criterion called Mean Count Proxies (MCP). The MCP criterion selects the best count proxy for each interval to represent its count value during inference, making the overall expected discretization error of an image nearly negligible. As far as we are aware, this work is the first to delve into such a classification task, and it ends up with a promising solution for count interval partition. Following the above two theoretically demonstrated criteria, we propose a simple yet effective model termed the Uniform Error Partition Network (UEPNet), which achieves state-of-the-art performance on several challenging datasets. The code will be available at: https://github.com/TencentYoutuResearch/CrowdCounting-UEPNet.
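A rough numerical stand-in for the two criteria: partition the empirical distribution of patch counts into intervals and represent each interval by the mean of the counts falling into it (the MCP idea). The quantile boundaries below are a simplification of UEP, which equalizes expected error contributions rather than frequencies:

```python
import numpy as np

def partition_and_proxies(counts: np.ndarray, n_bins: int = 25):
    """counts: local patch counts collected from the training set."""
    edges = np.unique(np.quantile(counts, np.linspace(0.0, 1.0, n_bins + 1)))
    bin_idx = np.clip(np.searchsorted(edges, counts, side="right") - 1,
                      0, len(edges) - 2)
    proxies = np.array([counts[bin_idx == b].mean() if np.any(bin_idx == b)
                        else 0.5 * (edges[b] + edges[b + 1])
                        for b in range(len(edges) - 1)])
    return edges, proxies  # classify a patch into a bin, then predict proxies[bin]
```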

Tracking (1 paper)

【1】 VIPose: Real-time Visual-Inertial 6D Object Pose Tracking

Authors: Rundong Ge, Giuseppe Loianno
Affiliations: New York University, Tandon School of Engineering
Note: Accepted by the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2021
Link: https://arxiv.org/abs/2107.12617
Abstract: Estimating the 6D pose of objects is beneficial for robotics tasks such as transportation, autonomous navigation and manipulation, as well as in scenarios beyond robotics like virtual and augmented reality. With respect to single-image pose estimation, pose tracking takes into account the temporal information across multiple frames to overcome possible detection inconsistencies and to improve the pose estimation efficiency. In this work, we introduce a novel Deep Neural Network (DNN) called VIPose that combines inertial and camera data to address the object pose tracking problem in real time. The key contribution is the design of a novel DNN architecture which fuses visual and inertial features to predict the objects' relative 6D pose between consecutive image frames. The overall 6D pose is then estimated by consecutively combining relative poses. Our approach shows remarkable pose estimation results for heavily occluded objects, which are well known to be very challenging for existing state-of-the-art solutions. The effectiveness of the proposed approach is validated on a new dataset called VIYCB with RGB images, IMU data, and accurate 6D pose annotations created by employing an automated labeling technique. The approach presents accuracy comparable to state-of-the-art techniques, but with the additional benefit of running in real time.
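The pose-chaining step is plain rigid-transform composition; a sketch with 4x4 homogeneous matrices (the network predicting each relative pose is omitted, and the left/right multiplication convention is an assumption):

```python
import numpy as np

def track(T0: np.ndarray, relative_poses) -> list:
    """
    T0: initial object pose as a 4x4 homogeneous matrix.
    relative_poses: per-frame 4x4 relative transforms predicted by the network
    (the network itself is omitted here).
    """
    poses = [T0]
    for T_rel in relative_poses:
        poses.append(poses[-1] @ T_rel)   # chain frame-to-frame relative poses
    return poses
```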

Pruning | Quantization | Acceleration | Compression (1 paper)

【1】 COPS: Controlled Pruning Before Training Starts

Authors: Paul Wimmer, Jens Mehnert, Alexandru Condurache
Affiliations: Robert Bosch GmbH & Lübeck University, Leonberg and Stuttgart, Germany
Note: Accepted by the International Joint Conference on Neural Networks (IJCNN) 2021
Link: https://arxiv.org/abs/2107.12673
Abstract: State-of-the-art deep neural network (DNN) pruning techniques, applied one-shot before training starts, evaluate sparse architectures with the help of a single criterion -- called a pruning score. Pruning weights based on a solitary score works well for some architectures and pruning rates but may also fail for others. As a common baseline for pruning scores, we introduce the notion of a generalized synaptic score (GSS). In this work we do not concentrate on a single pruning criterion, but provide a framework for combining arbitrary GSSs to create more powerful pruning strategies. These COmbined Pruning Scores (COPS) are obtained by solving a constrained optimization problem. Optimizing for more than one score prevents the sparse network from overly specializing on an individual task, and thus COntrols Pruning before training Starts. The combinatorial optimization problem given by COPS is relaxed to a linear program (LP). This LP is solved analytically and determines a solution for COPS. Furthermore, an algorithm to compute it for two scores numerically is proposed and evaluated. Solving COPS in such a way has lower complexity than the best general LP solver. In our experiments we compared pruning with COPS against state-of-the-art methods for different network architectures and image classification tasks and obtained improved results.
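A simplified sketch of combining two generalized synaptic scores (for example, |w| and |w·∇L| are common GSS instances) into a single pruning score. The paper derives the combination by solving a small LP analytically, whereas the convex weight `lam` here is left as a free parameter:

```python
import torch

def combine_scores(score_a: torch.Tensor, score_b: torch.Tensor, lam: float = 0.5):
    """Convex combination of two normalized scores, flattened over all weights."""
    a = score_a / score_a.sum()
    b = score_b / score_b.sum()
    return lam * a + (1 - lam) * b

def prune_mask(combined: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    """Keep the top (1 - sparsity) fraction of weights by combined score."""
    k = max(1, int((1 - sparsity) * combined.numel()))
    thresh = torch.topk(combined, k).values.min()
    return combined >= thresh
```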

Visual Explanation | Video Understanding & VQA | Captioning (1 paper)

【1】 Greedy Gradient Ensemble for Robust Visual Question Answering

Authors: Xinzhe Han, Shuhui Wang, Chi Su, Qingming Huang, Qi Tian
Affiliations: Key Lab of Intelligent Information Processing, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Kingsoft Cloud, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; Cloud BU, Huawei Technologies, Shenzhen, China
Note: Accepted by ICCV 2021. Code: this https URL
Link: https://arxiv.org/abs/2107.12651
Abstract: Language bias is a critical issue in Visual Question Answering (VQA), where models often exploit dataset biases for the final decision without considering the image information. As a result, they suffer from a performance drop on out-of-distribution data and inadequate visual explanation. Based on experimental analysis of existing robust VQA methods, we stress that the language bias in VQA comes from two aspects, i.e., distribution bias and shortcut bias. We further propose a new de-bias framework, Greedy Gradient Ensemble (GGE), which combines multiple biased models for unbiased base model learning. With the greedy strategy, GGE forces the biased models to over-fit the biased data distribution in priority, thus making the base model pay more attention to examples that are hard to solve by the biased models. The experiments demonstrate that our method makes better use of visual information and achieves state-of-the-art performance on the diagnostic dataset VQA-CP without using extra annotations.

Super-Resolution | Denoising | Deblurring | Dehazing (1 paper)

【1】 BridgeNet: A Joint Learning Network of Depth Map Super-Resolution and Monocular Depth Estimation

Authors: Qi Tang, Runmin Cong, Ronghui Sheng, Lingzhi He, Dan Zhang, Yao Zhao, Sam Kwong
Affiliations: Institute of Information Science, Beijing Jiaotong University, Beijing, China; Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China; UISEE Technology (Beijing) Co., Ltd., Beijing, China
Note: 10 pages, 7 figures, Accepted by ACM MM 2021
Link: https://arxiv.org/abs/2107.12541
Abstract: Depth map super-resolution is a task with high practical application requirements in industry. Existing color-guided depth map super-resolution methods usually necessitate an extra branch to extract high-frequency detail information from the RGB image to guide the low-resolution depth map reconstruction. However, because there are still some differences between the two modalities, direct information transmission in the feature dimension or edge map dimension cannot achieve satisfactory results, and may even trigger texture copying in areas where the structures of the RGB-D pair are inconsistent. Inspired by multi-task learning, we propose a joint learning network of depth map super-resolution (DSR) and monocular depth estimation (MDE) without introducing additional supervision labels. For the interaction of the two subnetworks, we adopt a differentiated guidance strategy and design two bridges accordingly. One is the high-frequency attention bridge (HABdg), designed for the feature encoding process, which learns the high-frequency information of the MDE task to guide the DSR task. The other is the content guidance bridge (CGBdg), designed for the depth map reconstruction process, which provides the content guidance learned from the DSR task for the MDE task. The entire network architecture is highly portable and can provide a paradigm for associating the DSR and MDE tasks. Extensive experiments on benchmark datasets demonstrate that our method achieves competitive performance. Our code and models are available at https://rmcong.github.io/proj_BridgeNet.html.

Point Clouds | SLAM | Radar | LiDAR | Depth & RGB-D (1 paper)

【1】 CKConv: Learning Feature Voxelization for Point Cloud Analysis

Authors: Sungmin Woo, Dogyoon Lee, Junhyeop Lee, Sangwon Hwang, Woojin Kim, Sangyoun Lee
Affiliations: Yonsei University
Link: https://arxiv.org/abs/2107.12655
Abstract: Despite the remarkable success of deep learning, the optimal convolution operation on point clouds remains indefinite due to their irregular data structure. In this paper, we present Cubic Kernel Convolution (CKConv), which learns to voxelize the features of local points by exploiting both continuous and discrete convolutions. Our continuous convolution uniquely employs a 3D cubic form of kernel weight representation that splits a feature into voxels in embedding space. By consecutively applying discrete 3D convolutions on the voxelized features in a spatial manner, the preceding continuous convolution is forced to learn spatial feature mapping, i.e., feature voxelization. In this way, geometric information can be detailed by encoding with subdivided features, and our 3D convolutions on these fixed structured data do not suffer from discretization artifacts thanks to voxelization in embedding space. Furthermore, we propose a spatial attention module, Local Set Attention (LSA), to provide comprehensive structure awareness within the local point set and hence produce representative features. By learning feature voxelization with LSA, CKConv can extract enriched features for effective point cloud analysis. We show that CKConv has great applicability to point cloud processing tasks including object classification, object part segmentation, and scene semantic segmentation, with state-of-the-art results.

Multimodal (2 papers)

【1】 Angel's Girl for Blind Painters: an Efficient Painting Navigation System Validated by Multimodal Evaluation Approach

Authors: Hang Liu, Menghan Hu, Yuzhen Chen, Qingli Li, Guangtao Zhai, Simon X. Yang, Xiao-Ping Zhang, Xiaokang Yang
Note: 13 pages, 18 figures
Link: https://arxiv.org/abs/2107.12921
Abstract: For people who ardently love painting but unfortunately have visual impairments, holding a paintbrush to create a work is a very difficult task. People in this special group are eager to pick up the paintbrush, like Leonardo da Vinci, to create and make full use of their own talents. Therefore, to maximally bridge this gap, we propose a painting navigation system to assist blind people in painting and artistic creation. The proposed system is composed of a cognitive system and a guidance system. It adopts drawing-board positioning based on QR codes, brush navigation based on target detection, and real-time brush positioning. Meanwhile, the system uses voice-based human-computer interaction and a simple but efficient position-information coding rule. In addition, we design a criterion to efficiently judge whether the brush reaches the target or not. According to the experimental results, the thermal curves extracted from the faces of testers show that the system is relatively well accepted by blindfolded and even blind testers. With a prompt frequency of 1 s, the painting navigation system performs best, with a completion degree of 89% (SD 8.37%) and an overflow degree of 347% (SD 162.14%). Meanwhile, excellent and good brush-tip trajectories account for 74%, and the relative movement distance is 4.21 (SD 2.51). This work demonstrates that it is practicable for blind people to feel the world through the brush in their hands. In the future, we plan to deploy Angel's Eyes on the phone to make it more portable. The demo video of the proposed painting navigation system is available at: https://doi.org/10.6084/m9.figshare.9760004.v1.

【2】 Multi-modal estimation of the properties of containers and their content: survey and evaluation

Authors: Alessio Xompero, Santiago Donaher, Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppola, Reina Ishikawa, Yuichi Nagao, Ryo Hachiuma, Qi Liu, Fan Feng, Chuanlin Lan, Rosa H. M. Chan, Guilherme Christmann, Jyun-Ting Song, Gonuguntla Neeharika, Chinnakotla Krishna Teja Reddy, Dinesh Jain, Bakhtawar Ur Rehman, Andrea Cavallaro
Affiliations: Keio University
Note: 13 pages, 9 tables, 5 figures, submitted to IEEE Transactions on Multimedia
Link: https://arxiv.org/abs/2107.12719
Abstract: Acoustic and visual sensing can support the contactless estimation of the weight of a container and the amount of its content when the container is manipulated by a person. However, transparencies (both of the container and of the content) and the variability of materials, shapes and sizes make this problem challenging. In this paper, we present an open benchmarking framework and an in-depth comparative analysis of recent methods that estimate the capacity of a container, as well as the type, mass, and amount of its content. These methods use learned and handcrafted features, such as mel-frequency cepstrum coefficients, zero-crossing rate and spectrograms, with different types of classifiers to estimate the type and amount of the content from acoustic data, and geometric approaches with visual data to determine the capacity of the container. Results on a newly distributed dataset show that audio alone is a strong modality: methods achieve a weighted average F1-score of up to 81% and 97% for content type and level classification, respectively. Estimating the container capacity with vision-only approaches and the filling mass with multi-modal, multi-stage algorithms reaches up to 65% weighted average capacity and mass scores.
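A minimal example of the audio pipeline the survey describes: mel-frequency cepstrum coefficients pooled over time, fed to an off-the-shelf classifier. The dataset paths and labels are placeholders:

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Mean/std-pooled MFCCs as a fixed-length descriptor of an audio clip."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# train_paths / train_labels are placeholders for an annotated pouring dataset:
# X = np.stack([mfcc_features(p) for p in train_paths])
# clf = SVC().fit(X, train_labels)                           # content-type classifier
```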

3D | 3D Reconstruction (1 paper)

【1】 Language Grounding with 3D Objects

Authors: Jesse Thomason, Mohit Shridhar, Yonatan Bisk, Chris Paxton, Luke Zettlemoyer
Affiliations: University of Southern California; University of Washington; Carnegie Mellon University; NVIDIA
Note: this https URL
Link: https://arxiv.org/abs/2107.12514
Abstract: Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" When viewing mice on the shelf, the number of buttons or presence of a wire may not be visible from certain angles or positions. Flat images of candidate mice may not provide the discriminative information needed for "wireless". The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the necessary exploration to accomplish the task. In particular, while substantial effort and progress have been made on understanding explicitly visual attributes like color and category, comparatively little progress has been made on understanding language about shapes and contours. In this work, we introduce a novel reasoning task that targets both visual and non-visual language about 3D objects. Our new benchmark, ShapeNet Annotated with Referring Expressions (SNARE), requires a model to choose which of two objects is being referenced by a natural language description. We introduce several CLIP-based models for distinguishing objects and demonstrate that, while recent advances in jointly modeling vision and language are useful for robotic language understanding, these models are still weaker at understanding the 3D nature of objects -- properties which play a key role in manipulation. In particular, we find that adding view estimation to language grounding models improves accuracy on both SNARE and when identifying objects referred to in language on a robot platform.
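A zero-shot baseline for the SNARE task is straightforward with an off-the-shelf CLIP: score the referring expression against the two candidate images and pick the higher-scoring one. This is a generic CLIP baseline, not the paper's proposed model:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_object(description: str, image_a, image_b) -> int:
    """Return 0 or 1 for whichever PIL image better matches the description."""
    inputs = processor(text=[description], images=[image_a, image_b],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (2 images, 1 text) image-text similarity scores
    return int(out.logits_per_image.squeeze(-1).argmax().item())
```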

Other: Neural Networks | Deep Learning | Models | Modeling (8 papers)

【1】 StarEnhancer: Learning Real-Time and Style-Aware Image Enhancement

Authors: Yuda Song, Hui Qian, Xin Du
Affiliations: College of Computer Science and Technology, Zhejiang University, Hangzhou, China; College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China
Link: https://arxiv.org/abs/2107.12898
Abstract: Image enhancement is a subjective process whose targets vary with user preferences. In this paper, we propose a deep learning-based image enhancement method covering multiple tonal styles using only a single model, dubbed StarEnhancer. It can transform an image from one tonal style to another, even if that style is unseen. With a simple one-time setting, users can customize the model to make the enhanced images more in line with their aesthetics. To make the method more practical, we propose a well-designed enhancer that can process a 4K-resolution image at over 200 FPS while surpassing contemporaneous single-style image enhancement methods in terms of PSNR, SSIM, and LPIPS. Finally, the proposed enhancement method has good interactivity, allowing the user to fine-tune the enhanced image with intuitive options.

【2】 RGL-NET: A Recurrent Graph Learning framework for Progressive Part Assembly

Authors: Abhinav Narayan Harish, Rajendra Nagar, Shanmuganathan Raman
Affiliations: Indian Institute of Technology Gandhinagar; Indian Institute of Technology Jodhpur
Note: Accepted to the Winter Conference on Applications of Computer Vision (WACV 2022)
Link: https://arxiv.org/abs/2107.12859
Abstract: Autonomous assembly of objects is an essential task in robotics and 3D computer vision. It has been studied extensively in robotics as a problem of motion planning, actuator control and obstacle avoidance. However, the task of developing a generalized framework for assembly robust to structural variants remains relatively unexplored. In this work, we tackle this problem using a recurrent graph learning framework that considers inter-part relations and progressive updates of the part pose. Our network can learn more plausible predictions of shape structure by accounting for previously assembled parts. Compared to the current state-of-the-art, our network yields up to a 10% improvement in part accuracy and up to a 15% improvement in connectivity accuracy on the PartNet dataset. Moreover, the resulting latent space facilitates exciting applications such as shape recovery from point-cloud components. We conduct extensive experiments to justify our design choices and demonstrate the effectiveness of the proposed framework.

【3】 Deep Reinforcement Learning for L3 Slice Localization in Sarcopenia Assessment 标题:面向肌少症评估的L3切片深度强化学习定位

作者:Othmane Laousy,Guillaume Chassagnon,Edouard Oyallon,Nikos Paragios,Marie-Pierre Revel,Maria Vakalopoulou 链接:https://arxiv.org/abs/2107.12800 摘要:肌肉减少症是一种以肌肉质量和功能减少为特征的疾病。一种定量诊断技术是定位穿过第三腰椎(L3)区域中部的CT切片,并在该层面分割肌肉。在本文中,我们提出了一种深度强化学习方法来精确定位L3 CT切片。我们的方法通过奖励激励强化学习智能体去发现正确的位置。具体地说,我们训练一个深度Q网络来寻找解决该问题的最优策略。训练过程的可视化显示,智能体模仿了经验丰富的放射科医生滚动浏览切片的方式。与其他最先进的基于深度学习的L3定位方法的大量对比实验证明了我们技术的优越性,即使在数据量和标注量有限的情况下也表现良好。 摘要:Sarcopenia is a medical condition characterized by a reduction in muscle mass and function. A quantitative diagnosis technique consists of localizing the CT slice passing through the middle of the third lumbar area (L3) and segmenting muscles at this level. In this paper, we propose a deep reinforcement learning method for accurate localization of the L3 CT slice. Our method trains a reinforcement learning agent by incentivizing it to discover the right position. Specifically, a Deep Q-Network is trained to find the best policy to follow for this problem. Visualizing the training process shows that the agent mimics the scrolling of an experienced radiologist. Extensive experiments against other state-of-the-art deep learning based methods for L3 localization prove the superiority of our technique which performs well even with limited amount of data and annotations.
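
下面给出一个极简的DQN草图,说明“智能体在体数据中上下移动以定位目标切片”的训练思路(网络结构、动作空间均为假设,并非论文的原始配置):

```python
import random
import torch
import torch.nn as nn

class SliceDQN(nn.Module):
    """极简DQN:输入当前CT切片,输出三个动作(上移/下移/停止)的Q值。"""
    def __init__(self, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_actions))                # 每个动作的Q值

    def forward(self, slice_img):                    # (B,1,H,W) 当前切片
        return self.net(slice_img)

def epsilon_greedy(q_net, state, eps=0.1):
    """ε-贪婪策略:以小概率随机探索,否则取Q值最大的动作。"""
    if random.random() < eps:
        return random.randrange(3)
    with torch.no_grad():
        return int(q_net(state).argmax(dim=1).item())

# 用法:a = epsilon_greedy(SliceDQN(), torch.randn(1, 1, 128, 128))
```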

【4】 Continual Learning with Neuron Activation Importance 标题:基于神经元激活重要性的持续学习

作者:Sohee Kim,Seungkyu Lee 机构:Kyunghee University, Department of Computer Engineering, Yongin, Republic of Korea 链接:https://arxiv.org/abs/2107.12657 摘要:持续学习是一种在多个连续任务上进行在线学习的范式。持续学习的一个关键障碍是:网络在学习新任务时必须保持对旧任务的知识,却无法访问旧任务的任何数据。本文提出了一种基于神经元激活重要性的正则化方法,实现与任务顺序无关的稳定持续学习。我们在现有基准数据集上进行了全面实验,不仅评估了该方法在提升分类精度的同时所表现出的稳定性与可塑性,还评估了其性能对任务顺序变化的鲁棒性。 摘要:Continual learning is a concept of online learning with multiple sequential tasks. One of the critical barriers of continual learning is that a network should learn a new task keeping the knowledge of old tasks without access to any data of the old tasks. In this paper, we propose a neuron activation importance-based regularization method for stable continual learning regardless of the order of tasks. We conduct comprehensive experiments on existing benchmark data sets to evaluate not just the stability and plasticity of our method with improved classification accuracy also the robustness of the performance along the changes of task order.
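
这类方法的核心通常是一个以重要性为权重的二次正则项。下面给出一个极简示意(重要性用输出范数对参数的敏感度近似,属MAS一类的做法;细节为假设,并非论文的原始实现):

```python
import torch

def activation_importance(model, loader):
    """用输出L2范数对参数的敏感度近似重要性(MAS风格的近似,仅为示意)。"""
    imp = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, _ in loader:
        model.zero_grad()
        model(x).norm(2).backward()           # 输出范数对参数求梯度
        for n, p in model.named_parameters():
            if p.grad is not None:
                imp[n] += p.grad.abs()
        n_batches += 1
    return {n: v / max(n_batches, 1) for n, v in imp.items()}

def importance_penalty(model, old_params, importance, lam=1.0):
    """以重要性为权重惩罚对旧任务关键参数的改动;总损失=新任务损失+该项。"""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (importance[n] * (p - old_params[n]) ** 2).sum()
    return lam * loss
```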

【5】 Co-Transport for Class-Incremental Learning 标题:面向类增量学习的协同迁移

作者:Da-Wei Zhou,Han-Jia Ye,De-Chuan Zhan 机构:State Key Laboratory for Novel Software Technology, Nanjing University 备注:Accepted to ACM Multimedia 2021 链接:https://arxiv.org/abs/2107.12654 摘要:传统的学习系统是在封闭世界中针对固定数量的类别训练的,且需要预先收集好数据集。然而,现实应用中经常会出现新的类别,应当被增量地学习。例如,在电子商务中每天都会出现新类型的产品,在社交媒体社区中新话题也层出不穷。在这种情况下,增量模型应当一次学习若干新类而不遗忘旧类。我们发现增量学习中新旧类别之间存在很强的相关性,可用于关联并相互促进不同的学习阶段。因此,我们提出了面向类增量学习的协同迁移(COIL),它利用类别间的语义关系来学习跨增量任务的关联。具体来说,协同迁移包含两个方面:前瞻迁移试图用最优传输得到的知识扩充旧分类器,实现快速的模型自适应;回溯迁移旨在把新类分类器向后迁移为旧类分类器,以克服遗忘。借助这两种迁移,COIL能高效地适应新任务,并稳定地抵抗遗忘。在基准数据集和真实多媒体数据集上的实验验证了所提方法的有效性。 摘要:Traditional learning systems are trained in closed-world for a fixed number of classes, and need pre-collected datasets in advance. However, new classes often emerge in real-world applications and should be learned incrementally. For example, in electronic commerce, new types of products appear daily, and in a social media community, new topics emerge frequently. Under such circumstances, incremental models should learn several new classes at a time without forgetting. We find a strong correlation between old and new classes in incremental learning, which can be applied to relate and facilitate different learning stages mutually. As a result, we propose CO-transport for class Incremental Learning (COIL), which learns to relate across incremental tasks with the class-wise semantic relationship. In detail, co-transport has two aspects: prospective transport tries to augment the old classifier with optimal transported knowledge as fast model adaptation. Retrospective transport aims to transport new class classifiers backward as old ones to overcome forgetting. With these transports, COIL efficiently adapts to new tasks, and stably resists forgetting. Experiments on benchmark and real-world multimedia datasets validate the effectiveness of our proposed method.
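
协同迁移的关键一步是求解一个类别间的最优传输计划。下面给出一个极简的Sinkhorn迭代草图,并演示如何用传输计划把旧类权重加权组合成新类分类器的初始化(语义嵌入等均为随机占位的假设,并非论文的完整算法):

```python
import torch

def sinkhorn(cost, eps=0.1, n_iters=50):
    """熵正则化最优传输的Sinkhorn迭代,返回传输计划P。"""
    m, n = cost.shape
    K = torch.exp(-cost / eps)                       # Gibbs核
    a, b = torch.full((m,), 1.0 / m), torch.full((n,), 1.0 / n)
    v = torch.ones(n)
    for _ in range(n_iters):                         # 交替缩放行/列
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)       # P = diag(u) K diag(v)

# 用法示意:以新旧类语义嵌入的距离为代价,把旧类权重“搬运”给新类做初始化
old_w = torch.randn(10, 512)                          # 10个旧类的分类器权重
sem_old, sem_new = torch.randn(10, 300), torch.randn(5, 300)  # 假设的语义嵌入
P = sinkhorn(torch.cdist(sem_old, sem_new))           # (10, 5) 传输计划
new_w_init = (P / P.sum(0, keepdim=True)).t() @ old_w # 新类权重=旧权重的加权组合
```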

【6】 Identify Apple Leaf Diseases Using Deep Learning Algorithm 标题:基于深度学习算法的苹果叶部病害识别

作者:Daping Zhang,Hongyu Yang,Jiayu Cao 链接:https://arxiv.org/abs/2107.12598 摘要:农业是一个国家社会经济中的重要产业。然而,病虫害给农业生产造成了巨大减产,而指导农民规避这类灾害的手段却不足。为了解决这个问题,我们通过建立分类模型,将CNN应用于植物病害识别。在包含3642幅苹果叶片图像的数据集上,为了节省训练时间,我们采用了基于卷积神经网络(CNN)与Fastai框架的预训练图像分类模型ResNet34。总体分类准确率为93.765%。 摘要:Agriculture is an essential industry in both the society and economy of a country. However, the pests and diseases cause a great amount of reduction in agricultural production while there is not sufficient guidance for farmers to avoid this disaster. To address this problem, we apply CNNs to plant disease recognition by building a classification model. Within the dataset of 3,642 images of apple leaves, We use a pre-trained image classification model ResNet34 based on a Convolutional neural network (CNN) with the Fastai framework in order to save the training time. Overall, the accuracy of classification is 93.765%.
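
按论文描述的“fastai + 预训练ResNet34”流程,一个极简训练脚本大致如下(数据目录、轮数等均为假设,并非论文的原始配置;vision_learner为fastai v2的接口):

```python
# 一个极简示意,假设数据按“每类一个文件夹”的方式组织
from fastai.vision.all import *

path = Path("apple_leaves")                       # 假设的数据目录
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, item_tfms=Resize(224))
learn = vision_learner(dls, resnet34, metrics=accuracy)  # 预训练ResNet34
learn.fine_tune(5)                                # 迁移学习微调,轮数为假设
```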

【7】 Towards Efficient Tensor Decomposition-Based DNN Model Compression with Optimization Framework 标题:优化框架下基于张量分解的DNN模型压缩

作者:Miao Yin,Yang Sui,Siyu Liao,Bo Yuan 机构:Department of ECE, Rutgers University,Amazon 备注:This paper was accepted to CVPR'21 链接:https://arxiv.org/abs/2107.12422 摘要:诸如张量链(Tensor Train, TT)和张量环(Tensor Ring, TR)之类的高级张量分解已被广泛研究用于深度神经网络(DNN)模型压缩,尤其是递归神经网络(RNN)。然而,使用TT/TR压缩卷积神经网络(CNN)往往会带来明显的精度损失。本文提出了一个基于交替方向乘子法(ADMM)的张量分解模型压缩系统化框架。通过将基于TT分解的模型压缩表述为一个带张量秩约束的优化问题,我们利用ADMM技术以迭代方式系统地求解该问题。在此过程中,整个DNN模型以原始结构而非TT格式进行训练,但逐渐获得所期望的低张量秩特性。随后,我们将这个未压缩的模型分解为TT格式并微调,最终得到高精度的TT格式DNN模型。我们的框架非常通用,同时适用于CNN和RNN,并且可以方便地修改以适配其他张量分解方法。我们在不同DNN模型上针对图像分类和视频识别任务评估了所提框架。实验结果表明,基于ADMM的TT格式模型在保持高精度的同时具有很高的压缩性能。值得注意的是,在CIFAR-100上,在2.3倍和2.4倍压缩比下,我们的模型的top-1精度分别比原始ResNet-20和ResNet-32高1.96%和2.21%。在ImageNet上压缩ResNet-18时,我们的模型在不损失精度的情况下将FLOPs降低了2.47倍。 摘要:Advanced tensor decomposition, such as Tensor train (TT) and Tensor ring (TR), has been widely studied for deep neural network (DNN) model compression, especially for recurrent neural networks (RNNs). However, compressing convolutional neural networks (CNNs) using TT/TR always suffers significant accuracy loss. In this paper, we propose a systematic framework for tensor decomposition-based model compression using Alternating Direction Method of Multipliers (ADMM). By formulating TT decomposition-based model compression to an optimization problem with constraints on tensor ranks, we leverage ADMM technique to systemically solve this optimization problem in an iterative way. During this procedure, the entire DNN model is trained in the original structure instead of TT format, but gradually enjoys the desired low tensor rank characteristics. We then decompose this uncompressed model to TT format and fine-tune it to finally obtain a high-accuracy TT-format DNN model. Our framework is very general, and it works for both CNNs and RNNs, and can be easily modified to fit other tensor decomposition approaches. We evaluate our proposed framework on different DNN models for image classification and video recognition tasks. Experimental results show that our ADMM-based TT-format models demonstrate very high compression performance with high accuracy. Notably, on CIFAR-100, with 2.3X and 2.4X compression ratios, our models have 1.96% and 2.21% higher top-1 accuracy than the original ResNet-20 and ResNet-32, respectively. For compressing ResNet-18 on ImageNet, our model achieves 2.47X FLOPs reduction without accuracy loss.
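
下面用矩阵情形勾勒ADMM迭代的骨架:Z步把W+U投影到低秩集合(此处以SVD截断代替TT-SVD,仅为示意),再更新对偶变量。均为假设性简化,并非论文的完整算法:

```python
import torch

def admm_low_rank_step(W, U, rank):
    """ADMM的一次迭代(Z步与对偶更新)。实际方法在TT秩约束集上投影,
    此处用矩阵SVD截断代替TT-SVD,仅示意迭代骨架。"""
    A = W + U
    Uo, S, Vh = torch.linalg.svd(A, full_matrices=False)
    Z_new = Uo[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]  # 低秩投影
    U_new = U + W - Z_new                                       # 对偶变量更新
    return Z_new, U_new

# W步则在原结构下用SGD最小化 任务损失 + rho/2 * ||W - Z + U||^2,
# 训练结束后再把W分解为TT格式并微调。
W = torch.randn(64, 64)
U = torch.zeros_like(W)
Z, U = admm_low_rank_step(W, U, rank=8)
```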

【8】 SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery 标题:SaRNet:一个用于深度学习辅助卫星影像搜救的数据集

作者:Michael Thoreau,Frazer Wilson 机构:Department of Electrical and Computer Engineering, New York University, No Affiliation 链接:https://arxiv.org/abs/2107.12469 摘要:近年来,随着几个新的星座投入使用,获取高分辨率卫星图像的机会急剧增加。高重访频率与分辨率的提升,使卫星图像的应用范围扩展到了人道主义救援乃至搜救(SaR)等领域。我们提出了一个面向深度学习辅助SaR的新型遥感目标检测数据集。该数据集仅包含在一次真实SaR响应行动中被确定为潜在目标的小目标。我们评估了主流目标检测模型在该数据集上的表现,作为后续研究的基线。我们还提出了一个新的目标检测指标,专为深度学习辅助SaR场景设计。 摘要:Access to high resolution satellite imagery has dramatically increased in recent years as several new constellations have entered service. High revisit frequencies as well as improved resolution has widened the use cases of satellite imagery to areas such as humanitarian relief and even Search and Rescue (SaR). We propose a novel remote sensing object detection dataset for deep learning assisted SaR. This dataset contains only small objects that have been identified as potential targets as part of a live SaR response. We evaluate the application of popular object detection models to this dataset as a baseline to inform further research. We also propose a novel object detection metric, specifically designed to be used in a deep learning assisted SaR setting.

其他(5篇)

【1】 Improving ClusterGAN Using Self-Augmented Information Maximization of Disentangling Latent Spaces 标题:利用解缠潜在空间的自增强信息最大化改进ClusterGAN

作者:Tanmoy Dam,Sreenatha G. Anavatti,Hussein A. Abbass 机构:School of Engineering and Information Technology, University of New South Wales Canberra, Australia. 备注:This paper is under review to IEEE TNNLS 链接:https://arxiv.org/abs/2107.12706 摘要:生成对抗网络中的潜在空间聚类(ClusterGAN)方法在高维数据上取得了成功。然而,该方法假设模式生成过程中先验服从均匀分布,这在真实数据中是一个限制性假设,会导致生成模式的多样性损失。本文提出自增强信息最大化改进型ClusterGAN(SIMI-ClusterGAN),从数据中学习有区分度的先验。所提出的SIMI-ClusterGAN由四个深度神经网络组成:自增强先验网络、生成器、判别器和聚类推理自编码器。该方法在七个基准数据集上得到了验证,性能优于现有方法。为了展示SIMI-ClusterGAN在不平衡数据集上的优越性,我们在MNIST数据集上讨论了单类不平衡和三类不平衡两种不平衡情形,结果凸显了SIMI-ClusterGAN的优势。 摘要:The Latent Space Clustering in Generative adversarial networks (ClusterGAN) method has been successful with high-dimensional data. However, the method assumes uniformly distributed priors during the generation of modes, which is a restrictive assumption in real-world data and causes loss of diversity in the generated modes. In this paper, we propose self-augmentation information maximization improved ClusterGAN (SIMI-ClusterGAN) to learn the distinctive priors from the data. The proposed SIMI-ClusterGAN consists of four deep neural networks: self-augmentation prior network, generator, discriminator and clustering inference autoencoder. The proposed method has been validated using seven benchmark data sets and has shown improved performance over state-of-the-art methods. To demonstrate the superiority of SIMI-ClusterGAN performance on imbalanced dataset, we have discussed two imbalanced conditions on MNIST datasets with one-class imbalance and three classes imbalanced cases. The results highlight the advantages of SIMI-ClusterGAN.
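
按摘要所述的四个子网络,可以勾勒出如下的模块骨架(各网络的层数与维度均为假设,仅示意整体组成,并非论文的原始实现):

```python
import torch
import torch.nn as nn

class SIMIClusterGAN(nn.Module):
    """四个子网络的骨架示意:自增强先验网络、生成器、判别器、聚类推理自编码器。"""
    def __init__(self, z_dim=64, n_clusters=10, x_dim=784):
        super().__init__()
        self.prior = nn.Sequential(            # 自增强先验:从数据学习簇先验
            nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, n_clusters))
        self.gen = nn.Sequential(              # 生成器:潜变量(z,簇one-hot)→样本
            nn.Linear(z_dim + n_clusters, 256), nn.ReLU(), nn.Linear(256, x_dim))
        self.disc = nn.Sequential(             # 判别器:真/假打分
            nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
        self.enc = nn.Sequential(              # 聚类推理自编码器:样本→潜空间
            nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim + n_clusters))

    def generate(self, z, y):
        """由连续潜变量z与簇one-hot y生成样本。"""
        return self.gen(torch.cat([z, y], dim=1))
```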

【2】 Vision-Guided Forecasting -- Visual Context for Multi-Horizon Time Series Forecasting 标题:视觉引导预测--面向多时域时间序列预测的视觉上下文

作者:Eitan Kosman,Dotan Di Castro 机构:Technion - Israel Institute of Technology, Bosch Center for Artificial Intelligence 链接:https://arxiv.org/abs/2107.12674 摘要:近年来,自动驾驶因其有可能改变我们的通勤方式而备受关注。人们已为估计车辆状态付出了大量努力。与此同时,学习对车辆状态进行超前预测会带来新的能力,例如预判危险情况。此外,通过学习预测由多个时域(horizon)表达的更丰富上下文,预测还带来了新的监督机会。直观地说,来自前视摄像头的视频流是必要的,因为它编码了前方道路的信息;而车辆状态的历史轨迹则提供了更多上下文。在本文中,我们融合这两种模态来处理车辆状态的多时域预测。我们设计并实验了3种端到端架构,利用三维卷积提取视觉特征,并利用一维卷积从速度和转向角轨迹中提取特征。为了证明方法的有效性,我们在两个公开的真实世界数据集Comma2k19和Udacity challenge上进行了广泛实验。我们证明能够将车辆状态预测到不同的时域,同时在相关的驾驶状态估计任务上优于当前最先进的结果。我们检验了视觉特征的贡献,发现在Udacity和Comma2k19数据集上,使用视觉特征的模型的误差分别只有不使用这些特征的模型误差的56.6%和66.9%。 摘要:Autonomous driving gained huge traction in recent years, due to its potential to change the way we commute. Much effort has been put into trying to estimate the state of a vehicle. Meanwhile, learning to forecast the state of a vehicle ahead introduces new capabilities, such as predicting dangerous situations. Moreover, forecasting brings new supervision opportunities by learning to predict richer a context, expressed by multiple horizons. Intuitively, a video stream originated from a front-facing camera is necessary because it encodes information about the upcoming road. Besides, historical traces of the vehicle's states give more context. In this paper, we tackle multi-horizon forecasting of vehicle states by fusing the two modalities. We design and experiment with 3 end-to-end architectures that exploit 3D convolutions for visual features extraction and 1D convolutions for features extraction from speed and steering angle traces. To demonstrate the effectiveness of our method, we perform extensive experiments on two publicly available real-world datasets, Comma2k19 and the Udacity challenge. We show that we are able to forecast a vehicle's state to various horizons, while outperforming the current state-of-the-art results on the related task of driving state estimation. We examine the contribution of vision features, and find that a model fed with vision features achieves an error that is 56.6% and 66.9% of the error of a model that doesn't use those features, on the Udacity and Comma2k19 datasets respectively.
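
下面是一个“3D卷积处理视频 + 1D卷积处理状态轨迹、再融合回归多时域状态”的极简草图(通道数、时域数等均为假设,并非论文三种架构中任何一种的精确复现):

```python
import torch
import torch.nn as nn

class FusionForecaster(nn.Module):
    """3D卷积提取前视视频特征,1D卷积提取速度/转向角轨迹特征,
    两者拼接后回归多个时域的车辆状态。"""
    def __init__(self, horizons=5, state_dim=2):
        super().__init__()
        self.video = nn.Sequential(                  # 输入 (B,3,T,H,W)
            nn.Conv3d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.trace = nn.Sequential(                  # 输入 (B,2,T) 速度+转向角
            nn.Conv1d(2, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.head = nn.Linear(32 + 32, horizons * state_dim)
        self.horizons, self.state_dim = horizons, state_dim

    def forward(self, clip, trace):
        z = torch.cat([self.video(clip), self.trace(trace)], dim=1)  # 模态融合
        return self.head(z).view(-1, self.horizons, self.state_dim)

# 用法:FusionForecaster()(torch.randn(2,3,8,64,64), torch.randn(2,2,20)) -> (2,5,2)
```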

【3】 Computer Vision-Based Guidance Assistance Concept for Plowing Using RGB-D Camera 标题:基于计算机视觉的RGB-D相机辅助犁耕方案

作者:Erkin Türköz,Ertug Olcay,Timo Oksanen 机构:Technical University of Munich, School of Life Sciences, Chair of Agrimechatronics, Freising, Germany, © , IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including 备注:Accepted to be published in Proceedings of the 2021 IEEE International Conference on Imaging Systems and Techniques, August 24-26 2021 链接:https://arxiv.org/abs/2107.12646 摘要:提出了一种基于计算机视觉的农用车辅助导航系统的概念,以提高农用车的犁耕精度,减轻驾驶员在长期耕作作业中的认知负担。在许多国家,耕作是一种常见的农业实践,为种植准备土壤,它可以在春季和秋季进行。由于犁耕作业需要较高的牵引力,导致能耗增加。此外,由于不必要的操纵而导致的较长运行时间会导致较高的燃油消耗。为了给拖拉机驾驶员和控制单元提供必要的信息,提出了一种基于RGB-D摄像机的犁沟检测系统。 摘要:This paper proposes a concept of computer vision-based guidance assistance for agricultural vehicles to increase the accuracy in plowing and reduce driver's cognitive burden in long-lasting tillage operations. Plowing is a common agricultural practice to prepare the soil for planting in many countries and it can take place both in the spring and the fall. Since plowing operation requires high traction forces, it causes increased energy consumption. Moreover, longer operation time due to unnecessary maneuvers leads to higher fuel consumption. To provide necessary information for the driver and the control unit of the tractor, a first concept of furrow detection system based on an RGB-D camera was developed.

【4】 CalCROP21: A Georeferenced multi-spectral dataset of Satellite Imagery and Crop Labels 标题:CalCROP21:卫星图像和作物标签的地理参考多光谱数据集

作者:Rahul Ghosh,Praveen Ravirathinam,Xiaowei Jia,Ankush Khandelwal,David Mulla,Vipin Kumar 机构:University of Minnesota, Minneapolis, MN, USA, University of Pittsburgh, Pittsburgh, PA, USA, St. Paul, MN, USA 备注:13 pages; 11 figures 链接:https://arxiv.org/abs/2107.12499 摘要:测绘和监测作物是实现农业可持续集约化、解决全球粮食安全问题的关键一步。像ImageNet那样彻底改变了计算机视觉应用的数据集,同样可以加速新型作物制图技术的发展。目前,美国农业部(USDA)每年发布耕地数据层(CDL),其中包含覆盖全美、分辨率为30m的作物标签。虽然CDL代表了当前最高水平并被广泛用于许多农业应用,但它存在不少局限(例如像素化错误、沿袭自既往错误的标签,以及类别标签缺少配套的输入影像)。在这项工作中,我们利用基于Google Earth Engine的鲁棒图像处理管道和一种新的基于注意力的时空语义分割算法STATT,为加州中央山谷地区的多种作物创建了一个10m空间分辨率的新语义分割基准数据集,称为CalCROP21。STATT使用重采样(插值)后的CDL标签进行训练,但通过利用Sentinel-2多光谱影像序列中的空间和时间模式来有效捕获作物间的物候差异,并利用注意力降低云和其他大气干扰的影响,从而能够生成优于CDL的预测。我们还进行了全面评估,表明与重采样的CDL标签相比,STATT取得了显著更好的结果。我们已经发布了该数据集以及用于生成基准数据集的处理管道代码。 摘要:Mapping and monitoring crops is a key step towards sustainable intensification of agriculture and addressing global food security. A dataset like ImageNet that revolutionized computer vision applications can accelerate development of novel crop mapping techniques. Currently, the United States Department of Agriculture (USDA) annually releases the Cropland Data Layer (CDL) which contains crop labels at 30m resolution for the entire United States of America. While CDL is state of the art and is widely used for a number of agricultural applications, it has a number of limitations (e.g., pixelated errors, labels carried over from previous errors and absence of input imagery along with class labels). In this work, we create a new semantic segmentation benchmark dataset, which we call CalCROP21, for the diverse crops in the Central Valley region of California at 10m spatial resolution using a Google Earth Engine based robust image processing pipeline and a novel attention based spatio-temporal semantic segmentation algorithm STATT. STATT uses re-sampled (interpolated) CDL labels for training, but is able to generate a better prediction than CDL by leveraging spatial and temporal patterns in Sentinel2 multi-spectral image series to effectively capture phenologic differences amongst crops and uses attention to reduce the impact of clouds and other atmospheric disturbances. We also present a comprehensive evaluation to show that STATT has significantly better results when compared to the resampled CDL labels. We have released the dataset and the processing pipeline code for generating the benchmark dataset.
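
STATT用注意力来降低受云污染时相的影响,其核心思想可以用“时间维注意力加权汇聚”来示意(以下结构为假设,并非STATT的原始实现):

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """对多时相Sentinel-2特征做逐像素的时间注意力加权汇聚,
    让模型学会降低被云污染时相的权重。"""
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Conv2d(dim, 1, 1)            # 逐像素、逐时相打分

    def forward(self, feats):                        # (B,T,C,H,W)
        B, T, C, H, W = feats.shape
        s = self.score(feats.flatten(0, 1)).view(B, T, 1, H, W)
        w = s.softmax(dim=1)                         # 时间维上的注意力权重
        return (w * feats).sum(dim=1)                # 汇聚为 (B,C,H,W)

# 用法:TemporalAttentionPool()(torch.randn(2, 12, 64, 32, 32)) -> (2, 64, 32, 32)
```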

【5】 Circular-Symmetric Correlation Layer based on FFT 标题:基于FFT的圆对称相关层

作者:Bahar Azari,Deniz Erdogmus 机构:Department of Electrical & Computer Engineering, Northeastern University, USA, Boston, MA , Deniz Erdo˘gmu¸s 链接:https://arxiv.org/abs/2107.12480 摘要:尽管标准的平面卷积神经网络取得了巨大的成功,但对于分析任意弯曲流形(如圆柱体)上的信号,它们并不是最有效的选择。当人们对这些信号进行平面投影时,问题就出现了,并且不可避免地导致它们在有价值信息的地方被扭曲或破坏。基于连续群$S^1\times\mathbb{R}$上旋转平移等变相关的形式,我们提出了一种圆对称相关层(CCL),并利用著名的快速傅立叶变换(FFT)算法有效地实现了它。我们展示了一个装有CCL的通用网络在各种识别和分类任务和数据集上的性能分析。CCL的PyTorch包实现是在线提供的。 摘要:Despite the vast success of standard planar convolutional neural networks, they are not the most efficient choice for analyzing signals that lie on an arbitrarily curved manifold, such as a cylinder. The problem arises when one performs a planar projection of these signals and inevitably causes them to be distorted or broken where there is valuable information. We propose a Circular-symmetric Correlation Layer (CCL) based on the formalism of roto-translation equivariant correlation on the continuous group $S^1 \times \mathbb{R}$, and implement it efficiently using the well-known Fast Fourier Transform (FFT) algorithm. We showcase the performance analysis of a general network equipped with CCL on various recognition and classification tasks and datasets. The PyTorch package implementation of CCL is provided online.
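
根据卷积定理,圆周方向上的循环相关可以在频域用一次共轭相乘完成,这正是CCL能借助FFT高效实现的原因。下面是一维单通道的极简示意(并非完整的CCL层):

```python
import torch

def circular_correlation(x, k):
    """利用FFT在圆周方向上做循环互相关:相关 = IFFT(X · conj(K))。
    对应在 S^1 方向保持平移等变;此为单通道一维示意。"""
    X = torch.fft.rfft(x)                            # 信号的实数FFT
    K = torch.fft.rfft(k, n=x.size(-1))              # 核零填充到信号长度
    return torch.fft.irfft(X * K.conj(), n=x.size(-1))

# 用法:x为绕圆柱一周采样的信号,k为滤波核;输出与x等长,滤波在圆周上循环进行
x = torch.randn(360)
k = torch.randn(31)
y = circular_correlation(x, k)
```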
