
计算机视觉学术速递[6.17]

公众号-arXiv每日学术速递
发布2021-07-02 19:03:09

访问www.arxivdaily.com获取含摘要速递,涵盖CS|物理|数学|经济|统计|金融|生物|电气领域,更有搜索、收藏、发帖等功能!点击阅读原文即可访问

cs.CV 方向,今日共计66篇

Transformer(2篇)

【1】 Shuffle Transformer with Feature Alignment for Video Face Parsing 标题:用于视频人脸解析的带特征对齐的Shuffle Transformer

作者:Rui Zhang,Yang Han,Zilong Huang,Pei Cheng,Guozhong Luo,Gang Yu,Bin Fu 备注:technical report 链接:https://arxiv.org/abs/2106.08650 摘要:这是一份简短的技术报告,介绍了TCParser团队在CVPR 2021第三届Person in Context (PIC)研讨会与挑战赛短视频人脸解析赛道中的解决方案。本文介绍了一个强大的主干网络,即基于跨窗口的Shuffle Transformer,用于呈现准确的人脸解析表示。为了进一步获得更精细的分割结果,特别是在边缘处,我们引入了特征对齐聚合(FAA)模块,它能有效地缓解多分辨率特征聚合引起的特征错位问题。得益于更强的主干和更好的特征聚合,该方法在短视频人脸解析赛道中获得86.9519%的分数,排名第一。 摘要:This is a short technical report introducing the solution of the Team TCParser for Short-video Face Parsing Track of The 3rd Person in Context (PIC) Workshop and Challenge at CVPR 2021. In this paper, we introduce a strong backbone which is cross-window based Shuffle Transformer for presenting accurate face parsing representation. To further obtain the finer segmentation results, especially on the edges, we introduce a Feature Alignment Aggregation (FAA) module. It can effectively relieve the feature misalignment issue caused by multi-resolution feature aggregation. Benefiting from the stronger backbone and better feature aggregation, the proposed method achieves 86.9519% score in the Short-video Face Parsing track of the 3rd Person in Context (PIC) Workshop and Challenge, ranked the first place.

【2】 Scene Transformer: A unified multi-task model for behavior prediction and planning 标题:Scene Transformer:一种统一的多任务行为预测与规划模型

作者:Jiquan Ngiam,Benjamin Caine,Vijay Vasudevan,Zhengdong Zhang,Hao-Tien Lewis Chiang,Jeffrey Ling,Rebecca Roelofs,Alex Bewley,Chenxi Liu,Ashish Venugopal,David Weiss,Ben Sapp,Zhifeng Chen,Jonathon Shlens 链接:https://arxiv.org/abs/2106.08417 摘要:预测多智能体的未来运动对于动态环境下的规划是必要的。这个任务对于自动驾驶来说是一个挑战,因为代理(例如车辆和行人)及其相关行为可能是多样的,并且相互影响。以前的工作主要集中在首先根据所有过去的运动预测每个代理的独立未来,然后根据这些独立预测进行规划。然而,针对固定预测的规划可能会由于无法表示不同代理之间未来交互的可能性而导致次优规划。在这项工作中,我们建立了一个模型来预测所有代理在真实驾驶环境中的行为。受最近语言建模方法的启发,我们使用掩蔽策略作为对模型的查询,使我们能够调用单个模型以多种方式预测代理行为,例如潜在地以自主车辆的目标或未来完整轨迹或环境中其他代理的行为为条件。我们的模型架构融合了一个统一的Transformer架构中的异构世界状态,通过对道路元素、代理交互和时间步长的关注。我们评估了我们的方法对自主驾驶数据集的行为预测,并取得了最先进的性能。我们的工作表明,在一个统一的体系结构中用掩蔽策略来描述行为预测问题可以使我们拥有一个能够有效执行多个运动预测和规划相关任务的模型。 摘要:Predicting the future motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g., vehicles and pedestrians) and their associated behaviors may be diverse and influence each other. Most prior work has focused on first predicting independent futures for each agent based on all past motion, and then planning against these independent predictions. However, planning against fixed predictions can suffer from the inability to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly in real-world driving environments in a unified manner. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture fuses heterogeneous world state in a unified Transformer architecture by employing attention across road elements, agent interactions and time steps. We evaluate our approach on autonomous driving datasets for behavior prediction, and achieve state-of-the-art performance. Our work demonstrates that formulating the problem of behavior prediction in a unified architecture with a masking strategy may allow us to have a single model that can perform multiple motion prediction and planning related tasks effectively.
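论文把『查询』表述为对未来轨迹的掩蔽(masking)。下面是一个极简的示意性草图(非论文官方实现,函数名与 mode 取值均为说明而设的假设),展示同一个模型如何仅通过切换掩码,在运动预测与条件规划等任务间复用:

```python
import torch

def build_query_mask(num_agents: int, horizon: int, mode: str = "predict",
                     av_index: int = 0) -> torch.Tensor:
    """构造掩蔽矩阵:1 表示该 (agent, timestep) 的未来对模型可见,0 表示需要预测。
    mode 取值(示意):
      - "predict":      标准运动预测,所有智能体的未来全部被掩蔽;
      - "condition_av": 以自动驾驶车辆(av_index)的完整未来轨迹为条件;
      - "goal":         仅揭示自动驾驶车辆的终点(目标)。"""
    mask = torch.zeros(num_agents, horizon)
    if mode == "condition_av":
        mask[av_index, :] = 1.0
    elif mode == "goal":
        mask[av_index, -1] = 1.0
    return mask

# 推理时切换掩蔽方式,同一个模型即可完成预测/条件规划等不同任务
print(build_query_mask(4, 8, "goal"))
```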

检测相关(8篇)

【1】 End-to-End Semi-Supervised Object Detection with Soft Teacher 标题:基于软教师的端到端半监督目标检测

作者:Mengde Xu,Zheng Zhang,Han Hu,Jianfeng Wang,Lijuan Wang,Fangyun Wei,Xiang Bai,Zicheng Liu 链接:https://arxiv.org/abs/2106.09018 摘要:与以往较为复杂的多阶段方法相比,本文提出了一种端到端的半监督目标检测方法。端到端的训练在课程学习过程中逐渐提高伪标签的质量,而越来越精确的伪标签又反过来有利于目标检测的训练。在此框架下,我们还提出了两种简单而有效的技术:一种软教师机制,其中每个未标注边界框的分类损失由教师网络产生的分类分数加权;以及一种为框回归学习选择可靠伪框的框抖动(box jittering)方法。在COCO基准上,在不同的标注比例(1%、5%和10%)下,该方法比以往方法有很大的提升。此外,当标注数据量相对较大时,我们的方法同样表现良好。例如,通过利用COCO的123K张未标注图像,它可以将使用完整COCO训练集训练的40.9 mAP基线检测器提升+3.6 mAP,达到44.5 mAP。在最先进的基于Swin Transformer的目标检测器(test-dev上为58.9 mAP)上,它仍能将检测精度显著提高+1.5 mAP,达到60.4 mAP,并将实例分割精度提高+1.2 mAP,达到52.4 mAP,刷新了最新水平。 摘要:This paper presents an end-to-end semi-supervised object detection approach, in contrast to previous more complex multi-stage methods. The end-to-end training gradually improves pseudo label qualities during the curriculum, and the more and more accurate pseudo labels in turn benefit object detection training. We also propose two simple yet effective techniques within this framework: a soft teacher mechanism where the classification loss of each unlabeled bounding box is weighed by the classification score produced by the teacher network; a box jittering approach to select reliable pseudo boxes for the learning of box regression. On COCO benchmark, the proposed approach outperforms previous methods by a large margin under various labeling ratios, i.e. 1\%, 5\% and 10\%. Moreover, our approach proves to perform also well when the amount of labeled data is relatively large. For example, it can improve a 40.9 mAP baseline detector trained using the full COCO training set by +3.6 mAP, reaching 44.5 mAP, by leveraging the 123K unlabeled images of COCO. On the state-of-the-art Swin Transformer-based object detector (58.9 mAP on test-dev), it can still significantly improve the detection accuracy by +1.5 mAP, reaching 60.4 mAP, and improve the instance segmentation accuracy by +1.2 mAP, reaching 52.4 mAP, pushing the new state-of-the-art.
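下面用 PyTorch 给出『软教师』加权分类损失的一个最小草图(仅为示意,加权的具体形式以论文为准;把教师分数直接作为逐框权重是此处的假设):

```python
import torch
import torch.nn.functional as F

def soft_teacher_cls_loss(student_logits, pseudo_labels, teacher_scores):
    """软教师机制示意:对每个未标注候选框的分类损失,
    用教师网络给出的分类(可靠性)分数加权后再归一化。"""
    per_box_loss = F.cross_entropy(student_logits, pseudo_labels,
                                   reduction="none")          # 逐框损失 [N]
    w = teacher_scores                                        # 教师分数 [N]
    return (w * per_box_loss).sum() / w.sum().clamp(min=1e-6)

# 用法示意:8 个候选框、21 类(含背景)
logits = torch.randn(8, 21)
labels = torch.randint(0, 21, (8,))
scores = torch.rand(8)
print(soft_teacher_cls_loss(logits, labels, scores))
```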

【2】 Toward Robotic Weed Control: Detection of Nutsedge Weed in Bermudagrass Turf Using Inaccurate and Insufficient Training Data 标题:走向机器人除草:使用不准确且不充分的训练数据检测狗牙根草坪中的莎草(nutsedge)杂草

作者:Shuangyu Xie,Chengsong Hu,Muthukumar Bagavathiannan,Dezhen Song 链接:https://arxiv.org/abs/2106.08897 摘要:为了实现机器人除草,我们开发了从狗牙根草坪中检测莎草杂草的算法。由于杂草与背景草坪之间的相似性,人工数据标注成本高且容易出错,因此直接应用深度学习方法进行目标检测并不能得到令人满意的结果。基于一种实例检测方法(即Mask R-CNN),我们将合成数据与原始数据相结合来训练网络。我们提出了一种生成高保真合成数据的算法,采用不同层次的标注来降低标注成本。此外,我们构造了一个基于莎草骨架的概率图(NSPM)作为神经网络的输入,以减少对像素级精确标注的依赖。我们还将损失函数从交叉熵改为Kullback-Leibler散度,以适应标注过程中的不确定性。我们实现了该算法,并将其与Faster R-CNN和Mask R-CNN进行了比较。结果表明,我们的设计可以有效地克服训练样本不精确和不足的影响,显著优于Faster R-CNN,漏检率(假阴性率)仅为0.4%。特别是,与原始的Mask R-CNN方法相比,我们的方法在取得更好性能的同时还将标注时间减少了95%。 摘要:To enable robotic weed control, we develop algorithms to detect nutsedge weed from bermudagrass turf. Due to the similarity between the weed and the background turf, manual data labeling is expensive and error-prone. Consequently, directly applying deep learning methods for object detection cannot generate satisfactory results. Building on an instance detection approach (i.e. Mask R-CNN), we combine synthetic data with raw data to train the network. We propose an algorithm to generate high fidelity synthetic data, adopting different levels of annotations to reduce labeling cost. Moreover, we construct a nutsedge skeleton-based probabilistic map (NSPM) as the neural network input to reduce the reliance on pixel-wise precise labeling. We also modify loss function from cross entropy to Kullback-Leibler divergence which accommodates uncertainty in the labeling process. We implement the proposed algorithm and compare it with both Faster R-CNN and Mask R-CNN. The results show that our design can effectively overcome the impact of imprecise and insufficient training sample issues and significantly outperform the Faster R-CNN counterpart with a false negative rate of only 0.4%. In particular, our approach also reduces labeling time by 95% while achieving better performance if comparing with the original Mask R-CNN approach.
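将交叉熵替换为 KL 散度后,监督目标从 one-hot 硬标签变为概率分布(例如来自 NSPM 的逐像素概率)。下面是一个最小示意(张量形状为假设,非论文官方实现):

```python
import torch
import torch.nn.functional as F

# 目标不再是 one-hot,而是软标签分布,以吸收标注过程中的不确定性
def kl_segmentation_loss(logits, soft_target):
    log_prob = F.log_softmax(logits, dim=1)      # 网络预测的对数概率
    return F.kl_div(log_prob, soft_target, reduction="batchmean")

logits = torch.randn(2, 2, 64, 64)               # [B, 类别数, H, W]
soft = torch.rand(2, 2, 64, 64)
soft = soft / soft.sum(dim=1, keepdim=True)      # 归一化为逐像素分布
print(kl_segmentation_loss(logits, soft))
```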

【3】 JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action, Social Group and Activity Detection 标题:JRDB-Act:一个面向时空动作、社交群体和活动检测的大规模多模态数据集

作者:Mahsa Ehsanpour,Fatemeh Saleh,Silvio Savarese,Ian Reid,Hamid Rezatofighi 链接:https://arxiv.org/abs/2106.08827 摘要:大规模视频动作理解数据集的出现,推动了对包含人的视觉场景的理解。然而,在无约束的真实环境中学习识别人类活动,并应对可能高度不平衡、长尾分布的数据,仍然是一个重大挑战,其中一个重要原因是缺乏能反映真实分布的大规模数据集。大多数现有的大规模数据集要么采集自特定或受限的环境(如厨房或房间),要么来自视频分享平台(如YouTube)。在本文中,我们介绍了JRDB-Act,一个作为现有JRDB扩展的多模态数据集,它由一个社交移动机械手采集,反映了大学校园环境中人类日常行为的真实分布。JRDB-Act被密集地标注了原子动作,包含280多万个动作标签,构成了一个大规模的时空动作检测数据集。每个人体边界框都标有一个基于姿态的动作标签和多个(可选的)基于交互的动作标签。此外,JRDB-Act还提供了社交群体识别标注,有助于根据个体在场景中的交互对其进行分组,从而推断其社交活动(每个社交群体中的共同活动)。 摘要:The availability of large-scale video action understanding datasets has facilitated advances in the interpretation of visual scenes containing people. However, learning to recognize human activities in an unconstrained real-world environment, with potentially highly unbalanced and long-tailed distributed data remains a significant challenge, not least owing to the lack of a reflective large-scale dataset. Most existing large-scale datasets are either collected from a specific or constrained environment, e.g. kitchens or rooms, or video sharing platforms such as YouTube. In this paper, we introduce JRDB-Act, a multi-modal dataset, as an extension of the existing JRDB, which is captured by a social mobile manipulator and reflects a real distribution of human daily life actions in a university campus environment. JRDB-Act has been densely annotated with atomic actions, comprises over 2.8M action labels, constituting a large-scale spatio-temporal action detection dataset. Each human bounding box is labelled with one pose-based action label and multiple (optional) interaction-based action labels. Moreover JRDB-Act comes with social group identification annotations conducive to the task of grouping individuals based on their interactions in the scene to infer their social activities (common activities in each social group).

【4】 2nd Place Solution for Waymo Open Dataset Challenge - Real-time 2D Object Detection 标题:Waymo开放数据集挑战赛亚军解决方案-实时2D对象检测

作者:Yueming Zhang,Xiaolin Song,Bing Bai,Tengfei Xing,Chao Liu,Xin Gao,Zhihui Wang,Yawei Wen,Haojin Liao,Guoshan Zhang,Pengfei Xu 链接:https://arxiv.org/abs/2106.08713 摘要:在自动驾驶系统中,从图像中识别车辆、行人和骑自行车的人至关重要。除了预测的高精度外,实时运行的要求也给卷积网络模型带来了新的挑战。本文介绍了一种从图像中实时检测二维物体的方法。我们聚合了几种流行的单级目标检测算法,并独立训练各种输入策略的模型,以获得更好的性能,实现对每个类别的精确多尺度检测,特别是对小目标的检测。对于模型加速,我们利用TensorRT来优化检测管道的推理时间。如排行榜所示,我们提出的检测框架在Waymo开放数据集挑战的实时2D检测轨迹中以75.00%L1 mAP和69.72%L2 mAP排名第二,而我们的框架在Nvidia特斯拉V100 GPU上实现了45.8ms/帧的延迟。 摘要:In an autonomous driving system, it is essential to recognize vehicles, pedestrians and cyclists from images. Besides the high accuracy of the prediction, the requirement of real-time running brings new challenges for convolutional network models. In this report, we introduce a real-time method to detect the 2D objects from images. We aggregate several popular one-stage object detectors and train the models of variety input strategies independently, to yield better performance for accurate multi-scale detection of each category, especially for small objects. For model acceleration, we leverage TensorRT to optimize the inference time of our detection pipeline. As shown in the leaderboard, our proposed detection framework ranks the 2nd place with 75.00% L1 mAP and 69.72% L2 mAP in the real-time 2D detection track of the Waymo Open Dataset Challenges, while our framework achieves the latency of 45.8ms/frame on an Nvidia Tesla V100 GPU.

【5】 FastAno: Fast Anomaly Detection via Spatio-temporal Patch Transformation 标题:FastAno:基于时空图像块变换的快速异常检测

作者:Chaewon Park,MyeongAh Cho,Minhyeok Lee,Sangyoun Lee 备注:2022WACV 链接:https://arxiv.org/abs/2106.08613 摘要:随着对监控视频自动监测需求的不断提高,视频异常检测得到了广泛的关注。特别地,基于预测的方法是研究最多的异常检测方法之一:先在训练集的正常帧上学习,再通过预测测试集中包含异常事件的帧来检测异常。然而,许多预测网络由于使用预训练的光流网络而计算量大,或者由于生成能力过强、连异常也能预测出来,从而无法检测到异常情况。针对这些不足,本文提出了空间旋转变换(SRT)和时间混合变换(TMT),在正常帧立方体内生成不规则的图像块立方体,以增强对正常特征的学习。此外,所提出的图像块变换只在训练阶段使用,使得我们的模型能够在推理过程中快速检测异常帧。我们的模型在三个异常检测基准上进行了评估,达到了有竞争力的精度,并且在速度上超过了以前的所有工作。 摘要:Video anomaly detection has gained significant attention due to the increasing requirements of automatic monitoring for surveillance videos. Especially, the prediction based approach is one of the most studied methods to detect anomalies by predicting frames that include abnormal events in the test set after learning with the normal frames of the training set. However, a lot of prediction networks are computationally expensive owing to the use of pre-trained optical flow networks, or fail to detect abnormal situations because of their strong generative ability to predict even the anomalies. To address these shortcomings, we propose spatial rotation transformation (SRT) and temporal mixing transformation (TMT) to generate irregular patch cuboids within normal frame cuboids in order to enhance the learning of normal features. Additionally, the proposed patch transformation is used only during the training phase, allowing our model to detect abnormal frames at fast speed during inference. Our model is evaluated on three anomaly detection benchmarks, achieving competitive accuracy and surpassing all the previous works in terms of speed.
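空间旋转变换(SRT)的思路可以用几行代码示意:在正常帧立方体内随机选取一个图像块立方体,整体旋转 90 度的倍数。以下是一个极简草图(非官方实现,块大小等参数为假设):

```python
import torch

def spatial_rotation_transform(cuboid, patch=16, k=None):
    """SRT 示意:在 [T, H, W] 的帧立方体中随机选一个图像块立方体并旋转,
    人工制造『不规则』区域,供模型学习正常特征。"""
    t, h, w = cuboid.shape
    y = torch.randint(0, h - patch + 1, (1,)).item()
    x = torch.randint(0, w - patch + 1, (1,)).item()
    k = k or torch.randint(1, 4, (1,)).item()      # 旋转 k * 90 度
    region = cuboid[:, y:y + patch, x:x + patch]
    cuboid[:, y:y + patch, x:x + patch] = torch.rot90(region, k, dims=(1, 2))
    return cuboid

frames = torch.rand(8, 64, 64)                     # 8 帧的正常帧立方体
aug = spatial_rotation_transform(frames.clone())
```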

【6】 Anomaly Detection in Video Sequences: A Benchmark and Computational Model 标题:视频序列中的异常检测:一种基准测试和计算模型

作者:Boyang Wan,Wenhui Jiang,Yuming Fang,Zhiyuan Luo,Guanqun Ding 备注:Publication in IET Image Processing 链接:https://arxiv.org/abs/2106.08570 摘要:异常检测引起了人们的广泛关注。然而,现有的异常检测数据库存在两大问题:一是规模有限;二是训练集只包含表示整段视频中存在异常事件的视频级标签,而缺少精确的时间区间标注。为了解决这些问题,我们提出了一个新的大规模异常检测(LAD)数据库,作为视频序列异常检测的基准,它有两个特点。1)它包含2000个视频序列,包括正常和异常视频片段,涵盖车祸、火灾、暴力等14个异常类别,场景种类繁多,是迄今为止最大的异常分析数据库。2)它提供标注数据,包括视频级标签(异常/正常视频、异常类型)和帧级标签(异常/正常视频帧),以便于异常检测。利用LAD数据库的上述优点,我们进一步将异常检测表述为一个全监督学习问题,并提出了一个多任务深度神经网络来解决它。我们首先利用膨胀三维卷积(I3D)网络获得局部时空上下文特征,然后将其输入一个递归卷积神经网络,以提取全局时空上下文特征。基于全局时空上下文特征,多任务神经网络可以同时计算异常类型和异常得分。实验结果表明,该方法在本数据库和其他公开异常检测数据库上均优于现有的异常检测方法。代码可在 https://github.com/wanboyang/anomaly_detection_LAD2000 获取。 摘要:Anomaly detection has attracted considerable research attention. However, existing anomaly detection databases encounter two major problems. Firstly, they are limited in scale. Secondly, training sets contain only video-level labels indicating the existence of an abnormal event during the full video while lacking annotations of precise time durations. To tackle these problems, we contribute a new Large-scale Anomaly Detection (LAD) database as the benchmark for anomaly detection in video sequences, which is featured in two aspects. 1) It contains 2000 video sequences including normal and abnormal video clips with 14 anomaly categories including crash, fire, violence, etc. with large scene varieties, making it the largest anomaly analysis database to date. 2) It provides the annotation data, including video-level labels (abnormal/normal video, anomaly type) and frame-level labels (abnormal/normal video frame) to facilitate anomaly detection. Leveraging the above benefits from the LAD database, we further formulate anomaly detection as a fully-supervised learning problem and propose a multi-task deep neural network to solve it. We first obtain the local spatiotemporal contextual feature by using an Inflated 3D convolutional (I3D) network. Then we construct a recurrent convolutional neural network fed the local spatiotemporal contextual feature to extract the spatiotemporal contextual feature. With the global spatiotemporal contextual feature, the anomaly type and score can be computed simultaneously by a multi-task neural network. Experimental results show that the proposed method outperforms the state-of-the-art anomaly detection methods on our database and other public databases of anomaly detection. Codes are available at https://github.com/wanboyang/anomaly_detection_LAD2000.

【7】 Detection of Morphed Face Images Using Discriminative Wavelet Sub-bands 标题:基于鉴别小波子带的变形人脸图像检测

作者:Poorya Aghdaie,Baaria Chaudhary,Sobhan Soleymani,Jeremy Dawson,Nasser M. Nasrabadi 链接:https://arxiv.org/abs/2106.08565 摘要:这项工作研究了广为人知的变形(morphing)攻击问题,该问题已在生物特征识别界引起相当多的关注。变形图像暴露了人脸识别系统易被错误接受的弱点,可能造成严重后果,特别是在国家安全应用中。为了检测变形攻击,我们提出了一种基于判别性二维离散小波变换(2D-DWT)的方法。判别性小波子带可以突出真实图像和变形图像之间的不一致性。我们观察到,真实图像中给定子带的熵与变形样本中同一子带的熵之间存在显著差异。考虑到这两个熵值之间的差异,我们计算两个分布(即真实图像与相应变形图像的熵分布)之间的Kullback-Leibler散度。最具判别力的小波子带是那些对应KL散度值最高的子带。据此,就变形检测而言,我们选出22个最具判别力的子带。结果表明,在这22个判别性子带上训练的深度神经网络(DNN)能够精确地检测变形样本。最重要的是,我们通过在VISAPP17、LMA和MorGAN三个数据集上的实验验证了算法的有效性。我们还对子带选择进行了消融研究。 摘要:This work investigates the well-known problem of morphing attacks, which has drawn considerable attention in the biometrics community. Morphed images have exposed face recognition systems' susceptibility to false acceptance, resulting in dire consequences, especially for national security applications. To detect morphing attacks, we propose a method which is based on a discriminative 2D Discrete Wavelet Transform (2D-DWT). A discriminative wavelet sub-band can highlight inconsistencies between a real and a morphed image. We observe that there is a salient discrepancy between the entropy of a given sub-band in a bona fide image, and the same sub-band's entropy in a morphed sample. Considering this dissimilarity between these two entropy values, we find the Kullback-Leibler divergence between the two distributions, namely the entropy of the bona fide and the corresponding morphed images. The most discriminative wavelet sub-bands are those with the highest corresponding KL-divergence values. Accordingly, 22 sub-bands are selected as the most discriminative ones in terms of morph detection. We show that a Deep Neural Network (DNN) trained on the 22 discriminative sub-bands can detect morphed samples precisely. Most importantly, the effectiveness of our algorithm is validated through experiments on three datasets: VISAPP17, LMA, and MorGAN. We also performed an ablation study on the sub-band selection.
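子带选择的流程可以用 PyWavelets 粗略示意:对每幅图像做多级 2D-DWT,统计各子带系数的熵,再比较真实与变形两类图像在每个子带上熵分布的 KL 散度。以下为示意性草图(分箱数、小波基等细节为假设,非论文官方实现):

```python
import numpy as np
import pywt
from scipy.stats import entropy

def subband_entropies(img, wavelet="db4", level=3):
    """对单幅灰度图做多级 2D-DWT,返回每个细节子带系数的香农熵。"""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    ents = []
    for detail in coeffs[1:]:                 # 每级含 (cH, cV, cD) 三个子带
        for band in detail:
            hist, _ = np.histogram(band, bins=64, density=True)
            ents.append(entropy(hist + 1e-12))
    return np.array(ents)

def rank_subbands_by_kl(bona_fide_imgs, morphed_imgs):
    """逐子带比较两类图像的熵分布并计算 KL 散度;
    KL 值最大的子带即最具判别力(论文选出 22 个)。"""
    e_b = np.stack([subband_entropies(x) for x in bona_fide_imgs])
    e_m = np.stack([subband_entropies(x) for x in morphed_imgs])
    kls = []
    for j in range(e_b.shape[1]):
        pb, edges = np.histogram(e_b[:, j], bins=16, density=True)
        pm, _ = np.histogram(e_m[:, j], bins=edges, density=True)
        kls.append(entropy(pb + 1e-12, pm + 1e-12))
    return np.argsort(kls)[::-1]              # 按判别力降序排列的子带索引

# 随机数据演示用法(实际应使用真实/变形人脸图像)
bona = [np.random.rand(64, 64) for _ in range(10)]
morph = [np.random.rand(64, 64) ** 1.5 for _ in range(10)]
print(rank_subbands_by_kl(bona, morph)[:5])
```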

【8】 GKNet: grasp keypoint network for grasp candidates detection 标题:GKNet:面向抓取候选检测的抓取关键点网络

作者:Ruinian Xu,Fu-Jen Chu,Patricio A. Vela 备注:24 pages, 12 figures, 13 tables 链接:https://arxiv.org/abs/2106.08497 摘要:现代的抓取检测方法采用深度学习来实现对传感器和物体模型不确定性的鲁棒性。两种主流方法分别设计抓取质量评分网络或基于锚框的抓取识别网络。本文提出了一种不同的抓取检测思路,将其视为关键点检测:深度网络将每个抓取候选检测为一对关键点,可转换为抓取表示 g = {x, y, w, θ}^T,而不是三个或四个角点。通过将关键点分组成对来降低检测难度,可以提高性能。为了进一步促进关键点之间的依赖关系,通用的非局部(non-local)模块被纳入所提出的学习框架。最后,基于离散和连续方向预测的滤波策略消除了错误对应,进一步提高了抓取检测性能。本文提出的GKNet在Cornell和精简版Jacquard数据集上实现了精度和速度的最佳平衡(分别在41.67和23.26 fps下达到96.9%和98.39%)。随后在机械臂上进行了反映不同干扰源的4类抓取实验:静态抓取、动态抓取、不同相机角度抓取和拣箱(bin picking)。GKNet在静态和动态抓取实验中优于参考基线,同时在不同相机视点和拣箱实验中表现出鲁棒性。结果证实了以下假设:抓取关键点是深度抓取网络的一种有效输出表示,能对预期的干扰因素保持鲁棒性。 摘要:Contemporary grasp detection approaches employ deep learning to achieve robustness to sensor and object model uncertainty. The two dominant approaches design either grasp-quality scoring or anchor-based grasp recognition networks. This paper presents a different approach to grasp detection by treating it as keypoint detection. The deep network detects each grasp candidate as a pair of keypoints, convertible to the grasp representation g = {x, y, w, {\theta}}^T, rather than a triplet or quartet of corner points. Decreasing the detection difficulty by grouping keypoints into pairs boosts performance. To further promote dependencies between keypoints, the general non-local module is incorporated into the proposed learning framework. A final filtering strategy based on discrete and continuous orientation prediction removes false correspondences and further improves grasp detection performance. GKNet, the approach presented here, achieves the best balance of accuracy and speed on the Cornell and the abridged Jacquard dataset (96.9% and 98.39% at 41.67 and 23.26 fps). Follow-up experiments on a manipulator evaluate GKNet using 4 types of grasping experiments reflecting different nuisance sources: static grasping, dynamic grasping, grasping at varied camera angles, and bin picking. GKNet outperforms reference baselines in static and dynamic grasping experiments while showing robustness to varied camera viewpoints and bin picking experiments. The results confirm the hypothesis that grasp keypoints are an effective output representation for deep grasp networks that provide robustness to expected nuisance factors.
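从一对关键点恢复抓取表示 g = (x, y, w, θ) 只是简单的几何运算:中点给出抓取中心,两点距离给出开口宽度,连线方向给出抓取角。下面是一个自包含的小例子:

```python
import math

def keypoints_to_grasp(p_left, p_right):
    """把一对抓取关键点 (x1, y1)、(x2, y2) 转换为 g = (x, y, w, theta)。"""
    (x1, y1), (x2, y2) = p_left, p_right
    x, y = (x1 + x2) / 2.0, (y1 + y2) / 2.0    # 抓取中心 = 两点中点
    w = math.hypot(x2 - x1, y2 - y1)           # 开口宽度 = 两点距离
    theta = math.atan2(y2 - y1, x2 - x1)       # 抓取角 = 连线与 x 轴夹角
    return x, y, w, theta

print(keypoints_to_grasp((10, 20), (30, 20)))  # -> (20.0, 20.0, 20.0, 0.0)
```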

分类|识别相关(6篇)

【1】 Contrastive Learning with Continuous Proxy Meta-Data for 3D MRI Classification 标题:基于连续代理元数据的3D MRI分类对比学习

作者:Benoit Dufumier,Pietro Gori,Julie Victor,Antoine Grigis,Michel Wessa,Paolo Brambilla,Pauline Favre,Mircea Polosan,Colm McDonald,Camille Marie Piguet,Edouard Duchesnay 备注:None 链接:https://arxiv.org/abs/2106.08808 摘要:传统的深度神经网络有监督学习需要大量的标注数据才能收敛到一个好的解。对于三维医学图像,为特定的病理建立一个大规模的同质标注数据集通常是不切实际的。自监督方法提供了一种以无监督方式用神经网络学习图像表示的新途径。特别是,对比学习在视觉任务上(几乎)达到了全监督CNN的表现,显示出巨大的前景。尽管如此,这类方法并没有利用可用的元数据,例如可被视为先验知识的受试者年龄。在这里,我们建议在对比学习框架中,通过引入一种称为y-Aware InfoNCE loss的新损失来利用连续代理元数据。具体来说,我们在预训练过程中改进正样本采样:为锚样本添加更多代理元数据相近的正样本,并假设它们共享相似的判别性语义特征。使用我们的方法,一个在$10^4$次多中心健康脑部MRI扫描上预训练的3D CNN模型,可以为三个分类任务提取相关特征:精神分裂症诊断、双相障碍诊断和阿尔茨海默病检测。经过微调后,它在这些任务上也优于从零开始训练的3D CNN,以及最先进的自监督方法。我们的代码已公开。 摘要:Traditional supervised learning with deep neural networks requires a tremendous amount of labelled data to converge to a good solution. For 3D medical images, it is often impractical to build a large homogeneous annotated dataset for a specific pathology. Self-supervised methods offer a new way to learn a representation of the images in an unsupervised manner with a neural network. In particular, contrastive learning has shown great promises by (almost) matching the performance of fully-supervised CNN on vision tasks. Nonetheless, this method does not take advantage of available meta-data, such as participant's age, viewed as prior knowledge. Here, we propose to leverage continuous proxy metadata, in the contrastive learning framework, by introducing a new loss called y-Aware InfoNCE loss. Specifically, we improve the positive sampling during pre-training by adding more positive examples with similar proxy meta-data with the anchor, assuming they share similar discriminative semantic features.With our method, a 3D CNN model pre-trained on $10^4$ multi-site healthy brain MRI scans can extract relevant features for three classification tasks: schizophrenia, bipolar diagnosis and Alzheimer's detection. When fine-tuned, it also outperforms 3D CNN trained from scratch on these tasks, as well as state-of-the-art self-supervised methods. Our code is made publicly available here.
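y-Aware InfoNCE 的核心是:代理元数据(如年龄)越接近的样本对,越被当作『正样本』加权。下面是一个示意性实现(核函数形式、超参数均为假设,具体以论文为准):

```python
import torch
import torch.nn.functional as F

def y_aware_infonce(z1, z2, y, sigma=5.0, tau=0.1):
    """z1/z2 为同一批样本两个视图的嵌入 [N, D],y 为连续代理元数据 [N]。
    用高斯核 w_ij = exp(-(y_i - y_j)^2 / 2σ^2) 对正样本加权。"""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                                  # [N, N] 相似度
    w = torch.exp(-(y[:, None] - y[None, :]) ** 2 / (2 * sigma ** 2))
    w = w / w.sum(dim=1, keepdim=True)                       # 行归一化核权重
    log_prob = F.log_softmax(sim, dim=1)
    return -(w * log_prob).sum(dim=1).mean()

loss = y_aware_infonce(torch.randn(16, 128), torch.randn(16, 128),
                       y=torch.rand(16) * 60 + 20)           # 年龄 20~80 示意
```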

【2】 Unsupervised Person Re-identification via Multi-Label Prediction and Classification based on Graph-Structural Insight 标题:基于图结构洞察力的多标签预测和分类的无监督人员再识别

作者:Jongmin Yu,Hyeontaek Oh 备注:submitted to ICCV 链接:https://arxiv.org/abs/2106.08798 摘要:本文提出了一种基于图结构洞察的多标签预测与分类的无监督人员再识别(Re-ID)方法。我们的方法从行人图像中提取特征,并构建一个图:特征作为节点,特征间的成对相似度作为边。在该图的基础上,提出的基于图结构的多标签预测(GSMLP)方法通过考虑每个节点的成对相似度和相邻节点分布来预测多标签。GSMLP生成的多标签被用于所提出的选择性多标签分类(SMLC)损失。SMLC集成了难样本挖掘方案和多标签分类。所提出的GSMLP和SMLC在不依赖任何预标注数据集的情况下提升了无监督人员再识别的性能。实验结果达到最新水平,证明了该方法在无监督人员再识别中的优越性。本文的源代码可在 https://github.com/uknownpioneer/GSMLP-SMLC.git 获取。 摘要:This paper addresses unsupervised person re-identification (Re-ID) using multi-label prediction and classification based on graph-structural insight. Our method extracts features from person images and produces a graph that consists of the features and a pairwise similarity of them as nodes and edges, respectively. Based on the graph, the proposed graph structure based multi-label prediction (GSMLP) method predicts multi-labels by considering the pairwise similarity and the adjacency node distribution of each node. The multi-labels created by GSMLP are applied to the proposed selective multi-label classification (SMLC) loss. SMLC integrates a hard-sample mining scheme and a multi-label classification. The proposed GSMLP and SMLC boost the performance of unsupervised person Re-ID without any pre-labelled dataset. Experimental results justify the superiority of the proposed method in unsupervised person Re-ID by producing state-of-the-art performance. The source code for this paper is publicly available on 'https://github.com/uknownpioneer/GSMLP-SMLC.git'.

【3】 Structured DropConnect for Uncertainty Inference in Image Classification 标题:用于图像分类不确定性推理的结构化DropConnect

作者:Wenqing Zheng,Jiyang Xie,Weidong Liu,Zhanyu Ma 备注:5 pages,1 figures 链接:https://arxiv.org/abs/2106.08624 摘要:随着网络结构日益复杂,不确定性推理成为提高人工智能系统分类准确率的重要任务。针对图像分类任务,我们提出了一个结构化DropConnect(SDC)框架,用Dirichlet分布对深度神经网络的输出进行建模。在训练过程中,我们对全连接层的权重引入DropConnect策略。在测试时,我们将网络拆分成若干个子网络,然后通过将Dirichlet分布的矩与这些子网络输出的均值和方差相匹配来对其建模,最后利用所估计的Dirichlet分布的熵进行不确定性推理。本文在LeNet-5和VGG-16模型上实现了该框架,用于MNIST和CIFAR-10数据集上的误分类检测和分布外检测。实验结果表明,所提出的SDC的性能可以与其他不确定性推理方法相媲美。此外,SDC能够很好地适应不同的网络结构,具有一定的泛化能力和研究前景。 摘要:With the complexity of the network structure, uncertainty inference has become an important task to improve the classification accuracy for artificial intelligence systems. For image classification tasks, we propose a structured DropConnect (SDC) framework to model the output of a deep neural network by a Dirichlet distribution. We introduce a DropConnect strategy on weights in the fully connected layers during training. In test, we split the network into several sub-networks, and then model the Dirichlet distribution by match its moments with the mean and variance of the outputs of these sub-networks. The entropy of the estimated Dirichlet distribution is finally utilized for uncertainty inference. In this paper, this framework is implemented on LeNet$5$ and VGG$16$ models for misclassification detection and out-of-distribution detection on MNIST and CIFAR-$10$ datasets. Experimental results show that the performance of the proposed SDC can be comparable to other uncertainty inference methods. Furthermore, the SDC is adapted well to different network structures with certain generalization capabilities and research prospects.
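矩匹配这一步可以写成封闭式:对Dirichlet分布,各维均值 m_i = a_i / a0,方差 v_i = m_i(1 - m_i) / (a0 + 1),因此可由子网络输出的均值和方差反解出浓度参数 a。下面是一个示意性草图(数值稳健化处理为本文假设):

```python
import numpy as np
from scipy.stats import dirichlet

def dirichlet_entropy_from_subnets(probs):
    """probs: 若干子网络的 softmax 输出 [K, C]。
    用矩匹配估计 Dirichlet 浓度参数,并返回该分布的熵作为不确定性。"""
    m = probs.mean(axis=0)
    v = probs.var(axis=0) + 1e-12
    a0 = np.mean(m * (1.0 - m) / v - 1.0)   # 逐维估计 a0 后取平均,增强稳健性
    a0 = max(a0, 1e-3)
    alpha = np.clip(m * a0, 1e-3, None)     # a = m * a0
    return dirichlet(alpha).entropy()       # 熵越大,不确定性越高

# 5 个子网络对一个 10 类样本的预测
p = np.random.dirichlet(np.ones(10), size=5)
print(dirichlet_entropy_from_subnets(p))
```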

【4】 Federated Semi-supervised Medical Image Classification via Inter-client Relation Matching 标题:基于客户间关系匹配的联合半监督医学图像分类

作者:Quande Liu,Hongzheng Yang,Qi Dou,Pheng-Ann Heng 备注:Accepted to MICCAI 2021 链接:https://arxiv.org/abs/2106.08600 摘要:联邦学习(FL)作为联合分布式医疗机构协作训练深度网络的方式,正变得越来越流行。然而,现有的FL算法只支持有监督的训练设置,而现实中的大多数医院由于缺乏预算或专业知识,往往负担不起繁重的数据标注。本文研究了一个实际而又富有挑战性的FL问题,即联邦半监督学习(FSSL),其目的是联合利用来自有标签和无标签客户端(即医院)的数据来学习联邦模型。我们针对该问题提出了一种新方法,它在传统一致性正则化机制的基础上引入了一种新的客户端间关系匹配方案。所提出的学习方案通过对齐各客户端提取的疾病关系,显式地连接有标签与无标签客户端之间的学习,从而缓解无标签客户端任务知识的不足,并促进从无标签样本中挖掘判别信息。我们在两个大规模医学图像分类数据集上验证了该方法。相比现有技术的明显提升,以及在两个任务上的全面消融分析,均证明了该方法的有效性(代码将在 https://github.com/liuquande/FedIRM 提供)。 摘要:Federated learning (FL) has emerged with increasing popularity to collaborate distributed medical institutions for training deep networks. However, despite existing FL algorithms only allow the supervised training setting, most hospitals in realistic usually cannot afford the intricate data labeling due to absence of budget or expertise. This paper studies a practical yet challenging FL problem, named \textit{Federated Semi-supervised Learning} (FSSL), which aims to learn a federated model by jointly utilizing the data from both labeled and unlabeled clients (i.e., hospitals). We present a novel approach for this problem, which improves over traditional consistency regularization mechanism with a new inter-client relation matching scheme. The proposed learning scheme explicitly connects the learning across labeled and unlabeled clients by aligning their extracted disease relationships, thereby mitigating the deficiency of task knowledge at unlabeled clients and promoting discriminative information from unlabeled samples. We validate our method on two large-scale medical image classification datasets. The effectiveness of our method has been demonstrated with the clear improvements over state-of-the-arts as well as the thorough ablation analysis on both tasks\footnote{Code will be made available at \url{https://github.com/liuquande/FedIRM}}.

【5】 Domain Consistency Regularization for Unsupervised Multi-source Domain Adaptive Classification 标题:无监督多源域自适应分类的域一致性正则化

作者:Zhipeng Luo,Xiaobing Zhang,Shijian Lu,Shuai Yi 链接:https://arxiv.org/abs/2106.08590 摘要:基于深度学习的多源无监督领域自适应(MUDA)是近年来研究的热点。与单源无监督域自适应(SUDA)相比,MUDA算法不仅存在于源域和目标域之间,而且存在于多个源域之间。现有的MUDA算法大多侧重于提取各领域之间的领域不变性表示,而忽略了类之间任务特定的决策边界。在本文中,我们提出了一个端到端的可训练网络,利用领域一致性正则化的无监督多源领域自适应分类(CRMA)。CRMA不仅使每对源域和目标域的分布一致,而且使所有域的分布一致。对于每一对源域和目标域,我们采用域内一致性来正则化一对特定于域的分类器以实现域内对齐。此外,我们还设计了一个域间一致性模型,以所有域之间的联合域间对齐为目标。为了解决多个源域和目标域之间不同的相似性,我们设计了一种授权策略,该策略将不同的权限自适应地分配给特定领域的分类器,以实现最佳的伪标签预测和自训练。大量实验表明,CRMA在多源环境下有效地解决了无监督域自适应问题,并在多个MUDA数据集上实现了一致的自适应。 摘要:Deep learning-based multi-source unsupervised domain adaptation (MUDA) has been actively studied in recent years. Compared with single-source unsupervised domain adaptation (SUDA), domain shift in MUDA exists not only between the source and target domains but also among multiple source domains. Most existing MUDA algorithms focus on extracting domain-invariant representations among all domains whereas the task-specific decision boundaries among classes are largely neglected. In this paper, we propose an end-to-end trainable network that exploits domain Consistency Regularization for unsupervised Multi-source domain Adaptive classification (CRMA). CRMA aligns not only the distributions of each pair of source and target domains but also that of all domains. For each pair of source and target domains, we employ an intra-domain consistency to regularize a pair of domain-specific classifiers to achieve intra-domain alignment. In addition, we design an inter-domain consistency that targets joint inter-domain alignment among all domains. To address different similarities between multiple source domains and the target domain, we design an authorization strategy that assigns different authorities to domain-specific classifiers adaptively for optimal pseudo label prediction and self-training. Extensive experiments show that CRMA tackles unsupervised domain adaptation effectively under a multi-source setup and achieves superior adaptation consistently across multiple MUDA datasets.

【6】 Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI 标题:实时MRI中基于声道形状动力学的无声语音和情感识别

作者:Laxmi Pandey,Ahmed Sabbir Arif 备注:8 pages 链接:https://arxiv.org/abs/2106.08706 摘要:口语的语音是通过改变声道周围发音器官的配置来产生的,其中蕴含着丰富的信息,可以用来更好地理解人类言语产生的潜在机制。我们提出了一种新的基于深度神经网络的学习框架,它能够理解由实时磁共振成像(rtMRI)捕获的、语音产生过程中变长的声道形状序列所蕴含的声学信息,并将其翻译成文本。该框架由时空卷积、循环网络和连接主义时间分类(CTC)损失组成,完全端到端训练。在USC-TIMIT语料库上,该模型在句子级别取得了40.6%的音素错误率(PER),明显优于现有模型。据我们所知,这是第一个基于rtMRI视频捕捉到的个体发音运动来识别完整口语句子的研究。我们还分析了不同情绪和性别下声道各子区域(即咽部、软腭与舌背、硬腭、唇部收缩区)发音几何结构的变化。结果表明,每个子区域的形变都同时受到情绪和性别的影响。 摘要:Speech sounds of spoken language are obtained by varying configuration of the articulators surrounding the vocal tract. They contain abundant information that can be utilized to better understand the underlying mechanism of human speech production. We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production, captured by real-time magnetic resonance imaging (rtMRI), and translate it into text. The proposed framework comprises of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. On the USC-TIMIT corpus, the model achieved a 40.6% PER at sentence-level, much better compared to the existing models. To the best of our knowledge, this is the first study that demonstrates the recognition of entire spoken sentence based on an individual's articulatory motions captured by rtMRI video. We also performed an analysis of variations in the geometry of articulation in each sub-regions of the vocal tract (i.e., pharyngeal, velar and dorsal, hard palate, labial constriction region) with respect to different emotions and genders. Results suggest that each sub-regions distortion is affected by both emotion and gender.
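『时空卷积 + 循环网络 + CTC』这一流水线可以用 PyTorch 勾勒出骨架。以下是一个极简草图(层数、通道数、音素表大小均为假设,非论文官方配置):

```python
import torch
import torch.nn as nn

class RtMRI2Text(nn.Module):
    """rtMRI 视频到文本的端到端骨架:时空卷积提特征,GRU 建模时序,CTC 对齐。"""
    def __init__(self, vocab=40):                 # 含 blank 的音素表大小(假设)
        super().__init__()
        self.conv = nn.Sequential(                # 输入 [B, 1, T, H, W]
            nn.Conv3d(1, 32, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)))   # 保留时间维,空间压到 4x4
        self.rnn = nn.GRU(64 * 16, 256, batch_first=True, bidirectional=True)
        self.head = nn.Linear(512, vocab)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, video, targets, tgt_lens):
        f = self.conv(video)                      # [B, C, T, 4, 4]
        b, c, t = f.shape[:3]
        f = f.permute(0, 2, 1, 3, 4).reshape(b, t, -1)
        logp = self.head(self.rnn(f)[0]).log_softmax(-1)     # [B, T, V]
        in_lens = torch.full((b,), t, dtype=torch.long)
        return self.ctc(logp.transpose(0, 1), targets, in_lens, tgt_lens)

model = RtMRI2Text()
loss = model(torch.randn(2, 1, 25, 64, 64),
             torch.randint(1, 40, (2, 12)), torch.tensor([12, 12]))
```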

分割|语义相关(5篇)

【1】 Unsupervised Domain Adaptation with Variational Approximation for Cardiac Segmentation 标题:基于变分逼近的无监督域自适应心脏分割

作者:Fuping Wu,Xiahai Zhuang 备注:accepted by IEEE Transactions on Medical Imaging 链接:https://arxiv.org/abs/2106.08752 摘要:无监督域自适应在医学图像分割中有着重要的应用。特别地,当目标图像的真实标注不可用时,域自适应可以利用来自其他模态的已标注图像来训练针对目标域的模型。已有研究大多将源域和目标域的图像映射到一个共同的潜在特征空间中,然后通过对抗训练隐式地、或通过直接最小化差异度量显式地减少它们之间的差异。在这项工作中,我们提出了一个新的框架,将两个域的潜在特征驱动到一个共同的参数化变分形式,其给定图像的条件分布为高斯分布。这通过两个基于变分自编码器(VAE)的网络和一个针对该变分近似的正则项来实现。两个VAE(每个域一个)都包含一个分割模块,其中源域分割以有监督方式训练,而目标域分割以无监督方式训练。我们使用两个心脏分割任务验证了所提出的域自适应方法,即跨模态(CT和MR)全心分割和跨序列心脏MR分割。结果表明,与两种最新方法相比,该方法获得了更好的分割精度,显示出良好的心脏分割潜力。此外,所提出的显式正则化被证明能有效且高效地缩小域间分布差距,这对无监督域自适应很有用。我们的代码和数据已通过 https://zmiclab.github.io/projects.html 发布。 摘要:Unsupervised domain adaptation is useful in medical image segmentation. Particularly, when ground truths of the target images are not available, domain adaptation can train a target-specific model by utilizing the existing labeled images from other modalities. Most of the reported works mapped images of both the source and target domains into a common latent feature space, and then reduced their discrepancy either implicitly with adversarial training or explicitly by directly minimizing a discrepancy metric. In this work, we propose a new framework, where the latent features of both domains are driven towards a common and parameterized variational form, whose conditional distribution given the image is Gaussian. This is achieved by two networks based on variational auto-encoders (VAEs) and a regularization for this variational approximation. Both of the VAEs, each for one domain, contain a segmentation module, where the source segmentation is trained in a supervised manner, while the target one is trained unsupervisedly. We validated the proposed domain adaptation method using two cardiac segmentation tasks, i.e., the cross-modality (CT and MR) whole heart segmentation and the cross-sequence cardiac MR segmentation. Results show that the proposed method achieved better accuracies compared to two state-of-the-art approaches and demonstrated good potential for cardiac segmentation. Furthermore, the proposed explicit regularization was shown to be effective and efficient in narrowing down the distribution gap between domains, which is useful for unsupervised domain adaptation. Our code and data has been released via https://zmiclab.github.io/projects.html.
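论文中正则项的具体形式以原文为准;把两个域的高斯条件分布显式拉近的一种常见写法,是对角高斯之间的闭式 KL 散度。以下仅作示意:

```python
import torch

def kl_diag_gaussians(mu1, logvar1, mu2, logvar2):
    """两个对角高斯 N(mu1, var1)、N(mu2, var2) 之间的 KL 散度(逐样本求和)。
    可作为把源域与目标域变分后验拉向共同形式的显式正则项(示意)。"""
    var1, var2 = logvar1.exp(), logvar2.exp()
    kl = 0.5 * (logvar2 - logvar1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    return kl.sum(dim=1).mean()

mu_s, lv_s = torch.randn(8, 32), torch.zeros(8, 32)   # 源域编码器输出(示意)
mu_t, lv_t = torch.randn(8, 32), torch.zeros(8, 32)   # 目标域编码器输出(示意)
reg = kl_diag_gaussians(mu_s, lv_s, mu_t, lv_t)
```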

【2】 AtrialGeneral: Domain Generalization for Left Atrial Segmentation of Multi-Center LGE MRIs 标题:AtrialGeneral:多中心LGE MRI左心房分割的域泛化

作者:Lei Li,Veronika A. Zimmer,Julia A. Schnabel,Xiahai Zhuang 备注:10 pages, 4 figures, MICCAI2021 链接:https://arxiv.org/abs/2106.08727 摘要:从晚期钆增强磁共振成像(LGE-MRI)中分割左心房(LA)是计划心房颤动治疗的关键步骤。然而,由于LGE-MRI图像质量差、LA形状变异性大、LA边界不清晰等原因,自动分割LA仍然是一个挑战。虽然基于深度学习的方法可以提供很有前途的LA分割结果,但是它们通常很难推广到不可见的领域,例如来自不同扫描仪和/或站点的数据。在这项工作中,我们收集了210个来自不同中心的不同图像质量水平的LGE-mri。为了评估模型在LA切分任务中的领域泛化能力,我们采用了四种常用的语义切分网络对多中心LGE-MRIs进行LA切分。此外,我们还研究了直方图匹配、基于互信息的非纠缠表示和随机风格转换三种领域泛化策略,其中简单的直方图匹配是最有效的。 摘要:Left atrial (LA) segmentation from late gadolinium enhanced magnetic resonance imaging (LGE MRI) is a crucial step needed for planning the treatment of atrial fibrillation. However, automatic LA segmentation from LGE MRI is still challenging, due to the poor image quality, high variability in LA shapes, and unclear LA boundary. Though deep learning-based methods can provide promising LA segmentation results, they often generalize poorly to unseen domains, such as data from different scanners and/or sites. In this work, we collect 210 LGE MRIs from different centers with different levels of image quality. To evaluate the domain generalization ability of models on the LA segmentation task, we employ four commonly used semantic segmentation networks for the LA segmentation from multi-center LGE MRIs. Besides, we investigate three domain generalization strategies, i.e., histogram matching, mutual information based disentangled representation, and random style transfer, where a simple histogram matching is proved to be most effective.
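摘要中被证明最有效的策略是简单的直方图匹配,用 scikit-image 即可演示其用法(此处用随机数组代替 LGE MRI 灰度图,仅作示意):

```python
import numpy as np
from skimage.exposure import match_histograms

# 把目标中心图像的灰度分布匹配到参考中心图像上,
# 作为跨中心 LGE MRI 左心房分割的简单域泛化预处理
reference = np.random.rand(256, 256).astype(np.float32)   # 参考域图像(示意)
moving = (np.random.rand(256, 256) ** 2).astype(np.float32)
matched = match_histograms(moving, reference)
```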

【3】 CMF: Cascaded Multi-model Fusion for Referring Image Segmentation 标题:CMF:用于参考图像分割的级联多模型融合

作者:Jianhua Yang,Yan Huang,Zhanyu Ma,Liang Wang 备注:Accepted by ICIP 2021 链接:https://arxiv.org/abs/2106.08617 摘要:在这项工作中,我们研究参考图像分割(RIS)任务,其目的是为自然语言表达式所描述的对象预测分割掩码。现有方法大多侧重于在视觉和语言特征之间建立单向或定向的关系来关联两种模态,而多尺度上下文被忽略或建模不足。在多模态融合过程中,多尺度上下文对于定位和分割尺度变化较大的目标至关重要。为了解决这个问题,我们提出了一个简单而有效的级联多模态融合(CMF)模块,该模块将多个空洞(atrous)卷积层并行堆叠,并进一步引入级联分支来融合视觉和语言特征。级联分支可以逐步整合多尺度上下文信息,并在多模态融合过程中促进两种模态的对齐。在四个基准数据集上的实验结果表明,我们的方法优于大多数最先进的方法。代码位于 https://github.com/jianhua2022/CMF-Refseg。 摘要:In this work, we address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or directional relationships between visual and linguistic features to associate two modalities together, while the multi-scale context is ignored or insufficiently modeled. Multi-scale context is crucial to localize and segment those objects that have large scale variations during the multi-modal fusion process. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch can progressively integrate multi-scale contextual information and facilitate the alignment of two modalities during the multi-modal fusion process. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods. Code is available at https://github.com/jianhua2022/CMF-Refseg.
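『并行空洞卷积 + 级联融合』的结构可以粗略勾勒如下(示意性草图,非官方实现;空洞率、通道数与级联方式均为本文假设):

```python
import torch
import torch.nn as nn

class CascadedMultiModalFusion(nn.Module):
    """CMF 模块示意:并行堆叠多个空洞卷积,
    并用级联分支逐步整合多尺度上下文与语言特征。"""
    def __init__(self, c_vis=256, c_lang=256, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c_vis + c_lang, c_vis, 3, padding=r, dilation=r)
            for r in rates)
        self.cascade = nn.Conv2d(c_vis, c_vis, 3, padding=1)  # 级联分支
        self.out = nn.Conv2d(c_vis * len(rates), c_vis, 1)

    def forward(self, vis, lang):              # vis: [B,C,H,W], lang: [B,C]
        lang_map = lang[:, :, None, None].expand(-1, -1, *vis.shape[2:])
        x = torch.cat([vis, lang_map], dim=1)  # 视觉-语言特征拼接
        feats, prev = [], None
        for branch in self.branches:
            cur = branch(x)
            if prev is not None:
                cur = cur + self.cascade(prev)  # 逐步融入上一尺度的输出
            prev = torch.relu(cur)
            feats.append(prev)
        return self.out(torch.cat(feats, dim=1))

m = CascadedMultiModalFusion()
y = m(torch.randn(2, 256, 40, 40), torch.randn(2, 256))
```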

【4】 Disentangling Semantic-to-visual Confusion for Zero-shot Learning 标题:解缠语义到视觉的混淆以实现零样本学习

作者:Zihan Ye,Fuyuan Hu,Fan Lyu,Linyan Li,Kaizhu Huang 备注:Accepted by IEEE TRANSACTIONS ON MULTIMEDIA (TMM) in 2021 链接:https://arxiv.org/abs/2106.08605 摘要:利用生成模型从语义分布中合成视觉特征,是近年来零样本学习(ZSL)图像分类最流行的解决方案之一。三元组损失(TL)常被用来通过自动搜索判别性表示,从语义生成逼真的视觉分布。然而,由于ZSL中不可见类不可用,传统TL无法为不可见类搜索可靠的解纠缠表示。为了缓解这一缺陷,我们在这项工作中提出了一种多模态三元组损失(MMTL),利用多模态信息来搜索解纠缠的表示空间。这样,所有类都可以相互作用,有利于在搜索到的空间中学习解纠缠的类表示。此外,我们开发了一个称为解纠缠类表示生成对抗网络(DCR-GAN)的新模型,专注于在训练、特征合成和最终识别阶段利用解纠缠表示。得益于解纠缠表示,DCR-GAN可以在可见与不可见特征上拟合更逼真的分布。大量实验表明,我们提出的模型在四个基准数据集上的性能优于现有最先进的方法。我们的代码见 https://github.com/FouriYe/DCRGAN-TMM。 摘要:Using generative models to synthesize visual features from semantic distribution is one of the most popular solutions to ZSL image classification in recent years. The triplet loss (TL) is popularly used to generate realistic visual distributions from semantics by automatically searching discriminative representations. However, the traditional TL cannot search reliable unseen disentangled representations due to the unavailability of unseen classes in ZSL. To alleviate this drawback, we propose in this work a multi-modal triplet loss (MMTL) which utilizes multimodal information to search a disentangled representation space. As such, all classes can interplay which can benefit learning disentangled class representations in the searched space. Furthermore, we develop a novel model called Disentangling Class Representation Generative Adversarial Network (DCR-GAN) focusing on exploiting the disentangled representations in training, feature synthesis, and final recognition stages. Benefiting from the disentangled representations, DCR-GAN could fit a more realistic distribution over both seen and unseen features. Extensive experiments show that our proposed model can lead to superior performance to the state-of-the-arts on four benchmark datasets. Our code is available at https://github.com/FouriYe/DCRGAN-TMM.

【5】 ICDAR 2021 Competition on Components Segmentation Task of Document Photos 标题:ICDAR 2021文档照片分量分割任务竞赛

作者:Celso A. M. Lopes Junior,Ricardo B. das Neves Junior,Byron L. D. Bezerra,Alejandro H. Toselli,Donato Impedovo 备注:This paper was accepted for ICDAR 2021 Conference 链接:https://arxiv.org/abs/2106.08499 摘要:本文描述了在第16届国际文档分析与识别会议(ICDAR 2021)背景下组织的文档照片部件分割任务短期竞赛。本次竞赛旨在汇聚身份证件图像处理领域的研究人员,并为他们提供一个合适的基准,以比较各自在文档图像部件分割任务上的技术。竞赛提出了三个具有挑战性的任务,要求在提供的数据集上完成不同的分割工作。采集的数据来自多种类型的巴西身份证件,其中的个人信息已被妥善替换。共有16名参赛者,他们在部分或全部三项任务上的结果在所采用的指标上差异很大,例如Dice相似系数介于0.06到0.99之间。参赛者采用了多种深度学习模型和不同策略,以在每项任务中取得最佳效果。结果表明,目前用于解决其中一项任务(文档边界检测)的实用方法已经相当成熟。然而,对于另外两项挑战任务(文本区域检测和手写签名检测),仍需研究和开发更鲁棒的方法才能获得可接受的结果。 摘要:This paper describes the short-term competition on Components Segmentation Task of Document Photos that was prepared in the context of the 16th International Conference on Document Analysis and Recognition (ICDAR 2021). This competition aims to bring together researchers working on the field of identification document image processing and provides them a suitable benchmark to compare their techniques on the component segmentation task of document images. Three challenge tasks were proposed entailing different segmentation assignments to be performed on a provided dataset. The collected data are from several types of Brazilian ID documents, whose personal information was conveniently replaced. There were 16 participants whose results obtained for some or all the three tasks show different rates for the adopted metrics, like Dice Similarity Coefficient ranging from 0.06 to 0.99. Different Deep Learning models were applied by the entrants with diverse strategies to achieve the best results in each of the tasks. Obtained results show that the current applied methods for solving one of the proposed tasks (document boundary detection) are already well established. However, for the other two challenge tasks (text zone and handwritten sign detection) research and development of more robust approaches are still required to achieve acceptable results.

Zero/Few Shot|迁移|域适配|自适应(3篇)

【1】 Bridging Multi-Task Learning and Meta-Learning: Towards Efficient Training and Effective Adaptation 标题:架起多任务学习与元学习之间的桥梁:迈向高效训练和有效适应

作者:Haoxiang Wang,Han Zhao,Bo Li 备注:ICML 2021 camera-ready version. Code is released at this https URL 链接:https://arxiv.org/abs/2106.09017 摘要:多任务学习(MTL)旨在通过多个相关任务的联合学习来提高它们的泛化能力。相比之下,现代元学习在联合训练方案之外,还允许在测试阶段面对仅有少量标签的未见任务,以期快速适应它们。尽管MTL和元学习在问题表述上存在细微差异,但两种学习范式都基于同一个洞察:现有训练任务之间的共享结构可以带来更好的泛化和适应能力。本文通过理论分析和实证研究,进一步揭示这两种学习范式之间的密切联系。理论上,我们首先证明了MTL与一类基于梯度的元学习(GBML)算法具有相同的优化表述;随后证明,对于具有足够深度的过参数化神经网络,MTL和GBML学习到的预测函数是接近的。特别是,这一结果表明两个模型在同一未见任务上给出的预测是相似的。在实验上,我们证实了理论发现:通过恰当的实现,MTL在一组小样本图像分类基准上与最先进的GBML算法具有竞争力。由于现有的GBML算法往往涉及昂贵的二阶双层优化,我们的一阶MTL方法在大规模数据集(如mini-ImageNet)上要快一个数量级。我们相信,这项工作有助于弥合这两种学习范式之间的差距,并为GBML提供一种同样支持快速任务适应、而计算效率更高的替代方案。 摘要:Multi-task learning (MTL) aims to improve the generalization of several related tasks by learning them jointly. As a comparison, in addition to the joint training scheme, modern meta-learning allows unseen tasks with limited labels during the test phase, in the hope of fast adaptation over them. Despite the subtle difference between MTL and meta-learning in the problem formulation, both learning paradigms share the same insight that the shared structure between existing training tasks could lead to better generalization and adaptation. In this paper, we take one important step further to understand the close connection between these two learning paradigms, through both theoretical analysis and empirical investigation. Theoretically, we first demonstrate that MTL shares the same optimization formulation with a class of gradient-based meta-learning (GBML) algorithms. We then prove that for over-parameterized neural networks with sufficient depth, the learned predictive functions of MTL and GBML are close. In particular, this result implies that the predictions given by these two models are similar over the same unseen task. Empirically, we corroborate our theoretical findings by showing that, with proper implementation, MTL is competitive against state-of-the-art GBML algorithms on a set of few-shot image classification benchmarks. Since existing GBML algorithms often involve costly second-order bi-level optimization, our first-order MTL method is an order of magnitude faster on large-scale datasets such as mini-ImageNet. We believe this work could help bridge the gap between these two learning paradigms, and provide a computationally efficient alternative to GBML that also supports fast task adaptation.

【2】 ECKPN: Explicit Class Knowledge Propagation Network for Transductive Few-shot Learning 标题:ECKPN:面向直推式小样本学习的显式类知识传播网络

作者:Chaofan Chen,Xiaoshan Yang,Changsheng Xu,Xuhui Huang,Zhe Ma 备注:Accepted by CVPR2021 链接:https://arxiv.org/abs/2106.08523 摘要:近年来,基于直推式图的方法在小样本分类任务中取得了很大的成功。然而,大多数现有方法忽略了对类级知识的探索,而人类仅凭少量样本就能轻松学到这种知识。针对这一问题,本文提出了一种由比较、压缩和校正模块组成的显式类知识传播网络(ECKPN)。具体来说,我们首先使用比较模块探索成对样本关系,在实例级图中学习丰富的样本表示;然后压缩实例级图以生成类级图,这有助于获取类级视觉知识,并便于对不同类之间的关系建模;接着,采用校正模块对类间关系进行显式刻画,得到更具判别力的类级知识表示;最后,将类级知识与实例级样本表示相结合,指导查询样本的推理。我们在四个小样本分类基准上进行了大量实验,结果表明所提出的ECKPN明显优于现有最先进的方法。 摘要:Recently, the transductive graph-based methods have achieved great success in the few-shot classification task. However, most existing methods ignore exploring the class-level knowledge that can be easily learned by humans from just a handful of samples. In this paper, we propose an Explicit Class Knowledge Propagation Network (ECKPN), which is composed of the comparison, squeeze and calibration modules, to address this problem. Specifically, we first employ the comparison module to explore the pairwise sample relations to learn rich sample representations in the instance-level graph. Then, we squeeze the instance-level graph to generate the class-level graph, which can help obtain the class-level visual knowledge and facilitate modeling the relations of different classes. Next, the calibration module is adopted to characterize the relations of the classes explicitly to obtain the more discriminative class-level knowledge representations. Finally, we combine the class-level knowledge with the instance-level sample representations to guide the inference of the query samples. We conduct extensive experiments on four few-shot classification benchmarks, and the experimental results show that the proposed ECKPN significantly outperforms the state-of-the-art methods.

【3】 Achieving Domain Robustness in Stereo Matching Networks by Removing Shortcut Learning 标题:消除捷径学习实现立体匹配网络的域鲁棒性

作者:WeiQin Chuah,Ruwan Tennakoon,Alireza Bab-Hadiashar,David Suter 备注:11 pages, 7 figures 链接:https://arxiv.org/abs/2106.08486 摘要:基于学习的立体匹配和深度估计网络目前优于公共基准,并取得了令人印象深刻的结果。然而,最先进的网络往往无法从合成图像推广到更具挑战性的真实数据领域。本文试图通过分析合成图像学习对实际数据性能的影响,揭示实现域鲁棒性的秘密,特别是发现立体匹配网络泛化成功的重要因素。我们提供的证据表明,立体匹配网络在合成域中的特征学习受到合成数据中呈现的两个“捷径”的严重影响:(1)合成立体图像中匹配像素之间的相同局部统计(RGB颜色特征)和(2)合成纹理中缺乏真实感在游戏引擎中模拟的3D对象。我们将展示,通过移除这些快捷方式,我们可以在最先进的立体匹配框架中实现域鲁棒性,并在多个真实数据集上产生显著的性能,尽管事实上网络仅在合成数据上训练。我们的实验结果指出,消除合成数据中的捷径是实现合成数据域和真实数据域之间的域不变泛化的关键。 摘要:Learning-based stereo matching and depth estimation networks currently excel on public benchmarks with impressive results. However, state-of-the-art networks often fail to generalize from synthetic imagery to more challenging real data domains. This paper is an attempt to uncover hidden secrets of achieving domain robustness and in particular, discovering the important ingredients of generalization success of stereo matching networks by analyzing the effect of synthetic image learning on real data performance. We provide evidence that demonstrates that learning of features in the synthetic domain by a stereo matching network is heavily influenced by two "shortcuts" presented in the synthetic data: (1) identical local statistics (RGB colour features) between matching pixels in the synthetic stereo images and (2) lack of realism in synthetic textures on 3D objects simulated in game engines. We will show that by removing such shortcuts, we can achieve domain robustness in the state-of-the-art stereo matching frameworks and produce a remarkable performance on multiple realistic datasets, despite the fact that the networks were trained on synthetic data, only. Our experimental results point to the fact that eliminating shortcuts from the synthetic data is key to achieve domain-invariant generalization between synthetic and real data domains.

半弱无监督|主动学习|不确定性(4篇)

【1】 Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation 标题:无监督图像到图像翻译中解缠潜在样式空间的平滑

作者:Yahui Liu,Enver Sangineto,Yajing Chen,Linchao Bao,Haoxian Zhang,Nicu Sebe,Bruno Lepri,Wei Wang,Marco De Nadai 备注:Accepted to CVPR 2021 链接:https://arxiv.org/abs/2106.09016 摘要:图像到图像(I2I)多域翻译模型通常也使用其语义插值结果的质量进行评估。然而,最先进的模型在插值过程中经常会出现图像外观的突变,并且在跨域插值时表现较差。在本文中,我们提出了一种新的基于三种特定损失的训练协议,该协议有助于翻译网络学习一个平滑且不纠缠的潜在风格空间:1)域内和域间插值都对应于生成图像的渐变;2)在翻译过程中更好地保留源图像的内容。此外,我们还提出了一种新的评价指标来衡量I2I翻译模型潜在风格空间的平滑度。我们在不同数据集上的大量实验表明,该方法能显著提高生成图像的质量和插值的渐进性。 摘要:Image-to-Image (I2I) multi-domain translation models are usually evaluated also using the quality of their semantic interpolation results. However, state-of-the-art models frequently show abrupt changes in the image appearance during interpolation, and usually perform poorly in interpolations across domains. In this paper, we propose a new training protocol based on three specific losses which help a translation network to learn a smooth and disentangled latent style space in which: 1) Both intra- and inter-domain interpolations correspond to gradual changes in the generated images and 2) The content of the source image is better preserved during the translation. Moreover, we propose a novel evaluation metric to properly measure the smoothness of latent style space of I2I translation models. The proposed method can be plugged into existing translation approaches, and our extensive experiments on different datasets show that it can significantly boost the quality of the generated images and the graduality of the interpolations.

【2】 PatchNet: Unsupervised Object Discovery based on Patch Embedding 标题:PatchNet:基于补丁嵌入的无监督对象发现

作者:Hankyu Moon,Heng Hao,Sima Didari,Jae Oh Woo,Patrick Bangert 链接:https://arxiv.org/abs/2106.08599 摘要:我们证明了通过自监督从少量图像(100到200)中训练随机抽样的面片可以发现频繁出现的物体。这种方法的关键是模式空间,一种表示给定图像数据的所有可能子图像的潜在模式空间。模式空间中的距离结构反映了频繁对象导致的模式共生现象。模式空间嵌入是通过最小化随机生成的相邻面片之间的对比损失来学习的。为了防止嵌入对背景的学习,我们采用基于颜色的目标显著性和背景相异性来调节对比度损失。学习的距离结构作为目标记忆,通过对随机抽取的样本样本进行模式向量聚类,简单地发现频繁目标。基于图像块的图像表示方法很自然地处理了位置和尺度不变性,这对多目标发现至关重要。该方法已经被证明非常有效,并成功地应用于从自然图像中发现多个人脸和人体。 摘要:We demonstrate that frequently appearing objects can be discovered by training randomly sampled patches from a small number of images (100 to 200) by self-supervision. Key to this approach is the pattern space, a latent space of patterns that represents all possible sub-images of the given image data. The distance structure in the pattern space captures the co-occurrence of patterns due to the frequent objects. The pattern space embedding is learned by minimizing the contrastive loss between randomly generated adjacent patches. To prevent the embedding from learning the background, we modulate the contrastive loss by color-based object saliency and background dissimilarity. The learned distance structure serves as object memory, and the frequent objects are simply discovered by clustering the pattern vectors from the random patches sampled for inference. Our image representation based on image patches naturally handles the position and scale invariance property that is crucial to multi-object discovery. The method has been proven surprisingly effective, and successfully applied to finding multiple human faces and bodies from natural images.

【3】 Unsupervised-learning-based method for chest MRI-CT transformation using structure constrained unsupervised generative attention networks 标题:基于结构约束无监督生成性注意网络的胸部MRI-CT变换的无监督学习方法

作者:Hidetoshi Matsuo,Mizuho Nishio,Munenobu Nogami,Feibi Zeng,Takako Kurimoto,Sandeep Kaushik,Florian Wiesinger,Atsushi K Kono,Takamichi Murakami 备注:27 pages, 12 figures 链接:https://arxiv.org/abs/2106.08557 摘要:集成正电子发射断层扫描/磁共振成像(PET/MRI)扫描仪有助于通过PET同时获取代谢信息,并通过MRI获得具有高软组织对比度的形态学信息。尽管PET/MRI有助于高精度融合图像的获取,但其主要缺点是在进行衰减校正时遇到困难,这是定量PET评估所必需的。PET/MRI联合扫描由于伽玛射线衰减信息与MRIs之间没有直接关系,需要从MRI生成衰减校正图。尽管基于MRI的骨组织分割可以很容易地用于头部和骨盆区域,但是通过胸部CT生成来实现精确的骨分割仍然是一项具有挑战性的任务。这可以归因于呼吸和心脏运动发生在胸部以及其解剖结构复杂和相对薄的骨皮质。该文提出了一种在不需要人工标注的情况下,通过在生成性对抗网络(GAN)中加入结构约束来减少解剖结构变化的方法。本研究的结果显示,所提出的U-GAT-IT+MIND方法优于所有其他竞争方法。这项研究的发现暗示了从胸部MRI合成临床上可接受的CT图像的可能性,无需人类注释,从而将解剖结构的变化最小化。 摘要:The integrated positron emission tomography/magnetic resonance imaging (PET/MRI) scanner facilitates the simultaneous acquisition of metabolic information via PET and morphological information with high soft-tissue contrast using MRI. Although PET/MRI facilitates the capture of high-accuracy fusion images, its major drawback can be attributed to the difficulty encountered when performing attenuation correction, which is necessary for quantitative PET evaluation. The combined PET/MRI scanning requires the generation of attenuation-correction maps from MRI owing to no direct relationship between the gamma-ray attenuation information and MRIs. While MRI-based bone-tissue segmentation can be readily performed for the head and pelvis regions, the realization of accurate bone segmentation via chest CT generation remains a challenging task. This can be attributed to the respiratory and cardiac motions occurring in the chest as well as its anatomically complicated structure and relatively thin bone cortex. This paper presents a means to minimise the anatomical structural changes without human annotation by adding structural constraints using a modality-independent neighbourhood descriptor (MIND) to a generative adversarial network (GAN) that can transform unpaired images. The results obtained in this study revealed the proposed U-GAT-IT + MIND approach to outperform all other competing approaches. The findings of this study hint towards possibility of synthesising clinically acceptable CT images from chest MRI without human annotation, thereby minimising the changes in the anatomical structure.

【4】 Watching Too Much Television is Good: Self-Supervised Audio-Visual Representation Learning from Movies and TV Shows 标题:电视看得太多是好的:从影视节目中学习自我监督的视听表征

作者:Mahdi M. Kalayeh,Nagendra Kamath,Lingyi Liu,Ashok Chandrashekar 链接:https://arxiv.org/abs/2106.08513 摘要:声音丰富且易于获取,而且听觉线索能揭示场景中发生的大量事情,这使得视听空间成为自监督表示学习的一个非常直观的选择。然而,目前的文献表明,与以监督方式收集的精选(curated)数据相比,在未精选(uncurated)数据上训练得到的表示要差得多,而且只有当数据量显著增加时差距才会缩小。此外,已知所学表示的质量深受用于自监督训练的精选数据集的规模和类目体系的影响。这就引出了一个问题:当我们的自监督工作仍几乎完全依赖精选数据时,我们是否过早地庆祝赶上了监督学习。在本文中,我们研究了把电影和电视节目作为视听自监督学习的未精选数据来源的有效性。我们证明,一个基于对比学习的简单模型,在电影和电视节目的集合上训练后,不仅显著优于在大若干个数量级的未精选数据集上训练的更复杂方法,而且与从大规模精选数据中学习的最新技术相比也非常有竞争力。我们发现,贯穿整部电影反复出现的视听模式(如主角的出场、标志性场景与场面调度)会在对比学习的表述中造成过多的简单负例。基于这一观察,我们提出了一种分层采样策略;该策略虽然简单,却能有效提升性能,尤其是在语义多样性天然较低的电视节目上学习时。 摘要:The abundance and ease of utilizing sound, along with the fact that auditory clues reveal so much about what happens in the scene, make the audio-visual space a perfectly intuitive choice for self-supervised representation learning. However, the current literature suggests that training on uncurated data yields considerably poorer representations compared to the curated alternatives collected in supervised manner, and the gap only narrows when the volume of data significantly increases. Furthermore, the quality of learned representations is known to be heavily influenced by the size and taxonomy of the curated datasets used for self-supervised training. This begs the question of whether we are celebrating too early on catching up with supervised learning when our self-supervised efforts still rely almost exclusively on curated data. In this paper, we study the efficacy of learning from Movies and TV Shows as forms of uncurated data for audio-visual self-supervised learning. We demonstrate that a simple model based on contrastive learning, trained on a collection of movies and TV shows, not only dramatically outperforms more complex methods which are trained on orders of magnitude larger uncurated datasets, but also performs very competitively with the state-of-the-art that learns from large-scale curated data. We identify that audiovisual patterns like the appearance of the main character or prominent scenes and mise-en-scène which frequently occur through the whole duration of a movie, lead to an overabundance of easy negative instances in the contrastive learning formulation. Capitalizing on such observation, we propose a hierarchical sampling policy, which despite its simplicity, effectively improves the performance, particularly when learning from TV shows which naturally face less semantic diversity.

时序|行为识别|姿态|视频|运动估计(3篇)

【1】 C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues 标题:C^3:面向基于视频的对话的组合式反事实对比学习

作者:Hung Le,Nancy F. Chen,Steven C. H. Hoi 备注:22 pages, 11 figures, 7 tables 链接:https://arxiv.org/abs/2106.08914 摘要:基于视频的对话系统旨在整合视频理解和对话理解,以生成与对话和视频上下文都相关的回复。现有方法大多采用深度学习模型,在相对较小的数据集上取得了显著的效果。然而,这些结果部分是通过利用数据集中的偏差、而非发展多模态推理能力实现的,导致泛化能力有限。在本文中,我们提出了一种新的组合式反事实对比学习方法($C^3$),在基于视频的对话中开展事实样本与反事实样本之间的对比训练。具体来说,我们基于视频中的时间步和对话中的词元(token)设计了事实/反事实采样,并提出了利用对象级或动作级差异的对比损失函数。与以往方法不同,我们着眼于组合输出词元之间的隐藏状态表示,以便在生成设置下优化表示空间。我们在视听场景感知对话(AVSD)基准上取得了可观的性能提升,并展示了该方法在关联视频与对话上下文方面的优势。 摘要:Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach of Compositional Counterfactual Contrastive Learning ($C^3$) to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and tokens in dialogues and propose contrastive loss functions that exploit object-level or action-level variance. Different from prior approaches, we focus on contrastive hidden state representations among compositional output tokens to optimize the representation space in a generation setting. We achieved promising performance gains on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark and showed the benefits of our approach in grounding video and dialogue context.

【2】 X-MAN: Explaining multiple sources of anomalies in video

Authors: Stanislaw Szymanowicz, James Charles, Roberto Cipolla
Note: In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2021
Link: https://arxiv.org/abs/2106.08856
Abstract: Our objective is to detect anomalies in video while also automatically explaining the reason behind the detector's response. In a practical sense, explainability is crucial for this task as the required response to an anomaly depends on its nature and severity. However, most leading methods (based on deep neural networks) are not interpretable and hide the decision-making process in uninterpretable feature representations. In an effort to tackle this problem we make the following contributions: (1) we show how to build interpretable feature representations suitable for detecting anomalies with state-of-the-art performance, (2) we propose an interpretable probabilistic anomaly detector which can describe the reason behind its response using high-level concepts, (3) we are the first to directly consider object interactions for anomaly detection, and (4) we propose a new task of explaining anomalies and release a large dataset for evaluating methods on this task. Our method competes well with the state of the art on public datasets while also providing anomaly explanations based on objects and their interactions.

【3】 Improved CNN-based Learning of Interpolation Filters for Low-Complexity Inter Prediction in Video Coding

Authors: Luka Murn, Saverio Blasi, Alan F. Smeaton, Marta Mrak
Note: IEEE Open Journal of Signal Processing, Special Issue on Applied AI and Machine Learning for Video Coding and Streaming, June 2021
Link: https://arxiv.org/abs/2106.08936
Abstract: The versatility of recent machine learning approaches makes them ideal for improving next-generation video compression solutions. Unfortunately, these approaches typically bring significant increases in computational complexity and are difficult to interpret into explainable models, affecting their potential for implementation within practical video coding applications. This paper introduces a novel explainable neural-network-based inter-prediction scheme to improve the interpolation of reference samples needed for fractional-precision motion compensation. The approach requires a single neural network to be trained, from which a full quarter-pixel interpolation filter set is derived, as the network is easily interpretable due to its linear structure. A novel training framework enables each network branch to resemble a specific fractional shift. This practical solution makes it very efficient to use alongside conventional video coding schemes. When implemented in the context of the state-of-the-art Versatile Video Coding (VVC) test model, BD-rate savings of 0.77%, 1.27% and 2.25% can be achieved on average for lower-resolution sequences under the random access, low-delay B and low-delay P configurations, respectively, while the complexity of the learned interpolation schemes is significantly reduced compared to interpolation with full CNNs.

Medical (2 papers)

【1】 Over-and-Under Complete Convolutional RNN for MRI Reconstruction

Authors: Pengfei Guo, Jeya Maria Jose Valanarasu, Puyang Wang, Jinyuan Zhou, Shanshan Jiang, Vishal M. Patel
Note: Accepted to MICCAI 2021
Link: https://arxiv.org/abs/2106.08886
Abstract: Reconstructing magnetic resonance (MR) images from undersampled data is a challenging problem due to the various artifacts introduced by the under-sampling operation. Recent deep learning-based methods for MR image reconstruction usually leverage a generic auto-encoder architecture which captures low-level features at the initial layers and high-level features at the deeper layers. Such networks focus heavily on global features, which may not be optimal for reconstructing the fully-sampled image. In this paper, we propose an Over-and-Under Complete Convolutional Recurrent Neural Network (OUCR), which consists of an overcomplete and an undercomplete Convolutional Recurrent Neural Network (CRNN). The overcomplete branch gives special attention to learning local structures by restraining the receptive field of the network. Combining it with the undercomplete branch leads to a network which focuses more on low-level features without losing out on the global structures. Extensive experiments on two datasets demonstrate that the proposed method achieves significant improvements over compressed sensing and popular deep learning-based methods with fewer trainable parameters. Our code is available at https://github.com/guopengf/OUCR.

【2】 Multi-scale Neural ODEs for 3D Medical Image Registration

Authors: Junshen Xu, Eric Z. Chen, Xiao Chen, Terrence Chen, Shanhui Sun
Link: https://arxiv.org/abs/2106.08493
Abstract: Image registration plays an important role in medical image analysis. Conventional optimization-based methods provide accurate estimates thanks to their iterative process, at the cost of expensive computation. Deep learning methods such as learn-to-map are much faster, but an iterative or coarse-to-fine approach is still required to improve accuracy when handling large motions. In this work, we propose to learn a registration optimizer via a multi-scale neural ODE model. The inference consists of iterative gradient updates, similar to a conventional gradient descent optimizer but much faster, because the neural ODE learns from the training data to adapt the gradient efficiently at each iteration. Furthermore, we propose to learn a modality-independent similarity metric to address image appearance variations across different image contrasts. We performed evaluations through extensive experiments on multi-contrast 3D MR images from both public and private data sources, and demonstrate the superior performance of our proposed methods.
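
The inference loop the abstract describes, gradient-descent-like updates whose direction is adapted by a trained network, can be sketched generically as below. The names loss_fn, grad_net, and the parameterisation of the deformation phi are placeholders for illustration, not the paper's actual components.

    import torch

    def learned_registration(phi0, loss_fn, grad_net, n_steps=10, dt=0.1):
        # phi0:     initial deformation parameters (any tensor shape)
        # loss_fn:  maps phi to a scalar (dis)similarity between the warped
        #           moving image and the fixed image
        # grad_net: trained network that adapts the raw gradient so that a
        #           few steps do the work of many plain descent iterations
        phi = phi0.clone().requires_grad_(True)
        for _ in range(n_steps):
            (g,) = torch.autograd.grad(loss_fn(phi), phi)
            phi = (phi - dt * grad_net(g)).detach().requires_grad_(True)
        return phi.detach()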

GAN | Adversarial | Attack | Generation (5 papers)

【1】 Cascading Modular Network (CAM-Net) for Multimodal Image Synthesis

Authors: Shichong Peng, Alireza Moazeni, Ke Li
Note: Videos available as ancillary files
Link: https://arxiv.org/abs/2106.09015
Abstract: Deep generative models such as GANs have driven impressive advances in conditional image synthesis in recent years. A persistent challenge has been to generate diverse versions of output images from the same input image, due to the problem of mode collapse: because only one ground-truth output image is given per input image, only one mode of the conditional distribution is modelled. In this paper, we focus on this problem of multimodal conditional image synthesis and build on the recently proposed technique of Implicit Maximum Likelihood Estimation (IMLE). Prior IMLE-based methods required different architectures for different tasks, which limits their applicability, and were lacking in fine details in the generated images. We propose CAM-Net, a unified architecture that can be applied to a broad range of tasks. Additionally, it is capable of generating convincing high-frequency details, achieving a reduction of the Frechet Inception Distance (FID) by up to 45.3% compared to the baseline.
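
The IMLE objective on which CAM-Net builds is simple to state in code: for each training image, draw several latent samples and penalise only the distance to the nearest one, so every data point attracts some generated sample and no mode gets dropped. Below is a generic, unconditional sketch; the generator interface is an assumption.

    import torch

    def imle_loss(generator, x_batch, z_dim=64, n_samples=10):
        losses = []
        for x in x_batch:                            # x: (C, H, W) image
            z = torch.randn(n_samples, z_dim, device=x.device)
            fakes = generator(z)                     # (n_samples, C, H, W)
            dists = ((fakes - x) ** 2).flatten(1).mean(dim=1)
            losses.append(dists.min())               # nearest sample only
        return torch.stack(losses).mean()            # minimised by SGD

In the conditional, multimodal setting of the paper, the generator additionally takes the input image, and distances are typically measured in a perceptual feature space rather than pixel space.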

【2】 Learning to Disentangle GAN Fingerprint for Fake Image Attribution

Authors: Tianyun Yang, Juan Cao, Qiang Sheng, Lei Li, Jiaqi Ji, Xirong Li, Sheng Tang
Note: 10 pages, 7 figures
Link: https://arxiv.org/abs/2106.08749
Abstract: The rapid pace of generative models has brought about new threats to visual forensics, such as malicious personation and digital copyright infringement, which motivates work on fake image attribution. Existing works on fake image attribution mainly rely on a direct classification framework. Without additional supervision, the extracted features could include many content-relevant components and generalize poorly. Meanwhile, how to obtain an interpretable GAN fingerprint to explain the decision remains an open question. Adopting a multi-task framework, we propose a GAN Fingerprint Disentangling Network (GFD-Net) to simultaneously disentangle the fingerprint from GAN-generated images and produce a content-irrelevant representation for fake image attribution. A series of constraints are provided to guarantee the stability and discriminability of the fingerprint, which in turn helps content-irrelevant feature extraction. Further, we perform a comprehensive analysis of the GAN fingerprint, providing some clues about its properties and which factors dominate the fingerprint in the GAN architecture. Experiments show that our GFD-Net achieves superior fake image attribution performance in both closed-world and open-world testing. We also apply our method to binary fake image detection and exhibit a significant generalization ability on unseen generators.

【3】 Compound Frechet Inception Distance for Quality Assessment of GAN Created Images

Authors: Eric J. Nunn, Pejman Khadivi, Shadrokh Samavi
Note: 11 pages, 10 figures
Link: https://arxiv.org/abs/2106.08575
Abstract: Generative adversarial networks, or GANs, are a type of generative modeling framework. GANs involve a pair of neural networks engaged in a competition, iteratively creating fake data indistinguishable from real data. One notable application of GANs is developing fake human faces, also known as "deep fakes", due to the deep learning algorithms at the core of the GAN framework. Measuring the quality of the generated images is inherently subjective, but attempts to objectify quality using standardized metrics have been made. One example of an objective metric is the Frechet Inception Distance (FID), which measures the difference between distributions of feature vectors for two separate datasets of images. There are situations in which images with low perceptual quality are not assigned appropriate FID scores. We propose to improve the robustness of the evaluation process by integrating lower-level features to cover a wider array of visual defects. Our proposed method integrates three levels of feature abstraction to evaluate the quality of generated images. Experimental evaluations show better performance of the proposed method for distorted images.
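
For reference, the standard FID reduces to a closed-form distance between two Gaussians fitted to feature activations, and the "compound" variant can be read as repeating this at several feature depths and combining the scores. A sketch follows; the per-level weighting is an assumption for illustration, not the paper's exact choice.

    import numpy as np
    from scipy import linalg

    def frechet_distance(mu1, cov1, mu2, cov2, eps=1e-6):
        # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2))
        covmean, _ = linalg.sqrtm(cov1 @ cov2, disp=False)
        if not np.isfinite(covmean).all():           # stabilise sqrtm
            jitter = np.eye(cov1.shape[0]) * eps
            covmean, _ = linalg.sqrtm((cov1 + jitter) @ (cov2 + jitter),
                                      disp=False)
        diff = mu1 - mu2
        return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean.real))

    def compound_fid(stats_per_level, weights):
        # stats_per_level: [(mu1, cov1, mu2, cov2), ...], shallow to deep
        return sum(w * frechet_distance(*s)
                   for w, s in zip(weights, stats_per_level))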

【4】 Tackling the Challenges in Scene Graph Generation with Local-to-Global Interactions

Authors: Sangmin Woo, Junhyug Noh, Kangil Kim
Note: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Link: https://arxiv.org/abs/2106.08543
Abstract: In this work, we seek new insights into the underlying challenges of the Scene Graph Generation (SGG) task. Quantitative and qualitative analysis of the Visual Genome dataset implies: (1) Ambiguity: even if inter-object relationships contain the same object (or predicate), they may not be visually or semantically similar; (2) Asymmetry: despite the directional nature of a relationship, direction was not well addressed in previous studies; and (3) Higher-order contexts: leveraging the identities of certain graph elements can help to generate accurate scene graphs. Motivated by this analysis, we design a novel SGG framework, Local-to-Global Interaction Networks (LOGIN). Locally, interactions extract the essence between three instances (subject, object, and background) while baking direction awareness into the network by constraining the input order. Globally, interactions encode the contexts between every graph component (nodes and edges). We also introduce an Attract & Repel loss which finely adjusts predicate embeddings. Our framework enables predicting the scene graph in a local-to-global manner by design, leveraging the possible complementariness. To quantify how much LOGIN is aware of relational direction, we propose a new diagnostic task called Bidirectional Relationship Classification (BRC). We see that LOGIN can distinguish relational direction better than existing methods (on the BRC task) while showing state-of-the-art results on the Visual Genome benchmark (on the SGG task).

【5】 Dynamically Grown Generative Adversarial Networks

Authors: Lanlan Liu, Yuting Zhang, Jia Deng, Stefano Soatto
Note: Accepted to AAAI 2021
Link: https://arxiv.org/abs/2106.08505
Abstract: Recent work introduced progressive network growing as a promising way to ease the training of large GANs, but the model design and architecture-growing strategy remain under-explored and need manual design for different image data. In this paper, we propose a method to dynamically grow a GAN during training, optimizing the network architecture and its parameters together with automation. The method embeds architecture search techniques as an interleaving step with gradient-based training to periodically seek the optimal architecture-growing strategy for the generator and discriminator. It enjoys the benefits of both eased training, because of progressive growing, and improved performance, because of the broader architecture design space. Experimental results demonstrate a new state of the art in image generation. Observations in the search procedure also provide constructive insights into GAN model design, such as the generator-discriminator balance and convolutional layer choices.

Autonomous Driving | Vehicles | Lane Detection (1 paper)

【1】 A Multi-Layered Approach for Measuring the Simulation-to-Reality Gap of Radar Perception for Autonomous Driving

Authors: Anthony Ngo, Max Paul Bauer, Michael Resch
Note: Accepted at the 24th IEEE International Conference on Intelligent Transportation Systems (ITSC 2021)
Link: https://arxiv.org/abs/2106.08372
Abstract: With the increasing safety validation requirements for the release of a self-driving car, alternative approaches, such as simulation-based testing, are emerging in addition to conventional real-world testing. In order to rely on virtual tests, the employed sensor models have to be validated. For this reason, it is necessary to quantify the discrepancy between simulation and reality in order to determine whether a certain fidelity is sufficient for a desired intended use. There exists no sound method to measure this simulation-to-reality gap of radar perception for autonomous driving. We address this problem by introducing a multi-layered evaluation approach, which consists of a combination of an explicit and an implicit sensor model evaluation. The former directly evaluates the realism of the synthetically generated sensor data, while the latter refers to an evaluation of a downstream target application. In order to demonstrate the method, we evaluated the fidelity of three typical radar model types (ideal, data-driven, ray-tracing-based) and their applicability for virtually testing radar-based multi-object tracking. We have shown the effectiveness of the proposed approach in terms of providing an in-depth sensor model assessment that renders existing disparities visible and enables a realistic estimation of the overall model fidelity across different scenarios.

OCR | Text (1 paper)

【1】 TextStyleBrush: Transfer of Text Aesthetics from a Single Example

Authors: Praveen Krishnan, Rama Kovvuri, Guan Pang, Boris Vassilev, Tal Hassner
Note: 18 pages, 13 figures
Link: https://arxiv.org/abs/2106.08385
Abstract: We present a novel approach for disentangling the content of a text image from all aspects of its appearance. The appearance representation we derive can then be applied to new content, for one-shot transfer of the source style to new content. We learn this disentanglement in a self-supervised manner. Our method processes entire word boxes, without requiring segmentation of text from background, per-character processing, or making assumptions on string lengths. We show results in different text domains which were previously handled by specialized methods, e.g., scene text and handwritten text. To these ends, we make a number of technical contributions: (1) We disentangle the style and content of a textual image into a non-parametric, fixed-dimensional vector. (2) We propose a novel approach inspired by StyleGAN but conditioned on the example style at different resolutions and on the content. (3) We present novel self-supervised training criteria which preserve both source style and target content using a pre-trained font classifier and text recognizer. Finally, (4) we also introduce Imgur5K, a new challenging dataset for handwritten word images. We offer numerous qualitative photo-realistic results of our method. We further show that our method surpasses previous work in quantitative tests on scene text and handwriting datasets, as well as in a user study.

Attention (3 papers)

【1】 Invertible Attention

Authors: Jiajun Zha, Yiran Zhong, Jing Zhang, Liang Zheng, Richard Hartley
Note: 19 pages
Link: https://arxiv.org/abs/2106.09003
Abstract: Attention has been proved to be an efficient mechanism for capturing long-range dependencies. However, so far it has not been deployed in invertible networks. This is due to the fact that, in order to make a network invertible, every component within the network needs to be a bijective transformation, but a normal attention block is not. In this paper, we propose invertible attention that can be plugged into existing invertible models. We mathematically and experimentally prove that the invertibility of an attention model can be achieved by carefully constraining its Lipschitz constant. We validate the invertibility of our invertible attention on the image reconstruction task with three popular datasets: CIFAR-10, SVHN, and CelebA. We also show that our invertible attention achieves similar performance to normal non-invertible attention on dense prediction tasks.
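
The central observation, that a block g placed in a residual connection becomes invertible once its Lipschitz constant is pushed below 1, has a well-known practical consequence (the same reasoning used by invertible residual networks, offered here as a generic sketch rather than the paper's exact block): the inverse of x -> x + g(x) can be computed by fixed-point iteration.

    import torch

    @torch.no_grad()
    def invert_residual(g, y, n_iters=25):
        # Solves y = x + g(x) for x. If Lip(g) < 1, the map x -> y - g(x)
        # is a contraction, so by Banach's fixed-point theorem the
        # iteration below converges to the unique pre-image of y.
        x = y.clone()
        for _ in range(n_iters):
            x = y - g(x)
        return x

In practice, constraining Lip(g) usually amounts to rescaling the block's weight matrices by estimates of their spectral norms.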

【2】 EdgeConv with Attention Module for Monocular Depth Estimation

Authors: Minhyeok Lee, Sangwon Hwang, Chaewon Park, Sangyoun Lee
Note: 10 pages, 7 figures
Link: https://arxiv.org/abs/2106.08615
Abstract: Monocular depth estimation is an especially important task in robotics and autonomous driving, where 3D structural information is essential. However, extreme lighting conditions and complex surface objects make it difficult to predict depth from a single image. Therefore, to generate accurate depth maps, it is important for the model to learn structural information about the scene. We propose a novel Patch-Wise EdgeConv Module (PEM) and EdgeConv Attention Module (EAM) to address the difficulty of monocular depth estimation. The proposed modules extract structural information by learning the relationship between image patches close to each other in space using edge convolution. Our method is evaluated on two popular datasets, the NYU Depth V2 and the KITTI Eigen split, achieving state-of-the-art performance. We show through various comparative experiments that the proposed model predicts depth robustly in challenging scenes.

【3】 DMSANet: Dual Multi Scale Attention Network

Authors: Abhinav Sagar
Note: 11 pages, 3 figures, 8 tables, submitted to NeurIPS 2021
Link: https://arxiv.org/abs/2106.08382
Abstract: The attention mechanism has lately been quite popular in the computer vision community. A lot of work has been done to improve the performance of networks with attention, although it almost always results in increased computational complexity. In this paper, we propose a new attention module that not only achieves the best performance but also has fewer parameters than most existing models. Our attention module can easily be integrated with other convolutional neural networks because of its lightweight nature. The proposed network, named Dual Multi Scale Attention Network (DMSANet), is comprised of two parts: the first part is used to extract features at various scales and aggregate them; the second part uses spatial and channel attention modules in parallel to adaptively integrate local features with their global dependencies. We benchmark our network's performance on image classification on the ImageNet dataset, and on object detection and instance segmentation on the MS COCO dataset.
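
The "channel and spatial attention in parallel" pattern from the second part can be illustrated with a generic module; this is the common squeeze-and-excite plus spatial-mask construction, offered as a sketch of the pattern rather than the exact DMSANet block.

    import torch
    import torch.nn as nn

    class ParallelDualAttention(nn.Module):
        def __init__(self, channels, reduction=8):
            super().__init__()
            self.channel = nn.Sequential(            # per-channel gate
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
            self.spatial = nn.Sequential(            # per-location gate
                nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

        def forward(self, x):                        # x: (B, C, H, W)
            # Both branches see the same input and their gated outputs are
            # summed, i.e. the two attention maps are applied in parallel.
            return x * self.channel(x) + x * self.spatial(x)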

Face | Crowd Counting (1 paper)

【1】 Toward Affective XAI: Facial Affect Analysis for Understanding Explainable Human-AI Interactions

Authors: Luke Guerdan, Alex Raymond, Hatice Gunes
Link: https://arxiv.org/abs/2106.08761
Abstract: As machine learning approaches are increasingly used to augment human decision-making, eXplainable Artificial Intelligence (XAI) research has explored methods for communicating system behavior to humans. However, these approaches often fail to account for the emotional responses of humans as they interact with explanations. Facial affect analysis, which examines human facial expressions of emotions, is one promising lens for understanding how users engage with explanations. Therefore, in this work, we aim to (1) identify which facial affect features are pronounced when people interact with XAI interfaces, and (2) develop a multitask feature embedding for linking facial affect signals with participants' use of explanations. Our analyses and results show that the occurrence and values of facial AU1 and AU4, and Arousal, are heightened when participants fail to use explanations effectively. This suggests that facial affect analysis should be incorporated into XAI to personalize explanations to individuals' interaction styles and to adapt explanations based on the difficulty of the task performed.

Tracking (1 paper)

【1】 SiamAPN++: Siamese Attentional Aggregation Network for Real-Time UAV Tracking

Authors: Ziang Cao, Changhong Fu, Junjie Ye, Bowen Li, Yiming Li
Link: https://arxiv.org/abs/2106.08816
Abstract: Recently, Siamese-based methods have stood out from multitudinous tracking methods owing to their state-of-the-art (SOTA) performance. Nevertheless, due to various special challenges in UAV tracking, e.g., severe occlusion and fast motion, most existing Siamese-based trackers hardly combine superior performance with high efficiency. To address this concern, in this paper a novel attentional Siamese tracker (SiamAPN++) is proposed for real-time UAV tracking. By virtue of the attention mechanism, the attentional aggregation network (AAN) is conducted with a self-AAN and a cross-AAN, ultimately raising the expression ability of features. The former aggregates and models the self-semantic interdependencies of a single feature map via the spatial and channel dimensions. The latter aims to aggregate the cross-interdependencies of different semantic features, including the location information of anchors. In addition, a dual-feature version of the anchor proposal network is proposed to raise the robustness of proposing anchors, increasing the ability to perceive objects of various scales. Experiments on two well-known authoritative benchmarks are conducted, where SiamAPN++ outperforms its baseline SiamAPN and other SOTA trackers. Besides, real-world tests onboard a typical embedded platform demonstrate that SiamAPN++ achieves promising tracking results at real-time speed.

Representation Learning (2 papers)

【1】 Evolving Image Compositions for Feature Representation Learning

Authors: Paola Cascante-Bonilla, Arshdeep Sekhon, Yanjun Qi, Vicente Ordonez
Link: https://arxiv.org/abs/2106.09011
Abstract: Convolutional neural networks for visual recognition require large amounts of training samples and usually benefit from data augmentation. This paper proposes PatchMix, a data augmentation method that creates new samples by composing patches from pairs of images in a grid-like pattern. These new samples' ground-truth labels are set proportional to the number of patches from each image. We then add a set of additional losses at the patch level to regularize and to encourage good representations at both the patch and image levels. A ResNet-50 model trained on ImageNet using PatchMix exhibits superior transfer learning capabilities across a wide array of benchmarks. Although PatchMix can rely on random pairings and random grid-like patterns for mixing, we explore evolutionary search as a guiding strategy to jointly discover optimal grid-like patterns and image pairings. For this purpose, we conceive a fitness function that bypasses the need to re-train a model to evaluate each choice. In this way, PatchMix outperforms a base model on CIFAR-10 (+1.91), CIFAR-100 (+5.31), Tiny ImageNet (+3.52), and ImageNet (+1.16) by significant margins, also outperforming previous state-of-the-art pairwise augmentation strategies.
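
A random-pairing variant of PatchMix is easy to sketch: tile each image into a grid, replace a random subset of cells with the corresponding cells of a paired image, and soften the label in proportion to the patch counts. The evolutionary search over patterns and pairings is the paper's addition on top of this baseline; the code below shows only the random version and assumes H and W are divisible by the grid size.

    import torch
    import torch.nn.functional as F

    def patchmix(images, labels, num_classes, grid=4):
        n, c, h, w = images.shape                    # h, w divisible by grid
        perm = torch.randperm(n, device=images.device)
        ph, pw = h // grid, w // grid
        take = torch.rand(n, grid, grid, device=images.device) < 0.5
        mixed = images.clone()
        for i in range(grid):
            for j in range(grid):
                sel = take[:, i, j]
                ys, xs = i * ph, j * pw
                mixed[sel, :, ys:ys + ph, xs:xs + pw] = \
                    images[perm][sel, :, ys:ys + ph, xs:xs + pw]
        frac = take.flatten(1).float().mean(dim=1)   # share of swapped cells
        one_hot = F.one_hot(labels, num_classes).float()
        soft = (1 - frac)[:, None] * one_hot + frac[:, None] * one_hot[perm]
        return mixed, soft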

【2】 Learning Implicit Glyph Shape Representation

Authors: Ying-Tian Liu, Yuan-Chen Guo, Yi-Xiao Li, Chen Wang, Song-Hai Zhang
Link: https://arxiv.org/abs/2106.08573
Abstract: In this paper, we present a novel implicit glyph shape representation, which models glyphs as shape primitives enclosed by quadratic curves and naturally enables generating glyph images at arbitrarily high resolutions. Experiments on font reconstruction and interpolation tasks verify that this structured implicit representation is suitable for describing both the structure and style features of glyphs. Furthermore, based on the proposed representation, we design a simple yet effective disentangled network for the challenging one-shot font style transfer problem, and achieve the best results compared to state-of-the-art alternatives in both quantitative and qualitative comparisons. Benefiting from this representation, our generated glyphs have the potential to be converted to vector fonts through post-processing, reducing the gap between rasterized images and vector graphics. We hope this work can provide a powerful tool for 2D shape analysis and synthesis, and inspire further exploitation of implicit representations for 2D shape modeling.

Visual Explanation | Video Understanding | VQA | Captioning (1 paper)

【1】 Understanding and Evaluating Racial Biases in Image Captioning

Authors: Dora Zhao, Angelina Wang, Olga Russakovsky
Link: https://arxiv.org/abs/2106.08503
Abstract: Image captioning is an important task for benchmarking visual reasoning and for enabling accessibility for people with vision impairments. However, as in many machine learning settings, social biases can influence image captioning in undesirable ways. In this work, we study bias propagation pathways within image captioning, focusing specifically on the COCO dataset. Prior work has analyzed gender bias in captions using automatically-derived gender labels; here we examine racial and intersectional biases using manual annotations. Our first contribution is in annotating the perceived gender and skin color of 28,315 of the depicted people after obtaining IRB approval. Using these annotations, we compare racial biases present in both manual and automatically-generated image captions. We demonstrate differences in caption performance, sentiment, and word choice between images of lighter- versus darker-skinned people. Further, we find the magnitude of these differences to be greater in modern captioning systems compared to older ones, leading to concerns that, without proper consideration and mitigation, these differences will only become increasingly prevalent. Code and data are available at https://princetonvisualai.github.io/imagecaptioning-bias.

Point Cloud | SLAM | Radar | LiDAR | Depth/RGB-D (1 paper)

【1】 Differentiable Diffusion for Dense Depth Estimation from Multi-view Images

Authors: Numair Khan, Min H. Kim, James Tompkin
Link: https://arxiv.org/abs/2106.08917
Abstract: We present a method to estimate dense depth by optimizing a sparse set of points such that their diffusion into a depth map minimizes a multi-view reprojection error from RGB supervision. We optimize point positions, depths, and weights with respect to the loss by differential splatting that models points as Gaussians with analytic transmittance. Further, we develop an efficient optimization routine that can simultaneously optimize the 50k+ points required for complex scene reconstruction. We validate our routine using ground-truth data and show high reconstruction quality. Then, we apply this to light field and wider-baseline images via self-supervision, and show improvements in both average and outlier error for depth maps diffused from inaccurate sparse points. Finally, we compare qualitative and quantitative results to image processing and deep learning methods.

Multimodal (1 paper)

【1】 A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods

Authors: Gullal S. Cheema, Sherzod Hakimov, Eric Müller-Budack, Ralph Ewerth
Note: Accepted at the Workshop on Multi-Modal Pre-Training for Multimedia Understanding (MMPT 2021), co-located with ICMR 2021
Link: https://arxiv.org/abs/2106.08829
Abstract: Opinion and sentiment analysis is a vital task for characterizing subjective information in social media posts. In this paper, we present a comprehensive experimental evaluation and comparison of six state-of-the-art methods, one of which we have re-implemented. In addition, we investigate different textual and visual feature embeddings that cover different aspects of the content, as well as the recently introduced multimodal CLIP embeddings. Experimental results are presented for two different publicly available benchmark datasets of tweets and corresponding images. In contrast to the evaluation methodology of previous work, we introduce a reproducible and fair evaluation scheme to make results comparable. Finally, we conduct an error analysis to outline the limitations of the methods and possibilities for future work.

3D | 3D Reconstruction (2 papers)

【1】 GelSight Wedge: Measuring High-Resolution 3D Contact Geometry with a Compact Robot Finger

Authors: Shaoxiong Wang, Yu She, Branden Romero, Edward Adelson
Note: ICRA 2021
Link: https://arxiv.org/abs/2106.08851
Abstract: Vision-based tactile sensors have the potential to provide important contact geometry for localizing objects under visual occlusion. However, it is challenging to measure high-resolution 3D contact geometry with a compact robot finger while simultaneously meeting optical and mechanical constraints. In this work, we present the GelSight Wedge sensor, which is optimized to have a compact shape for robot fingers while achieving high-resolution 3D reconstruction. We evaluate the 3D reconstruction under different lighting configurations, and extend the method from 3 lights to 1 or 2 lights. We demonstrate the flexibility of the design by shrinking the sensor to the size of a human finger for fine manipulation tasks. We also show the effectiveness and potential of the reconstructed 3D geometry for pose tracking in 3D space.

【2】 Shape from Blur: Recovering Textured 3D Shape and Motion of Fast Moving Objects

Authors: Denys Rozumnyi, Martin R. Oswald, Vittorio Ferrari, Marc Pollefeys
Note: 15 pages, 8 figures, 2 tables
Link: https://arxiv.org/abs/2106.08762
Abstract: We address the novel task of jointly reconstructing the 3D shape, texture, and motion of an object from a single motion-blurred image. While previous approaches address the deblurring problem only in the 2D image domain, our rigorous modeling of all object properties in the 3D domain enables the correct description of arbitrary object motion. This leads to significantly better image decomposition and sharper deblurring results. We model the observed appearance of a motion-blurred object as a combination of the background and a 3D object with constant translation and rotation. Our method minimizes a loss on reconstructing the input image via differentiable rendering with suitable regularizers. This enables estimating the textured 3D mesh of the blurred object with high fidelity. Our method substantially outperforms competing approaches on several benchmarks for fast-moving object deblurring. Qualitative results show that the reconstructed 3D mesh generates high-quality temporal super-resolution and novel views of the deblurred object.

Other Neural Networks | Deep Learning | Models | Modeling (4 papers)

【1】 Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch

Authors: Hossein Souri, Micah Goldblum, Liam Fowl, Rama Chellappa, Tom Goldstein
Link: https://arxiv.org/abs/2106.08970
Abstract: As the curation of data for machine learning becomes increasingly automated, dataset tampering is a mounting threat. Backdoor attackers tamper with training data to embed a vulnerability in models that are trained on that data. This vulnerability is then activated at inference time by placing a "trigger" into the model's input. Typical backdoor attacks insert the trigger directly into the training data, although the presence of such an attack may be visible upon inspection. In contrast, the Hidden Trigger Backdoor Attack achieves poisoning without placing a trigger into the training data at all. However, this hidden trigger attack is ineffective at poisoning neural networks trained from scratch. We develop a new hidden trigger attack, Sleeper Agent, which employs gradient matching, data selection, and target model re-training during the crafting process. Sleeper Agent is the first hidden trigger backdoor attack to be effective against neural networks trained from scratch. We demonstrate its effectiveness on ImageNet and in black-box settings. Our implementation code can be found at https://github.com/hsouri/Sleeper-Agent.

【2】 Temporal Convolution Networks with Positional Encoding for Evoked Expression Estimation

Authors: VanThong Huynh, Guee-Sang Lee, Hyung-Jeong Yang, Soo-Huyng Kim
Note: Oral presentation at AUVi Workshop - CVPR 2021 (this https URL). Source code available at this https URL
Link: https://arxiv.org/abs/2106.08596
Abstract: This paper presents an approach for the Evoked Expressions from Videos (EEV) challenge, which aims to predict evoked facial expressions from video. We take advantage of models pre-trained on large-scale datasets in computer vision and audio signals to extract deep representations of the timestamps in a video. A temporal convolution network, rather than an RNN-like architecture, is used to explore temporal relationships due to its advantages in memory consumption and parallelism. Furthermore, to address the missing annotations of some timestamps, positional encoding is employed to ensure the continuity of the input data when these timestamps are discarded during training. We achieved state-of-the-art results on the EEV challenge with a Pearson correlation coefficient of 0.05477, the first-ranked performance in the EEV 2021 challenge.
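
The positional-encoding trick for dropped timestamps is the standard sinusoidal encoding computed from absolute timestamp indices, so the temporal convolution still sees where a feature sits even when unannotated steps are discarded. A sketch assuming an even feature dimension:

    import torch

    def sinusoidal_encoding(positions, d_model):
        # positions: (T,) float tensor of absolute timestamp indices; gaps
        # left by discarded (unannotated) timestamps are thereby preserved.
        i = torch.arange(0, d_model, 2, dtype=torch.float32)
        angles = positions[:, None] / torch.pow(10000.0, i[None, :] / d_model)
        pe = torch.zeros(positions.numel(), d_model)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe                                    # added to the features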

【3】 Machine learning-based analysis of hyperspectral images for automated sepsis diagnosis

Authors: Maximilian Dietrich, Silvia Seidlitz, Nicholas Schreck, Manuel Wiesenfarth, Patrick Godau, Minu Tizabi, Jan Sellner, Sebastian Marx, Samuel Knödler, Michael M. Allers, Leonardo Ayala, Karsten Schmidt, Thorsten Brenner, Alexander Studier-Fischer, Felix Nickel, Beat P. Müller-Stich, Annette Kopp-Schneider, Markus A. Weigand, Lena Maier-Hein
Note: Maximilian Dietrich and Silvia Seidlitz contributed equally. Markus A. Weigand and Lena Maier-Hein contributed equally
Link: https://arxiv.org/abs/2106.08445
Abstract: Sepsis is a leading cause of mortality and critical illness worldwide. While robust biomarkers for early diagnosis are still missing, recent work indicates that hyperspectral imaging (HSI) has the potential to overcome this bottleneck by monitoring microcirculatory alterations. Automated machine learning-based diagnosis of sepsis based on HSI data, however, has not been explored to date. Given this gap in the literature, we leveraged an existing data set to (1) investigate whether HSI-based automated diagnosis of sepsis is possible and (2) put forth a list of possible confounders relevant for HSI-based tissue classification. While we were able to classify sepsis with an accuracy of over 98% using the existing data, our research also revealed several subject-, therapy- and imaging-related confounders that may lead to an overestimation of algorithm performance when not balanced across the patient groups. We conclude that further prospective studies, carefully designed with respect to these confounders, are necessary to confirm the preliminary results obtained in this study.

【4】 Explaining decision of model from its prediction

Authors: Dipesh Tamboli
Note: Literature review
Link: https://arxiv.org/abs/2106.08366
Abstract: This document summarizes different visual explanation methods: CAM and Grad-CAM, localization using Multiple Instance Learning (saliency-based methods), saliency-driven class impressions, muting pixels in the input image (adversarial methods), and activation visualization and convolution filter visualization (feature-based methods). We also show the results produced by the different methods and a comparison between CAM, Grad-CAM, and Guided Backpropagation.
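
As a concrete anchor for the survey, the original CAM is essentially one line of algebra: project the final convolutional feature maps onto the classifier weights of the target class. A minimal sketch:

    import torch

    def class_activation_map(features, fc_weight, class_idx):
        # features:  (C, H, W) activations before global average pooling
        # fc_weight: (num_classes, C) weights of the final linear layer
        cam = torch.einsum('c,chw->hw', fc_weight[class_idx], features)
        cam = torch.relu(cam)                        # keep positive evidence
        return cam / (cam.max() + 1e-8)              # normalise for display

Grad-CAM generalises this by replacing the classifier weights with (pooled) gradients of the class score with respect to the feature maps, which removes the need for a global-average-pooling architecture.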

Others (10 papers)

【1】 An unifying point of view on expressive power of GNNs

Authors: Giuseppe Alessio D'Inverno, Monica Bianchini, Maria Lucia Sampoli, Franco Scarselli
Note: 16 pages, 3 figures
Link: https://arxiv.org/abs/2106.08992
Abstract: Graph Neural Networks (GNNs) are a wide class of connectionist models for graph processing. They perform an iterative message passing operation on each node and its neighbors, to solve classification/clustering tasks, on some nodes or on the whole graph, collecting all such messages regardless of their order. Despite the differences among the various models belonging to this class, most of them adopt the same computation scheme, based on a local aggregation mechanism, and, intuitively, this local computational framework is mainly responsible for the expressive power of GNNs. In this paper, we prove that the Weisfeiler-Lehman test induces an equivalence relationship on the graph nodes that exactly corresponds to the unfolding equivalence defined on the original GNN model. Therefore, the results on the expressive power of the original GNNs can be extended to general GNNs which, under mild conditions, can be proved capable of approximating, in probability and up to any precision, any function on graphs that respects the unfolding equivalence.
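
The Weisfeiler-Lehman test at the heart of this result is itself a small algorithm: iteratively recolour each node from its own colour together with the multiset of its neighbours' colours. The paper shows the resulting node equivalence coincides with unfolding equivalence. A plain-Python sketch of 1-WL colour refinement:

    def weisfeiler_lehman(adj, n_iters=3):
        # adj: {node: [neighbours]} for an undirected graph
        colours = {v: 0 for v in adj}                # uniform start
        for _ in range(n_iters):
            sigs = {v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
                    for v in adj}
            palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
            colours = {v: palette[sigs[v]] for v in adj}
        return colours                               # stable node colouring

Two nodes that receive different colours are provably distinguishable by a message-passing GNN; nodes that share a colour cannot be separated by local aggregation alone.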

【2】 The Oxford Road Boundaries Dataset

Authors: Tarlan Suleymanov, Matthew Gadd, Daniele De Martini, Paul Newman
Note: Accepted for publication at the workshop "3D-DLAD: 3D-Deep Learning for Autonomous Driving" (WS15), Intelligent Vehicles Symposium (IV 2021)
Link: https://arxiv.org/abs/2106.08983
Abstract: In this paper we present the Oxford Road Boundaries Dataset, designed for training and testing machine-learning-based road-boundary detection and inference approaches. We have hand-annotated two of the 10 km-long forays from the Oxford Robotcar Dataset and generated from other forays several thousand further examples with semi-annotated road-boundary masks. To boost the number of training samples in this way, we used a vision-based localiser to project labels from the annotated datasets to other traversals at different times and weather conditions. As a result, we release 62605 labelled samples, of which 47639 samples are curated. Each of these samples contains both raw and classified masks for the left and right lenses. Our data contains images from a diverse set of scenarios such as straight roads, parked cars, junctions, etc. Files for download and tools for manipulating the labelled data are available at: oxford-robotics-institute.github.io/road-boundaries-dataset

【3】 Structure First Detail Next: Image Inpainting with Pyramid Generator

Authors: Shuyi Qu, Zhenxing Niu, Kaizhu Huang, Jianke Zhu, Matan Protter, Gadi Zimerman, Yinghui Xu
Note: ICCV'21 under review
Link: https://arxiv.org/abs/2106.08905
Abstract: Recent deep generative models have achieved promising performance in image inpainting. However, it is still very challenging for a neural network to generate realistic image details and textures, due to its inherent spectral bias. Based on our understanding of how artists work, we suggest adopting a "structure first, detail next" workflow for image inpainting. To this end, we propose to build a Pyramid Generator by stacking several sub-generators, where lower-layer sub-generators focus on restoring image structures while the higher-layer sub-generators emphasize image details. Given an input image, it is gradually restored by going through the entire pyramid in a bottom-up fashion. In particular, our approach has a learning scheme of progressively increasing the hole size, which allows it to restore large-hole images. In addition, our method can fully exploit the benefits of learning with high-resolution images, and hence is suitable for high-resolution image inpainting. Extensive experimental results on benchmark datasets validate the effectiveness of our approach compared with state-of-the-art methods.

【4】 Metamorphic image registration using a semi-Lagrangian scheme

Authors: Anton François, Pietro Gori, Joan Glaunès
Link: https://arxiv.org/abs/2106.08817
Abstract: In this paper, we propose an implementation of both Large Deformation Diffeomorphic Metric Mapping (LDDMM) and Metamorphosis image registration using a semi-Lagrangian scheme for geodesic shooting. We propose to solve both problems as an inexact matching, providing a single and unifying cost function. We demonstrate that for image registration the use of a semi-Lagrangian scheme is more stable than a standard Eulerian scheme. Our GPU implementation is based on PyTorch, which greatly simplifies and accelerates the computations thanks to its powerful automatic differentiation engine. It will be freely available at https://github.com/antonfrancois/Demeter_metamorphosis.

【5】 Robustness of Object Detectors in Degrading Weather Conditions

Authors: Muhammad Jehanzeb Mirza, Cornelius Buerkle, Julio Jarquin, Michael Opitz, Fabian Oboril, Kay-Ulrich Scholl, Horst Bischof
Note: Accepted for publication at ITSC 2021
Link: https://arxiv.org/abs/2106.08795
Abstract: State-of-the-art object detection systems for autonomous driving achieve promising results in clear weather conditions. However, such safety-critical autonomous systems also need to work in degrading weather conditions, such as rain, fog and snow. Unfortunately, most approaches evaluate only on the KITTI dataset, which consists solely of clear-weather scenes. In this paper we address this issue and perform one of the most detailed evaluations of single- and dual-modality architectures on data captured in real weather conditions. We analyse the performance degradation of these architectures in degrading weather conditions. We demonstrate that an object detection architecture performing well in clear weather might not be able to handle degrading weather conditions. We also perform ablation studies on the dual-modality architectures and show their limitations.

【6】 Mobile Augmented Reality: User Interfaces, Frameworks, and Intelligence

Authors: Jacky Cao, Kit-Yung Lam, Lik-Hang Lee, Xiaoli Liu, Pan Hui, Xiang Su
Note: This work is currently under review in an international journal
Link: https://arxiv.org/abs/2106.08710
Abstract: Mobile Augmented Reality (MAR) integrates computer-generated virtual objects with physical environments for mobile devices. MAR systems enable users to interact with MAR devices, such as smartphones and head-worn wearables, and perform seamless transitions from the physical world to a mixed world with digital entities. These MAR systems support user experiences by using MAR devices to provide universal access to digital content. Over the past 20 years, a number of MAR systems have been developed; however, the studies and design of MAR frameworks have not yet been systematically reviewed from the perspective of user-centric design. This article presents the first effort at surveying existing MAR frameworks (count: 37) and further discusses the latest studies on MAR through a top-down approach: (1) MAR applications; (2) MAR visualisation techniques adaptive to user mobility and contexts; (3) systematic evaluation of MAR frameworks, including supported platforms and corresponding features such as tracking, feature extraction, and sensing capabilities; and (4) underlying machine learning approaches supporting intelligent operations within MAR systems. Finally, we summarise the development of emerging research fields and the current state of the art, and discuss the important open challenges and possible theoretical and technical directions. This survey aims to benefit both researchers and MAR system developers alike.

【7】 ParticleAugment: Sampling-Based Data Augmentation

Authors: Alexander Tsaregorodtsev, Vasileios Belagiannis
Note: 11 pages. Submitted to NeurIPS 2021
Link: https://arxiv.org/abs/2106.08693
Abstract: We present an automated data augmentation approach for image classification. We formulate the problem as Monte Carlo sampling, where our goal is to approximate the optimal augmentation policies. We propose a particle filtering formulation to find optimal augmentation policies and their schedules during model training. Our performance measurement procedure relies on a validation subset of our training set, while the policy transition model depends on a Gaussian prior and an optional augmentation velocity parameter. In our experiments, we show that our formulation for automated augmentation reaches promising results on the CIFAR-10, CIFAR-100, and ImageNet datasets using the standard network architectures for this problem. By comparing with related work, we also show that our method reaches a balance between the computational cost of policy search and the model performance.
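
A generic particle-filter update over policy parameters conveys the mechanism described here: perturb, re-weight by a validation fitness, resample. All names below are illustrative placeholders, not the paper's API; the paper's transition model additionally includes an optional velocity parameter.

    import numpy as np

    def particle_filter_step(particles, weights, fitness, step_std=0.05):
        # particles: (P, D) augmentation-policy parameters
        # fitness:   maps one particle to a positive validation score
        particles = particles + np.random.normal(0.0, step_std,
                                                 particles.shape)
        weights = weights * np.array([fitness(p) for p in particles])
        weights /= weights.sum()
        idx = np.random.choice(len(particles), len(particles), p=weights)
        return particles[idx], np.full(len(particles), 1.0 / len(particles))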

【8】 Revisit Visual Representation in Analytics Taxonomy: A Compression Perspective

Authors: Yueyu Hu, Wenhan Yang, Haofeng Huang, Jiaying Liu
Link: https://arxiv.org/abs/2106.08512
Abstract: Visual analytics have played an increasingly critical role in the Internet of Things, where massive visual signals have to be compressed and fed into machines. But facing such big data and constrained bandwidth capacity, existing image/video compression methods lead to very low-quality representations, while existing feature compression techniques fail to support diversified visual analytics applications/tasks with low-bit-rate representations. In this paper, we raise and study the novel problem of supporting multiple machine vision analytics tasks with a compressed visual representation, namely, the information compression problem in analytics taxonomy. By utilizing the intrinsic transferability among different tasks, our framework successfully constructs compact and expressive representations at low bit rates to support a diversified set of machine vision tasks, including both high-level semantic tasks and mid-level geometric analytic tasks. In order to impose compactness in the representations, we propose a codebook-based hyperprior, which helps map the representation onto a low-dimensional manifold. As it fits the signal structure of the deep visual feature well, it facilitates more accurate entropy estimation and results in higher compression efficiency. With the proposed framework and the codebook-based hyperprior, we further investigate the relationship of different task features owning different levels of abstraction granularity. Experimental results demonstrate that with the proposed scheme, a set of diversified tasks can be supported at a significantly lower bit rate, compared with existing compression schemes.

【9】 Multi-Resolution Continuous Normalizing Flows

Authors: Vikram Voleti, Chris Finlay, Adam Oberman, Christopher Pal
Note: 9 pages, 5 figures, 3 tables, 17 equations
Link: https://arxiv.org/abs/2106.08462
Abstract: Recent work has shown that Neural Ordinary Differential Equations (ODEs) can serve as generative models of images from the perspective of Continuous Normalizing Flows (CNFs). Such models offer exact likelihood calculation and invertible generation/density estimation. In this work we introduce a Multi-Resolution variant of such models (MRCNF), by characterizing the conditional distribution over the additional information required to generate a fine image that is consistent with the coarse image. We introduce a transformation between resolutions that allows for no change in the log likelihood. We show that this approach yields comparable likelihood values for various image datasets, with improved performance at higher resolutions and fewer parameters, using only 1 GPU.

【10】 Seeing Through Clouds in Satellite Images

Authors: Mingmin Zhao, Peder A. Olsen, Ranveer Chandra
Link: https://arxiv.org/abs/2106.08408
Abstract: This paper presents a neural-network-based solution to recover pixels occluded by clouds in satellite images. We leverage radio frequency (RF) signals in the ultra/super-high frequency band that penetrate clouds to help reconstruct the occluded regions in multispectral images. We introduce the first multi-modal, multi-temporal cloud removal model. Our model uses publicly available satellite observations and produces daily cloud-free images. Experimental results show that our system significantly outperforms baselines by 8 dB in PSNR. We also demonstrate use cases of our system in digital agriculture, flood monitoring, and wildfire detection. We will release the processed dataset to facilitate future research.
