Computer Vision and Pattern Recognition arXiv Digest [12.14]

Author: arXiv每日学术速递 (WeChat official account, "arXiv Daily Academic Digest")
Published: 2021-12-17 16:13:40
Collected in column: arXiv每日学术速递

cs.CV: 121 papers in total today

Transformer (7 papers)

【1】 Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network
Link: https://arxiv.org/abs/2112.06624

Authors: Tong Su, Yu Meng, Yan Xu
Affiliations: School of Mechanical Engineering, University of Science and Technology Beijing
Abstract: As a core technology of the autonomous driving system, pedestrian trajectory prediction can significantly enhance the function of active vehicle safety and reduce road traffic injuries. In traffic scenes, when encountering oncoming people, pedestrians may make sudden turns or stop immediately, which often leads to complicated trajectories. To predict such unpredictable trajectories, we can gain insights into the interactions between pedestrians. In this paper, we present a novel generative method named Spatial Interaction Transformer (SIT), which learns the spatio-temporal correlation of pedestrian trajectories through attention mechanisms. Furthermore, we introduce the conditional variational autoencoder (CVAE) framework to model the future latent motion states of pedestrians. In particular, experiments based on the large-scale traffic dataset nuScenes [2] show that SIT outperforms state-of-the-art (SOTA) methods. Experimental evaluation on the challenging ETH and UCY datasets confirms the robustness of our proposed model.
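
The interaction-modeling idea at the heart of SIT — each pedestrian in a scene attending to the others — can be illustrated with a minimal PyTorch sketch. The shapes and the module name are illustrative assumptions, not the authors' code:

import torch
import torch.nn as nn

# Minimal sketch: N pedestrians, each with an embedding of its observed track.
class SpatialInteraction(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ped_emb):  # ped_emb: (1, N, dim), one scene
        # Every pedestrian attends to all others in the scene.
        out, _ = self.attn(ped_emb, ped_emb, ped_emb)
        return out

emb = torch.randn(1, 8, 64)             # 8 pedestrians in one scene
print(SpatialInteraction()(emb).shape)  # torch.Size([1, 8, 64])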

【2】 Embracing Single Stride 3D Object Detector with Sparse Transformer
Link: https://arxiv.org/abs/2112.06375

Authors: Lue Fan, Ziqi Pang, Tianyuan Zhang, Yu-Xiong Wang, Hang Zhao, Feng Wang, Naiyan Wang, Zhaoxiang Zhang
Affiliations: CASIA, UIUC, CMU, THU, TuSimple
Abstract: In LiDAR-based 3D object detection for autonomous driving, the ratio of the object size to input scene size is significantly smaller compared to 2D detection cases. Overlooking this difference, many 3D detectors directly follow the common practice of 2D detectors, which downsample the feature maps even after quantizing the point clouds. In this paper, we start by rethinking how such a multi-stride stereotype affects LiDAR-based 3D object detectors. Our experiments point out that the downsampling operations bring few advantages and lead to inevitable information loss. To remedy this issue, we propose the Single-stride Sparse Transformer (SST) to maintain the original resolution from the beginning to the end of the network. Armed with transformers, our method addresses the problem of insufficient receptive field in single-stride architectures. It also cooperates well with the sparsity of point clouds and naturally avoids expensive computation. Eventually, our SST achieves state-of-the-art results on the large-scale Waymo Open Dataset. It is worth mentioning that our method achieves exciting performance (83.8 LEVEL 1 AP on the validation split) on small-object (pedestrian) detection due to the characteristics of the single stride. Code will be released at https://github.com/TuSimple/SST

【3】 Implicit Transformer Network for Screen Content Image Continuous Super-Resolution
Link: https://arxiv.org/abs/2112.06174

Authors: Jingyu Yang, Sheng Shen, Huanjing Yue, Kun Li
Affiliations: School of Electrical and Information Engineering, Tianjin University; College of Intelligence and Computing, Tianjin University
Note: 24 pages with 3 figures, NeurIPS 2021
Abstract: Nowadays, there is an explosive growth of screen contents due to the wide application of screen sharing, remote cooperation, and online education. To match the limited terminal bandwidth, high-resolution (HR) screen contents may be downsampled and compressed. At the receiver side, the super-resolution (SR) of low-resolution (LR) screen content images (SCIs) is highly demanded by the HR display or by the users to zoom in for detail observation. However, image SR methods mostly designed for natural images do not generalize well for SCIs due to the very different image characteristics as well as the requirement of SCI browsing at arbitrary scales. To this end, we propose a novel Implicit Transformer Super-Resolution Network (ITSRN) for SCI SR. For high-quality continuous SR at arbitrary ratios, pixel values at query coordinates are inferred from image features at key coordinates by the proposed implicit transformer, and an implicit position encoding scheme is proposed to aggregate similar neighboring pixel values into the query one. We construct benchmark SCI1K and SCI1K-compression datasets with LR and HR SCI pairs. Extensive experiments show that the proposed ITSRN significantly outperforms several competitive continuous and discrete SR methods for both compressed and uncompressed SCIs.
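
The implicit-query idea — predicting a pixel value at an arbitrary continuous coordinate from features at nearby key coordinates — can be sketched roughly as follows. This is a toy MLP decoder under assumed shapes, not the paper's implicit transformer:

import torch
import torch.nn as nn

# Toy implicit decoder: the pixel value at a continuous query coordinate is
# predicted from the nearest key feature plus the relative offset.
decoder = nn.Sequential(nn.Linear(64 + 2, 128), nn.ReLU(), nn.Linear(128, 3))

feat = torch.randn(32, 32, 64)   # LR feature map (H, W, C)
q = torch.tensor([10.3, 20.7])   # continuous query coordinate
key = q.round().long()           # nearest key coordinate on the LR grid
offset = q - key.float()         # relative position of the query
rgb = decoder(torch.cat([feat[key[0], key[1]], offset]))
print(rgb.shape)                 # torch.Size([3])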

【4】 Improving Vision Transformers for Incremental Learning
Link: https://arxiv.org/abs/2112.06103

Authors: Pei Yu, Yinpeng Chen, Ying Jin, Zicheng Liu
Affiliations: Microsoft
Abstract: This paper studies using Vision Transformers (ViT) in class incremental learning. Surprisingly, naive application of ViT to replace convolutional neural networks (CNNs) results in performance degradation. Our analysis reveals three issues of naively using ViT: (a) ViT has very slow convergence when the number of classes is small, (b) more bias towards new classes is observed in ViT than in CNN-based models, and (c) the proper learning rate of ViT is too low to learn a good classifier. Based on this analysis, we show these issues can be simply addressed by using existing techniques: a convolutional stem, balanced finetuning to correct bias, and a higher learning rate for the classifier. Our simple solution, named ViTIL (ViT for Incremental Learning), achieves the new state of the art for all three class incremental learning setups by a clear margin, providing a strong baseline for the research community. For instance, on ImageNet-1000, our ViTIL achieves 69.20% top-1 accuracy for the protocol of 500 initial classes with 5 incremental steps (100 new classes each), outperforming LUCIR+DDE by 1.69%. For the more challenging protocol of 10 incremental steps (100 new classes each), our method outperforms PODNet by 7.27% (65.13% vs. 57.86%).
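
One of the three fixes — a higher learning rate for the classifier than for the backbone — amounts to optimizer parameter groups; a sketch with illustrative rates (the specific values and model choice are assumptions, not the paper's settings):

import torch
import torchvision

model = torchvision.models.vit_b_16()  # any ViT with a separate classification head
# Split parameters: everything outside the "heads" classifier is backbone.
backbone = [p for n, p in model.named_parameters() if not n.startswith("heads")]
optimizer = torch.optim.SGD(
    [
        {"params": backbone, "lr": 1e-3},                  # backbone: lower LR
        {"params": model.heads.parameters(), "lr": 1e-1},  # classifier: higher LR
    ],
    momentum=0.9,
)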

【5】 Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition
Link: https://arxiv.org/abs/2112.05851

Authors: Liangfei Zhang, Xiaopeng Hong, Ognjen Arandjelovic, Guoying Zhao
Affiliations: Xiaopeng Hong is with Harbin Institute of Technology
Note: 12 pages, 8 figures
Abstract: Being unconscious and spontaneous, micro-expressions are useful in the inference of a person's true emotions even if an attempt is made to conceal them. Due to their short duration and low intensity, the recognition of micro-expressions is a difficult task in affective computing. The early work based on handcrafted spatio-temporal features showed some promise but has recently been superseded by different deep learning approaches, which now compete for state-of-the-art performance. Nevertheless, the problem of capturing both local and global spatio-temporal patterns remains challenging. To this end, herein we propose a novel spatio-temporal transformer architecture — to the best of our knowledge, the first purely transformer-based approach (i.e., devoid of any convolutional network use) for micro-expression recognition. The architecture comprises a spatial encoder which learns spatial patterns, a temporal aggregator for temporal dimension analysis, and a classification head. A comprehensive evaluation on three widely used spontaneous micro-expression datasets, namely SMIC-HS, CASME II and SAMM, shows that the proposed approach consistently outperforms the state of the art, and is the first framework in the published literature on micro-expression recognition to achieve an unweighted F1-score greater than 0.9 on any of the aforementioned datasets.
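
The three-part layout the abstract describes (a per-frame spatial encoder, a temporal aggregator over frames, and a classification head) can be sketched with standard transformer blocks; dimensions and pooling choices below are illustrative assumptions, not the authors' architecture:

import torch
import torch.nn as nn

dim, frames, patches, classes = 128, 16, 49, 3
spatial = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, batch_first=True), 2)
temporal = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 4, batch_first=True), 2)
head = nn.Linear(dim, classes)

x = torch.randn(frames, patches, dim)     # patch tokens for each frame
x = spatial(x).mean(dim=1)                # spatial encoding per frame -> (frames, dim)
x = temporal(x.unsqueeze(0)).mean(dim=1)  # aggregate over time -> (1, dim)
print(head(x).shape)                      # torch.Size([1, 3])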

【6】 Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection
Link: https://arxiv.org/abs/2112.06759

Authors: Xinjun Zhu, Zhiqiang Han, Mengkai Yuan, Qinghua Guo, Hongyi Wang
Affiliations: School of Artificial Intelligence, Tiangong University, Tianjin, China; School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW, Australia
Abstract: Recently, deep learning has attracted more and more attention in phase unwrapping of fringe projection three-dimensional (3D) measurement, with the aim of improving performance by leveraging powerful Convolutional Neural Network (CNN) models. In this paper, for the first time (to the best of our knowledge), we introduce the Transformer, which differs from CNNs, into phase unwrapping and propose the Hformer model dedicated to phase unwrapping via fringe order prediction. The proposed model has a hybrid CNN-Transformer architecture that is mainly composed of a backbone, an encoder, and a decoder to take advantage of both CNNs and Transformers. The encoder and decoder with cross attention are designed for the fringe order prediction. Experimental results show that the proposed Hformer model achieves better performance in fringe order prediction compared with CNN models such as U-Net and DCNN. Moreover, an ablation study on Hformer is conducted to verify the improved feature pyramid network (FPN) and the testing strategy of flipping the predicted fringe order. Our work opens an alternative path for deep learning-based phase unwrapping methods, which are dominated by CNNs in fringe projection 3D measurement.

【7】 Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks
Link: https://arxiv.org/abs/2112.05761

Authors: Itzik Malkiel, Gony Rosenman, Lior Wolf, Talma Hendler
Affiliations: School of Computer Science, Tel Aviv University; Sagol School of Neuroscience, Tel Aviv University
Abstract: We present the TFF Transformer framework for the analysis of functional Magnetic Resonance Imaging (fMRI) data. TFF employs a transformer-based architecture and a two-phase training approach. First, self-supervised training is applied to a collection of fMRI scans, where the model is trained for the reconstruction of 3D volume data. Second, the pre-trained model is fine-tuned on specific tasks, utilizing ground truth labels. Our results show state-of-the-art performance on a variety of fMRI tasks, including age and gender prediction, as well as schizophrenia recognition.
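
The two-phase recipe (self-supervised reconstruction, then supervised fine-tuning on labels) reduces to swapping the head and the loss on a shared encoder; a toy schematic with stand-in modules and shapes, purely an illustration of the training pattern:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: a shared encoder, a reconstruction head (phase 1), a task head (phase 2).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(8 * 8 * 8, 64), nn.ReLU())
recon_head = nn.Linear(64, 8 * 8 * 8)
task_head = nn.Linear(64, 2)

vol = torch.randn(4, 1, 8, 8, 8)  # batch of tiny stand-in 3D volumes
loss_pretrain = F.mse_loss(recon_head(encoder(vol)), vol.flatten(1))  # phase 1: reconstruct
labels = torch.randint(0, 2, (4,))
loss_finetune = F.cross_entropy(task_head(encoder(vol)), labels)      # phase 2: supervised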

Detection-related (10 papers)

【1】 Anchor Retouching via Model Interaction for Robust Object Detection in Aerial Images
Link: https://arxiv.org/abs/2112.06701

Authors: Dong Liang, Qixiang Geng, Zongqi Wei, Dmitry A. Vorontsov, Ekaterina L. Kim, Mingqiang Wei, Huiyu Zhou
Affiliations: Dmitry A. Vorontsov and Ekaterina L. Kim are with the National Research Lobachevsky State University of Nizhny Novgorod
Abstract: Object detection has made tremendous strides in computer vision. Small object detection with appearance degradation is a prominent challenge, especially for aerial observations. To collect sufficient positive/negative samples for heuristic training, most object detectors preset region anchors in order to calculate Intersection-over-Union (IoU) against the ground-truth data. In this case, small objects are frequently abandoned or mislabeled. In this paper, we present an effective Dynamic Enhancement Anchor (DEA) network to construct a novel training sample generator. Different from other state-of-the-art techniques, the proposed network leverages a sample discriminator to realize interactive sample screening between an anchor-based unit and an anchor-free unit to generate eligible samples. Besides, multi-task joint training with a conservative anchor-based inference scheme enhances the performance of the proposed model while reducing computational complexity. The proposed scheme supports both oriented and horizontal object detection tasks. Extensive experiments on two challenging aerial benchmarks (i.e., DOTA and HRSC2016) indicate that our method achieves state-of-the-art accuracy with moderate inference speed and computational overhead for training. On DOTA, our DEA-Net integrated with the RoI-Transformer baseline surpasses the advanced method by 0.40% mean average precision (mAP) for oriented object detection with a weaker backbone network (ResNet-101 vs. ResNet-152) and by 3.08% mAP for horizontal object detection with the same backbone. Besides, our DEA-Net integrated with the ReDet baseline achieves state-of-the-art performance at 80.37%. On HRSC2016, it surpasses the previous best model by 1.1% using only 3 horizontal anchors.
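
The IoU matching mentioned here — scoring preset anchors against ground-truth boxes — is the standard criterion under which small objects get dropped; for reference, a minimal axis-aligned IoU (DOTA's oriented boxes would need a rotated variant):

def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143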

【2】 Centroid-UNet: Detecting Centroids in Aerial Images
Link: https://arxiv.org/abs/2112.06530

Authors: N. Lakmal Deshapriya, Dan Tran, Sriram Reddy, Kavinda Gunasekara
Affiliations: Geoinformatics Center, Asian Institute of Technology, Klong Luang, Pathumthani, Thailand
Keywords: deep-learning, satellite-imagery, centroids, building-footprint, tree-canopy
Abstract: In many applications of aerial/satellite image analysis (remote sensing), the generation of exact shapes of objects is a cumbersome task. Most remote sensing applications, such as object counting, require only location estimates of objects. Hence, locating object centroids in aerial/satellite images is an easy solution for tasks where the object's exact shape is not necessary. Thus, this study focuses on assessing the feasibility of using deep neural networks for locating object centroids in satellite images. Our model is named Centroid-UNet. The Centroid-UNet model is based on the classic U-Net semantic segmentation architecture. We modified and adapted the U-Net semantic segmentation architecture into a centroid detection model, preserving the simplicity of the original model. Furthermore, we have tested and evaluated our model with two case studies involving aerial/satellite images: a building centroid detection case study and a coconut tree centroid detection case study. Our evaluation reached comparably good accuracy relative to other methods while offering simplicity. The code and models developed under this study are also available in the Centroid-UNet GitHub repository: https://github.com/gicait/centroid-unet
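
The abstract does not spell out how a segmentation network is turned into a centroid detector; one common recipe (an assumption here, not necessarily the paper's) is to regress a Gaussian heatmap centred on each annotated centroid:

import numpy as np

def centroid_heatmap(h, w, centroids, sigma=3.0):
    """One Gaussian blob per centroid; the segmentation net regresses this map."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for cy, cx in centroids:
        blob = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, blob)  # overlapping blobs keep the stronger value
    return heat

target = centroid_heatmap(64, 64, [(20, 20), (40, 50)])
print(target.max())  # 1.0 at each centroid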

【3】 Decoupling Object Detection from Human-Object Interaction Recognition
Link: https://arxiv.org/abs/2112.06392

Authors: Ying Jin, Yinpeng Chen, Lijuan Wang, Jianfeng Wang, Pei Yu, Lin Liang, Jenq-Neng Hwang, Zicheng Liu
Affiliations: Microsoft; University of Washington
Abstract: We propose DEFR, a DEtection-FRee method to recognize Human-Object Interactions (HOI) at the image level without using object location or human pose. This is challenging as the detector is an integral part of existing methods. In this paper, we propose two findings to boost the performance of the detection-free approach, which significantly outperforms the detection-assisted state of the art. First, we find it crucial to effectively leverage the semantic correlations among HOI classes. Remarkable gains can be achieved by using language embeddings of HOI labels to initialize the linear classifier, which encodes the structure of HOIs to guide training. Further, we propose the Log-Sum-Exp Sign (LSE-Sign) loss to facilitate multi-label learning on a long-tailed dataset by balancing gradients over all classes in a softmax format. Our detection-free approach achieves 65.6 mAP in HOI classification on HICO, outperforming the detection-assisted state of the art (SOTA) by 18.5 mAP, and 52.7 mAP in one-shot classes, surpassing the SOTA by 27.3 mAP. Different from previous work, our classification model (DEFR) can be directly used in HOI detection without any additional training, by connecting it to an off-the-shelf object detector whose bounding box output is converted to binary masks for DEFR. Surprisingly, such a simple connection of two decoupled models achieves SOTA performance (32.35 mAP).
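
The first finding — initializing the linear classifier from language embeddings of the HOI labels — can be sketched as follows; the random tensor stands in for whatever text encoder produces the label embeddings, and the normalization is an assumption:

import torch
import torch.nn as nn
import torch.nn.functional as F

num_hoi, text_dim = 600, 512
label_emb = torch.randn(num_hoi, text_dim)  # stand-in for HOI label text embeddings

classifier = nn.Linear(text_dim, num_hoi, bias=False)
with torch.no_grad():
    # Each row of the weight matrix starts as a (normalized) label embedding,
    # so semantically similar HOI classes begin with similar classifiers.
    classifier.weight.copy_(F.normalize(label_emb, dim=-1))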

【4】 Change Detection Meets Visual Question Answering
Link: https://arxiv.org/abs/2112.06343

Authors: Zhenghang Yuan, Lichao Mou, Zhitong Xiong, Xiaoxiang Zhu
Abstract: The Earth's surface is continually changing, and identifying changes plays an important role in urban planning and sustainability. Although change detection techniques have been successfully developed for many years, these techniques are still limited to experts and facilitators in related fields. In order to provide every user with flexible access to change information and help them better understand land-cover changes, we introduce a novel task: change detection-based visual question answering (CDVQA) on multi-temporal aerial images. In particular, multi-temporal images can be queried to obtain high-level change-based information according to content changes between two input images. We first build a CDVQA dataset including multi-temporal image-question-answer triplets using an automatic question-answer generation method. Then, a baseline CDVQA framework is devised in this work; it contains four parts: multi-temporal feature encoding, multi-temporal fusion, multi-modal fusion, and answer prediction. In addition, we introduce a change-enhancing module into multi-temporal feature encoding, aiming at incorporating more change-related information. Finally, the effects of different backbones and multi-temporal fusion strategies on the performance of the CDVQA task are studied. The experimental results provide useful insights for developing better CDVQA models, which is important for future research on this task. We will make our dataset and code publicly available.

【5】 Anomaly Crossing: A New Method for Video Anomaly Detection as Cross-domain Few-shot Learning
Link: https://arxiv.org/abs/2112.06320

Authors: Guangyu Sun, Zhang Liu, Lianggong Wen, Jing Shi, Chenliang Xu
Affiliations: University of Rochester; Corning Inc.
Abstract: Video anomaly detection aims to identify abnormal events that occur in videos. Since anomalous events are relatively rare, it is not feasible to collect a balanced dataset and train a binary classifier to solve the task. Thus, most previous approaches learn only from normal videos using unsupervised or semi-supervised methods. Obviously, they are limited in capturing and utilizing discriminative abnormal characteristics, which leads to compromised anomaly detection performance. In this paper, to address this issue, we propose a new learning paradigm by making full use of both normal and abnormal videos for video anomaly detection. In particular, we formulate a new learning task: cross-domain few-shot anomaly detection, which can transfer knowledge learned from numerous videos in the source domain to help solve few-shot abnormality detection in the target domain. Concretely, we leverage self-supervised training on the target normal videos to reduce the domain gap and devise a meta context perception module to explore the video context of the event in the few-shot setting. Our experiments show that our method significantly outperforms baseline methods on the DoTA and UCF-Crime datasets, and the new task contributes to a more practical training paradigm for anomaly detection.

【6】 Few-shot Keypoint Detection with Uncertainty Learning for Unseen Species
Link: https://arxiv.org/abs/2112.06183

Authors: Changsheng Lu, Piotr Koniusz
Affiliations: The Australian National University; Data61, CSIRO
Note: 8 pages for main paper, 6 pages for supplementary materials
Abstract: Current non-rigid object keypoint detectors perform well on a chosen kind of species and body parts, and require a large amount of labelled keypoints for training. Moreover, their heatmaps, tailored to specific body parts, cannot recognize novel keypoints (keypoints not labelled for training) on unseen species. We raise an interesting yet challenging question: how to detect both base (annotated for training) and novel keypoints for unseen species given a few annotated samples? Thus, we propose a versatile Few-shot Keypoint Detection (FSKD) pipeline, which can detect a varying number of keypoints of different kinds. Our FSKD provides uncertainty estimation of the predicted keypoints. Specifically, FSKD involves main and auxiliary keypoint representation learning, similarity learning, and keypoint localization with uncertainty modeling to tackle the localization noise. Moreover, we model the uncertainty across groups of keypoints by a multivariate Gaussian distribution to exploit implicit correlations between neighboring keypoints. We show the effectiveness of our FSKD on (i) novel keypoint detection for unseen species, and the (ii) few-shot Fine-Grained Visual Recognition (FGVR) and (iii) Semantic Alignment (SA) downstream tasks. For FGVR, detected keypoints improve classification accuracy. For SA, we showcase a novel thin-plate-spline warping that uses estimated keypoint uncertainty under imperfect keypoint correspondences.

【7】 Synthetic Map Generation to Provide Unlimited Training Data for Historical Map Text Detection
Link: https://arxiv.org/abs/2112.06104

Authors: Zekun Li, Runyu Guan, Qianmu Yu, Yao-Yi Chiang, Craig A. Knoblock
Affiliations: University of Minnesota, Minneapolis, USA; University of Southern California, Los Angeles, USA
Abstract: Many historical map sheets are publicly available for studies that require long-term historical geographic data. The cartographic design of these maps includes a combination of map symbols and text labels. Automatically reading text labels from map images could greatly speed up map interpretation and help generate rich metadata describing the map content. Many text detection algorithms have been proposed to locate text regions in map images automatically, but most of the algorithms are trained on out-of-domain datasets (e.g., scenic images). Training data determines the quality of machine learning models, and manually annotating text regions in map images is labor-intensive and time-consuming. On the other hand, existing geographic data sources, such as OpenStreetMap (OSM), contain machine-readable map layers, which allow us to separate out the text layer and obtain text label annotations easily. However, the cartographic styles of OSM map tiles and historical maps are significantly different. This paper proposes a method to automatically generate an unlimited amount of annotated historical map images for training text detection models. We use a style transfer model to convert contemporary map images into a historical style and place text labels upon them. We show that state-of-the-art text detection models (e.g., PSENet) can benefit from the synthetic historical maps and achieve significant improvement for historical map text detection.

【8】 NeuroHSMD: Neuromorphic Hybrid Spiking Motion Detector
Link: https://arxiv.org/abs/2112.06102

Authors: Pedro Machado, Eiman Kanjo, Andreas Oikonomou, Ahmad Lotfi
Affiliations: Department of Computer Science, Clifton Lane, Nottingham, United Kingdom
Abstract: Vertebrate retinas are highly efficient at processing trivial visual tasks such as detecting moving objects, which remain complex tasks for modern computers. The detection of object motion is done by specialised retinal ganglion cells named object-motion-sensitive ganglion cells (OMS-GC). OMS-GC process continuous signals and generate spike patterns that are post-processed by the visual cortex. The Neuromorphic Hybrid Spiking Motion Detector (NeuroHSMD) proposed in this work accelerates the HSMD algorithm using Field-Programmable Gate Arrays (FPGAs). The Hybrid Spiking Motion Detector (HSMD) algorithm was the first hybrid algorithm to enhance dynamic background subtraction (DBS) algorithms with a customised 3-layer spiking neural network (SNN) that generates OMS-GC spiking-like responses. The NeuroHSMD algorithm was compared against the HSMD algorithm using the same 2012 change detection (CDnet2012) and 2014 change detection (CDnet2014) benchmark datasets. The results show that the NeuroHSMD produces the same results as the HSMD algorithm in real time without degradation of quality. Moreover, the NeuroHSMD proposed in this paper was completely implemented in Open Computing Language (OpenCL) and is therefore easily replicated on other devices such as Graphics Processing Units (GPUs) and clusters of Central Processing Units (CPUs).

【9】 Guided Generative Models using Weak Supervision for Detecting Object Spatial Arrangement in Overhead Images
Link: https://arxiv.org/abs/2112.05786

Authors: Weiwei Duan, Yao-Yi Chiang, Stefan Leyk, Johannes H. Uhl, Craig A. Knoblock
Affiliations: University of Southern California; University of Minnesota; University of Colorado Boulder
Abstract: The increasing availability and accessibility of numerous overhead images allow us to estimate and assess the spatial arrangement of groups of geospatial target objects, which can benefit many applications, such as traffic monitoring and agricultural monitoring. Spatial arrangement estimation is the process of identifying the areas which contain the desired objects in overhead images. Traditional supervised object detection approaches can estimate accurate spatial arrangement but require large amounts of bounding box annotations. Recent semi-supervised clustering approaches can reduce manual labeling but still require annotations for all object categories in the image. This paper presents the target-guided generative model (TGGM), under the Variational Auto-encoder (VAE) framework, which uses Gaussian Mixture Models (GMM) to estimate the distributions of both hidden and decoder variables in the VAE. Modeling both hidden and decoder variables by GMM reduces the required manual annotations significantly for spatial arrangement estimation. Unlike existing approaches, in which the training process can only update the GMM as a whole in each optimization iteration (e.g., a "minibatch"), TGGM allows individual GMM components to be updated separately in the same optimization iteration. Optimizing GMM components separately allows TGGM to exploit the semantic relationships in spatial data and requires only a few labels to initiate and guide the generative process. Our experiments show that TGGM achieves results comparable to the state-of-the-art semi-supervised methods and outperforms unsupervised methods by 10% based on the $F_{1}$ scores, while requiring significantly fewer labeled data.

【10】 Two New Stenosis Detection Methods of Coronary Angiograms
Link: https://arxiv.org/abs/2112.06149

Authors: Yaofang Liu, Xinyue Zhang, Wenlong Wan, Shaoyu Liu, Yingdi Liu, Hu Liu, Xueying Zeng, Qing Zhang
Note: Correspondence should be addressed to Qing Zhang. arXiv admin note: substantial text overlap with arXiv:2108.01516
Abstract: Coronary angiography is the "gold standard" for diagnosing coronary artery disease (CAD). At present, methods for detecting and evaluating coronary artery stenosis cannot satisfy clinical needs; e.g., there is no prior study of detecting stenoses in prespecified vessel segments, which is necessary in clinical practice. Two vascular stenosis detection methods are proposed to assist the diagnosis. The first is an automatic method, which can automatically extract the entire coronary artery tree and mark all possible stenoses. The second is an interactive method. With this method, the user can choose any vessel segment for further analysis of its stenoses. Experiments show that the proposed methods are robust for angiograms with various vessel structures. The precision, sensitivity, and $F_1$ score of the automatic stenosis detection method are 0.821, 0.757, and 0.788, respectively. Further investigation proves that the interactive method can provide a more precise outcome of stenosis detection, and our quantitative analysis is closer to reality. The proposed automatic and interactive methods are effective and can complement each other in clinical practice. The first method can be used for preliminary screening, and the second for further quantitative analysis. We believe the proposed solution is more suitable for the clinical diagnosis of CAD.
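
As a sanity check, the reported $F_1$ score follows from the reported precision and sensitivity (recall) via $F_1 = 2PR/(P+R)$:

p, r = 0.821, 0.757
print(2 * p * r / (p + r))  # ≈ 0.788, matching the reported F1 score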

Classification & Recognition (13 papers)

【1】 Tracking and Long-Term Identification Using Non-Visual Markers
Link: https://arxiv.org/abs/2112.06809

Authors: Michael P. J. Camilleri, Li Zhang, Andrew Zisserman, Christopher K. I. Williams
Affiliations: School of Informatics, University of Edinburgh; School of Data Science, Fudan University; Dept. of Engineering Science, University of Oxford
Abstract: Our objective is to track and identify mice in a cluttered home-cage environment, as a precursor to automated behaviour recognition for biological research. This is a very challenging problem due to (i) the lack of distinguishing visual features for each mouse, and (ii) the close confines of the scene with constant occlusion, making standard visual tracking approaches unusable. However, a coarse estimate of each mouse's location is available from a unique RFID implant, so there is the potential to optimally combine information from (weak) tracking with coarse information on identity. To achieve our objective, we make the following key contributions: (a) the formulation of the identification problem as an assignment problem (solved using Integer Linear Programming), and (b) a novel probabilistic model of the affinity between tracklets and RFID data. The latter is a crucial part of the model, as it provides a principled probabilistic treatment of object detections given coarse localisation. Our approach achieves 77% accuracy on this identification problem, and is able to reject spurious detections when the animals are hidden.
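
Formulating identification as an assignment problem means, in the simplest balanced case, choosing the tracklet-to-RFID matching that maximizes total affinity. The Hungarian-style solver in SciPy illustrates the idea (the paper uses Integer Linear Programming, which also accommodates its more general constraints; the affinity values below are made up):

import numpy as np
from scipy.optimize import linear_sum_assignment

# affinity[i, j]: probability-like score that tracklet i belongs to RFID tag j
affinity = np.array([[0.9, 0.1, 0.0],
                     [0.2, 0.7, 0.1],
                     [0.0, 0.3, 0.6]])
rows, cols = linear_sum_assignment(-affinity)  # negate to maximize total affinity
print(list(zip(rows, cols)))                   # optimal matching: tracklet i <-> tag i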

【2】 A Survey of Unsupervised Domain Adaptation for Visual Recognition
Link: https://arxiv.org/abs/2112.06745

Authors: Youshan Zhang
Affiliations: Computer Science and Engineering, Lehigh University, USA
Abstract: While huge volumes of unlabeled data are generated and made available in many domains, the demand for automated understanding of visual data is higher than ever before. Most existing machine learning models typically rely on massive amounts of labeled training data to achieve high performance. Unfortunately, such a requirement cannot be met in real-world applications. The number of labels is limited, and manually annotating data is expensive and time-consuming. It is often necessary to transfer knowledge from an existing labeled domain to a new domain. However, model performance degrades because of the differences between domains (domain shift or dataset bias). To overcome the burden of annotation, Domain Adaptation (DA) aims to mitigate the domain shift problem when transferring knowledge from one domain into another similar but different domain. Unsupervised DA (UDA) deals with a labeled source domain and an unlabeled target domain. The principal objective of UDA is to reduce the domain discrepancy between the labeled source data and unlabeled target data and to learn domain-invariant representations across the two domains during training. In this paper, we first define the UDA problem. Secondly, we overview state-of-the-art methods for different categories of UDA, covering both traditional and deep learning-based methods. Finally, we collect frequently used benchmark datasets and report the results of state-of-the-art UDA methods on the visual recognition problem.

【3】 Long-tail Recognition via Compositional Knowledge Transfer
Link: https://arxiv.org/abs/2112.06741

Authors: Sarah Parisot, Pedro M. Esperança, Steven McDonagh, Tamas J. Madarasz, Yongxin Yang, Zhenguo Li
Affiliations: Huawei Noah's Ark Lab
Abstract: In this work, we introduce a novel strategy for long-tail recognition that addresses the tail classes' few-shot problem via training-free knowledge transfer. Our objective is to transfer knowledge acquired from information-rich common classes to semantically similar, and yet data-hungry, rare classes in order to obtain stronger tail class representations. We leverage the fact that class prototypes and learned cosine classifiers provide two different, complementary representations of class cluster centres in feature space, and use an attention mechanism to select and recompose learned classifier features from common classes to obtain higher-quality rare class representations. Our knowledge transfer process is training-free, reducing overfitting risks, and can afford continual extension of classifiers to new classes. Experiments show that our approach can achieve significant performance boosts on rare classes while maintaining robust common class performance, outperforming directly comparable state-of-the-art models.

【4】 Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation
Link: https://arxiv.org/abs/2112.06632

Authors: Zhipeng Huang, Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Peng Chu, Quanzeng You, Jiang Wang, Zicheng Liu, Zheng-Jun Zha
Affiliations: University of Science and Technology of China; Microsoft; EIT Institute for Advanced Study
Abstract: Unsupervised domain adaptive person re-identification (ReID) has been extensively investigated to mitigate the adverse effects of domain gaps. These works assume the target domain data is accessible all at once. However, for real-world streaming data, this hinders timely adaptation to changing data statistics and sufficient exploitation of increasing samples. In this paper, to address more practical scenarios, we propose a new task, Lifelong Unsupervised Domain Adaptive (LUDA) person ReID. This is challenging because it requires the model to continuously adapt to unlabeled data of the target environments while alleviating catastrophic forgetting for such a fine-grained person retrieval task. We design an effective scheme for this task, dubbed CLUDA-ReID, where the anti-forgetting is harmoniously coordinated with the adaptation. Specifically, a meta-based Coordinated Data Replay strategy is proposed to replay old data and update the network with a coordinated optimization direction for both adaptation and memorization. Moreover, we propose Relational Consistency Learning for old knowledge distillation/inheritance in line with the objective of retrieval-based tasks. We set up two evaluation settings to simulate practical application scenarios. Extensive experiments demonstrate the effectiveness of our CLUDA-ReID for both scenarios with stationary target streams and scenarios with dynamic target streams.

【5】 MinkLoc3D-SI: 3D LiDAR place recognition with sparse convolutions, spherical coordinates, and intensity
Link: https://arxiv.org/abs/2112.06539

Authors: Kamil Żywanowski, Adam Banaszczyk, Michał R. Nowicki, Jacek Komorowski
Abstract: 3D LiDAR place recognition aims to estimate a coarse localization in a previously seen environment based on a single scan from a rotating 3D LiDAR sensor. The existing solutions to this problem include hand-crafted point cloud descriptors (e.g., ScanContext, M2DP, LiDAR IRIS) and deep learning-based solutions (e.g., PointNetVLAD, PCAN, LPD-Net, DAGC, MinkLoc3D), which are often only evaluated on accumulated 2D scans from the Oxford RobotCar dataset. We introduce MinkLoc3D-SI, a sparse convolution-based solution that utilizes spherical coordinates of 3D points and processes the intensity of 3D LiDAR measurements, improving performance when a single 3D LiDAR scan is used. Our method integrates the improvements typical for hand-crafted descriptors (like ScanContext) with the most efficient 3D sparse convolutions (MinkLoc3D). Our experiments show improved results on single scans from 3D LiDARs (USyd Campus dataset) and great generalization ability (KITTI dataset). Using intensity information on accumulated 2D scans (RobotCar Intensity dataset) improves the performance, even though the spherical representation does not produce a noticeable improvement there. As a result, MinkLoc3D-SI is suited for single scans obtained from a 3D LiDAR, making it applicable in autonomous vehicles.
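
The spherical-coordinate representation the method builds on is the standard conversion from Cartesian LiDAR points; a quick sketch:

import numpy as np

def to_spherical(xyz):
    """Convert (N, 3) Cartesian LiDAR points to (range, azimuth, elevation)."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.linalg.norm(xyz, axis=1)
    azimuth = np.arctan2(y, x)
    elevation = np.arcsin(z / np.maximum(r, 1e-9))  # guard against r == 0
    return np.stack([r, azimuth, elevation], axis=1)

pts = np.array([[10.0, 0.0, 1.0]])
print(to_spherical(pts))  # range ≈ 10.05, azimuth 0, small positive elevation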

【6】 Makeup216: Logo Recognition with Adversarial Attention Representations
Link: https://arxiv.org/abs/2112.06533

Authors: Junjun Hu, Yanhao Zhu, Bo Zhao, Jiexin Zheng, Chenxu Zhao, Xiangyu Zhu, Kangle Wu, Darun Tang
Affiliations: Mininglamp Academy of Sciences, Mininglamp Technology; Chang'an University; Institute of Automation, Chinese Academy of Sciences
Abstract: One of the challenges of logo recognition lies in the diversity of forms, such as symbols, texts, or a combination of both; further, logos tend to be extremely concise in design while similar in appearance, suggesting the difficulty of learning discriminative representations. To investigate the variety and representation of logos, we introduce Makeup216, the largest and most complex logo dataset in the field of makeup, captured from the real world. It comprises 216 logos and 157 brands, including 10,019 images and 37,018 annotated logo objects. In addition, we found that the marginal background around the pure logo can provide important context information, and propose an adversarial attention representation framework (AAR) to attend to the logo subject and the auxiliary marginal background separately; the two can be combined for a better representation. Our proposed framework achieved competitive results on Makeup216 and another large-scale open logo dataset, which could provide fresh thinking for logo recognition. The Makeup216 dataset and the code of the proposed framework will be released soon.

【7】 Real Time Action Recognition from Video Footage
Link: https://arxiv.org/abs/2112.06456

Authors: Tasnim Sakib Apon, Mushfiqul Islam Chowdhury, MD Zubair Reza, Arpita Datta, Syeda Tanjina Hasan, MD. Golam Rabiul Alam
Affiliations: Dept. of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Abstract: The crime rate is increasing proportionally with population growth. The most prominent approach to tackle the issue has been to introduce Closed-Circuit Television (CCTV) camera-based surveillance. Video surveillance cameras have added a new dimension to crime detection. Several research works on autonomous security camera surveillance are currently ongoing, where the fundamental goal is to discover violent activity from video feeds. From a technical viewpoint, this is a challenging problem because analyzing a set of frames (i.e., video in the temporal dimension) to detect violence may require careful machine learning model training to reduce false results. This research focuses on this problem by integrating state-of-the-art deep learning methods to ensure a robust pipeline for autonomous surveillance for detecting violent activities, e.g., kicking, punching, and slapping. We first designed a dataset for this specific task, which contains 600 videos (200 for each action). We then utilized existing pre-trained model architectures to extract features and used a deep learning network for classification. We also report our models' accuracy and confusion matrices across different pre-trained architectures, including VGG16, InceptionV3, ResNet50, Xception, and MobileNet V2, among which VGG16 and MobileNet V2 performed best.
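
The pipeline described — an ImageNet-pretrained backbone as a per-frame feature extractor, followed by a classifier — looks roughly like this. The backbone choice, preprocessing, and pooling are assumptions for illustration, not the paper's exact setup:

import torch
import torchvision

# pretrained=True is the older torchvision API; newer versions use weights=...
backbone = torchvision.models.mobilenet_v2(pretrained=True).features.eval()
frames = torch.randn(16, 3, 224, 224)          # 16 frames sampled from one clip
with torch.no_grad():
    feats = backbone(frames).mean(dim=[2, 3])  # (16, 1280) per-frame features
# A temporal model (e.g., an LSTM) or pooling plus dense layers then classifies the clip.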

【8】 Self-Supervised Modality-Aware Multiple Granularity Pre-Training for RGB-Infrared Person Re-Identification
Link: https://arxiv.org/abs/2112.06147

Authors: Lin Wan, Qianyan Jing, Zongyuan Sun, Chuang Zhang, Zhihang Li, Yehansen Chen
Affiliations: School of Geography and Information Engineering, China University of Geosciences (Wuhan); School of Computer Science and Engineering, Nanjing University of Science and Technology; School of Artificial Intelligence, University of Chinese Academy of Sciences
Note: 7 pages, 2 figures
Abstract: While RGB-Infrared cross-modality person re-identification (RGB-IR ReID) has enabled great progress in 24-hour intelligent surveillance, the state of the art still heavily relies on fine-tuning ImageNet pre-trained networks. Due to the single-modality nature, such large-scale pre-training may yield RGB-biased representations that hinder the performance of cross-modality image retrieval. This paper presents a self-supervised pre-training alternative, named Modality-Aware Multiple Granularity Learning (MMGL), which trains models from scratch directly on multi-modality ReID datasets, yet achieves competitive results without external data and sophisticated tuning tricks. Specifically, MMGL globally maps shuffled RGB-IR images into a shared latent permutation space and further improves local discriminability by maximizing agreement between cycle-consistent RGB-IR image patches. Experiments demonstrate that MMGL learns better representations (+6.47% Rank-1) with faster training speed (converging in a few hours) and higher data efficiency (<5% of the data size) than ImageNet pre-training. The results also suggest that MMGL generalizes well to various existing models and losses and has promising transferability across datasets. The code will be released.

【9】 On Automatic Data Augmentation for 3D Point Cloud Classification
Link: https://arxiv.org/abs/2112.06029

Authors: Wanyue Zhang, Xun Xu, Fayao Liu, Le Zhang, Chuan-Sheng Foo
Affiliations: Institute for Infocomm Research, A*STAR, Singapore
Note: BMVC 2021
Abstract: Data augmentation is an important technique to reduce overfitting and improve learning performance, but existing works on data augmentation for 3D point cloud data are based on heuristics. In this work, we instead propose to automatically learn a data augmentation strategy using bilevel optimization. An augmentor is designed in a similar fashion to a conditional generator and is optimized by minimizing a base model's loss on a validation set when the augmented input is used for training the model. This formulation provides a more principled way to learn data augmentation on 3D point clouds. We evaluate our approach on standard point cloud classification tasks and a more challenging setting with pose misalignment between training and validation/test sets. The proposed strategy achieves competitive performance on both tasks, and we provide further insight into the augmentor's ability to learn the validation set distribution.

【10】 You Only Need End-to-End Training for Long-Tailed Recognition
Link: https://arxiv.org/abs/2112.05958

Authors: Zhiwei Zhang, Hongsheng Li
Affiliations: CPII, The Chinese University of Hong Kong; Centre for Perceptual and Interactive Intelligence
Note: 16 pages, 11 figures, 8 tables
Abstract: The generalization gap on long-tailed datasets is largely owing to most categories occupying only a few training samples. Decoupled training achieves better performance by training the backbone and classifier separately. What causes the poorer performance of end-to-end model training (e.g., logits margin-based methods)? In this work, we identify a key factor that affects the learning of the classifier: channel-correlated features with low entropy fed into the classifier. From the perspective of information theory, we analyze why cross-entropy loss tends to produce highly correlated features on imbalanced data. In addition, we theoretically analyze and prove its impacts on the gradients of classifier weights, the condition number of the Hessian, and the logits margin-based approach. Therefore, we first propose to use Channel Whitening to decorrelate ("scatter") the classifier's inputs for decoupling the weight update and reshaping the skewed decision boundary, which achieves satisfactory results when combined with the logits margin-based method. However, when the number of minor classes is large, batch imbalance and greater participation in training cause overfitting of the major classes. We also propose two novel modules, the Block-based Relatively Balanced Batch Sampler (B3RS) and Batch Embedded Training (BET), to solve the above problems, which makes end-to-end training achieve even better performance than decoupled training. Experimental results on the long-tailed classification benchmarks CIFAR-LT and ImageNet-LT demonstrate the effectiveness of our method.
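
Channel whitening — decorrelating the classifier's input features — can be realized with a ZCA-style transform of the batch features. The following is a minimal sketch of one plausible instantiation; the paper's exact operator may differ:

import torch

def whiten(feats, eps=1e-5):
    """ZCA-whiten (N, C) features so channels are decorrelated with unit variance."""
    centered = feats - feats.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (feats.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    w = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T
    return centered @ w

x = torch.randn(256, 64) @ torch.randn(64, 64)   # features with correlated channels
print(torch.cov(whiten(x).T).diagonal().mean())  # ≈ 1; off-diagonals ≈ 0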

【11】 A Discriminative Channel Diversification Network for Image Classification
Link: https://arxiv.org/abs/2112.05861

Authors: Krushi Patel, Guanghui Wang
Affiliations: Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS, USA; Department of Computer Science, Ryerson University, Toronto, ON, Canada
Abstract: Channel attention mechanisms in convolutional neural networks have been proven to be effective in various computer vision tasks. However, the performance improvement comes with additional model complexity and computation cost. In this paper, we propose a lightweight and effective attention module, called the channel diversification block, to enhance the global context by establishing the channel relationship at the global level. Unlike other channel attention mechanisms, the proposed module focuses on the most discriminative features by giving more attention to the spatially distinguishable channels while taking the channel activation into account. Different from other attention models that plug the module in between several intermediate layers, the proposed module is embedded at the end of the backbone networks, making it easy to implement. Extensive experiments on the CIFAR-10, SVHN, and Tiny-ImageNet datasets demonstrate that the proposed module improves the performance of the baseline networks by a margin of 3% on average.

【12】 Quality-Aware Multimodal Biometric Recognition 标题:质量感知的多模式生物特征识别 链接:https://arxiv.org/abs/2112.05827

作者:Sobhan Soleymani,Ali Dabouei,Fariborz Taherkhani,Seyed Mehdi Iranmanesh,Jeremy Dawson,Nasser M. Nasrabadi 机构:Lane Department of Computer Science and Electrical Engineering, West Virginia University 备注:IEEE Transactions on Biometrics, Behavior, and Identity Science 摘要:我们提出了一个质量感知的多模态识别框架,该框架将来自多种生物特征、质量和样本数量各异的表示相结合,根据样本质量提取互补的识别信息,从而提高识别精度。我们开发了一个质量感知框架,通过以弱监督方式估计的质量分数对各输入模态表示的重要性进行加权来融合它们。该框架使用两个融合块,每个融合块由一组质量感知网络和聚合网络构成。除了架构上的修改,我们还提出了两个特定于任务的损失函数:多模态可分性损失和多模态紧致性损失。第一个损失确保同一类别各模态的表示具有可比的量级,从而提供更好的质量估计,同时使不同类别的多模态表示分散分布,以在嵌入空间中实现最大区分度。第二个损失用于正则化网络权重,通过对框架施加正则化来提高泛化性能。我们在由人脸、虹膜和指纹模态组成的三个多模态数据集上评估了性能,并通过与最新算法的比较证明了该框架的有效性。特别是,在错误接受率为$10^{-4}$时,我们的框架在真实接受率上比BIOMDATA各模态的等级级与分数级融合高出30%以上。 摘要:We present a quality-aware multimodal recognition framework that combines representations from multiple biometric traits with varying quality and number of samples to achieve increased recognition accuracy by extracting complementary identification information based on the quality of the samples. We develop a quality-aware framework for fusing representations of input modalities by weighting their importance using quality scores estimated in a weakly-supervised fashion. This framework utilizes two fusion blocks, each represented by a set of quality-aware and aggregation networks. In addition to architecture modifications, we propose two task-specific loss functions: multimodal separability loss and multimodal compactness loss. The first loss assures that the representations of modalities for a class have comparable magnitudes to provide a better quality estimation, while the multimodal representations of different classes are distributed to achieve maximum discrimination in the embedding space. The second loss, which is considered to regularize the network weights, improves the generalization performance by regularizing the framework. We evaluate the performance by considering three multimodal datasets consisting of face, iris, and fingerprint modalities. The efficacy of the framework is demonstrated through comparison with the state-of-the-art algorithms. In particular, our framework outperforms the rank- and score-level fusion of modalities of BIOMDATA by more than 30% for true acceptance rate at false acceptance rate of $10^{-4}$.
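以下给出"按质量分数加权融合多模态表示"这一核心思想的PyTorch示意。质量头的结构与softmax归一化均为本文的简化假设,原文的两级融合块与弱监督质量估计要复杂得多:

```python
import torch
import torch.nn as nn

class QualityWeightedFusion(nn.Module):
    """按估计的质量分数加权聚合多模态嵌入(简化示意,非原文网络结构)。"""

    def __init__(self, dim: int):
        super().__init__()
        self.quality_head = nn.Linear(dim, 1)   # 为每个模态嵌入估计一个标量质量分数

    def forward(self, feats):                   # feats: 模态嵌入列表,每个为 (N, D)
        x = torch.stack(feats, dim=1)           # (N, M, D)
        scores = self.quality_head(x).squeeze(-1)       # (N, M)
        weights = torch.softmax(scores, dim=1)          # 质量分数归一化为融合权重
        return (weights.unsqueeze(-1) * x).sum(dim=1)   # (N, D) 加权聚合

# 用法示意:人脸/虹膜/指纹三个模态的 256 维嵌入
fusion = QualityWeightedFusion(dim=256)
fused = fusion([torch.randn(8, 256) for _ in range(3)])
```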

【13】 A Label Correction Algorithm Using Prior Information for Automatic and Accurate Geospatial Object Recognition 标题:一种基于先验信息的地理空间目标自动准确识别标签校正算法 链接:https://arxiv.org/abs/2112.05794

作者:Weiwei Duan,Yao-Yi Chiang,Stefan Leyk,Johannes H. Uhl,Craig A. Knoblock 机构:University of Southern California, University of Minnesota, University of Colorado Boulder 摘要:数千张扫描的历史地形图包含有价值的信息,涵盖了很长一段时间,例如一个地区的水文是如何随时间变化的。有效地解锁这些地图中的信息需要训练地理空间对象识别系统,该系统需要大量带注释的数据。根据坐标将地理参考外部矢量数据与地形图重叠,可以自动标注所需对象在地图中的位置。但是,由于地形图的发布年份和坐标投影系统与外部矢量数据不同,直接重叠两个数据集会导致注释不对齐和错误。我们提出了一种标签校正算法,该算法利用地图的颜色信息和外部矢量数据的先验形状信息来减少未对齐和错误标注。实验表明,该算法的标注精度比现有算法的标注精度高10%。因此,使用该算法注释的识别结果比使用最新算法注释的正确率高9%。 摘要:Thousands of scanned historical topographic maps contain valuable information covering long periods of time, such as how the hydrography of a region has changed over time. Efficiently unlocking the information in these maps requires training a geospatial objects recognition system, which needs a large amount of annotated data. Overlapping geo-referenced external vector data with topographic maps according to their coordinates can annotate the desired objects' locations in the maps automatically. However, directly overlapping the two datasets causes misaligned and false annotations because the publication years and coordinate projection systems of topographic maps are different from the external vector data. We propose a label correction algorithm, which leverages the color information of maps and the prior shape information of the external vector data to reduce misaligned and false annotations. The experiments show that the precision of annotations from the proposed algorithm is 10% higher than the annotations from a state-of-the-art algorithm. Consequently, recognition results using the proposed algorithm's annotations achieve 9% higher correctness than using the annotations from the state-of-the-art algorithm.

分割|语义相关(12篇)

【1】 Learning Semantic-Aligned Feature Representation for Text-based Person Search 标题:用于基于文本的人物搜索的学习语义对齐特征表示 链接:https://arxiv.org/abs/2112.06714

作者:Shiping Li,Min Cao,Min Zhang 机构:School of Computer Science and Technology, Soochow University, China 备注:5 pages, 3 figures, 3 tables 摘要:基于文本的人物搜索旨在通过文本描述检索特定行人的图像。这项任务的关键挑战是消除模态间的差距,实现跨模态的特征对齐。在本文中,我们提出了一种用于基于文本的人物搜索的语义对齐嵌入方法,通过自动学习语义对齐的视觉特征和文本特征来实现跨模态的特征对齐。首先,我们引入两个基于Transformer的主干来编码图像和文本的鲁棒特征表示。其次,我们设计了一个语义对齐的特征聚合网络,自适应地选择具有相同语义的特征并将其聚合为部件感知特征,该网络由一个受跨模态部件对齐损失和多样性损失约束的多头注意模块实现。在CUHK-PEDES和Flickr30K数据集上的实验结果表明,我们的方法达到了最先进的性能。 摘要:Text-based person search aims to retrieve images of a certain pedestrian by a textual description. The key challenge of this task is to eliminate the inter-modality gap and achieve the feature alignment across modalities. In this paper, we propose a semantic-aligned embedding method for text-based person search, in which the feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features. First, we introduce two Transformer-based backbones to encode robust feature representations of the images and texts. Second, we design a semantic-aligned feature aggregation network to adaptively select and aggregate features with the same semantics into part-aware features, which is achieved by a multi-head attention module constrained by a cross-modality part alignment loss and a diversity loss. Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performances.

【2】 Ensemble CNN Networks for GBM Tumors Segmentation using Multi-parametric MRI 标题:基于多参数MRI的胶质母细胞瘤(GBM)分割的集成CNN网络 链接:https://arxiv.org/abs/2112.06554

作者:Ramy A. Zeineldin,Mohamed E. Karar,Franziska Mathis-Ullrich,Oliver Burgert 机构:Research Group Computer Assisted Medicine (CaMed), Reutlingen University, Germany, Menoufia University, Egypt, Health Robotics and Automation (HERA), Karlsruhe Institute of Technology, Germany 备注:Submitted to BraTS 2021 (as part of the BrainLes workshop proceedings distributed by Springer LNCS) 摘要:胶质母细胞瘤是最具侵袭性、生长最快的原发性脑癌,起源于大脑的胶质细胞。准确识别恶性脑肿瘤及其子区域仍然是医学图像分割中最具挑战性的问题之一。脑肿瘤分割挑战赛(Brain Tumor Segmentation Challenge,BraTS)自创办以来一直是脑胶质母细胞瘤自动分割算法的热门基准。在今年的挑战中,BraTS 2021提供了迄今最大的多参数MRI(mpMRI)数据集,包含2000名术前患者。在本文中,我们提出了一种融合DeepSeg和nnU-Net这两种深度学习框架的新集成方法,用于术前mpMRI中胶质母细胞瘤的自动识别。在BraTS 2021验证集上,我们的集成方法在增强肿瘤、肿瘤核心和整个肿瘤区域上分别获得92.00、87.33和84.10的Dice相似性得分,以及3.81、8.91和16.02的Hausdorff距离。这些实验结果表明,该方法可以很容易地应用于临床,从而有助于脑癌的预后、治疗计划和治疗反应监测。 摘要:Glioblastomas are the most aggressive fast-growing primary brain cancer which originate in the glial cells of the brain. Accurate identification of the malignant brain tumor and its sub-regions is still one of the most challenging problems in medical image segmentation. The Brain Tumor Segmentation Challenge (BraTS) has been a popular benchmark for automatic brain glioblastomas segmentation algorithms since its initiation. In this year's challenge, BraTS 2021 provides the largest multi-parametric (mpMRI) dataset of 2,000 pre-operative patients. In this paper, we propose a new aggregation of two deep learning frameworks namely, DeepSeg and nnU-Net for automatic glioblastoma recognition in pre-operative mpMRI. Our ensemble method obtains Dice similarity scores of 92.00, 87.33, and 84.10 and Hausdorff Distances of 3.81, 8.91, and 16.02 for the enhancing tumor, tumor core, and whole tumor regions on the BraTS 2021 validation set, individually. These experimental findings provide evidence that it can be readily applied clinically and thereby aiding in the brain cancer prognosis, therapy planning, and therapy response monitoring.
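摘要中报告的Dice相似系数与两模型集成可用如下PyTorch代码示意。此处的概率图平均只是最直接的一种聚合方式,原文中DeepSeg与nnU-Net的具体聚合策略以论文为准:

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """二值分割的 Dice 相似系数:2|X∩Y| / (|X| + |Y|)。"""
    pred, target = pred.float().flatten(), target.float().flatten()
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def ensemble_predict(prob_a: torch.Tensor, prob_b: torch.Tensor, thr: float = 0.5) -> torch.Tensor:
    """两个模型输出概率图的简单平均集成,再阈值化得到掩码(示意)。"""
    return ((prob_a + prob_b) / 2 > thr).long()

# 用法示意:对某个肿瘤子区域计算 Dice
mask = ensemble_predict(torch.rand(1, 128, 128, 128), torch.rand(1, 128, 128, 128))
print(dice_score(mask, torch.randint(0, 2, (1, 128, 128, 128))))
```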

【3】 Split GCN: Effective Interactive Annotation for Segmentation of Disconnected Instance 标题:Split GCN:一种有效的断开实例分割的交互式注解 链接:https://arxiv.org/abs/2112.06454

作者:Namgil Kim,Barom Kang,Yeonok Cho 机构:Department of Data Science, Ajou University, SelectStar AI Research 备注:11 pages 摘要:人工标注对象边界的成本很高。近来,基于多边形的人机交互注释方法已显示出成功的性能。然而,在顶点拓扑连通的前提下,这些方法难以预测对象中的断开组件。本文介绍了Split-GCN,一种基于多边形方法和自注意机制的新型结构。通过提供方向信息,Split-GCN使多边形顶点能够更精确地移动到对象边界。我们的模型利用顶点依赖关系的上下文交换来变换初始拓扑,从而成功预测对象的断开组件。在Cityscapes上,Split-GCN展现出与最先进模型相当的性能,并超越了基线模型。在四个跨域数据集上,我们验证了模型的泛化能力。 摘要:Annotating object boundaries by humans demands high costs. Recently, polygon-based annotation methods with human interaction have shown successful performance. However, given the connected vertex topology, these methods exhibit difficulty predicting the disconnected components in an object. This paper introduces Split-GCN, a novel architecture based on the polygon approach and self-attention mechanism. By offering the direction information, Split-GCN enables the polygon vertices to move more precisely to the object boundary. Our model successfully predicts disconnected components of an object by transforming the initial topology using the context exchange about the dependencies of vertices. Split-GCN demonstrates competitive performance with the state-of-the-art models on Cityscapes and even higher performance with the baseline models. On four cross-domain datasets, we confirm our model's generalization ability.

【4】 PartGlot: Learning Shape Part Segmentation from Language Reference Games 标题:PartGlot:从语言参考游戏中学习形状部分分割 链接:https://arxiv.org/abs/2112.06390

作者:Juil Koo,Ian Huang,Panos Achlioptas,Leonidas Guibas,Minhyuk Sung 机构:KAIST, Stanford University 摘要:我们介绍了PartGlot,一个仅基于部件指称语言学习三维形状几何语义部件分割的神经框架及相关体系结构。我们利用了这样一个事实:对形状的语言描述可以为形状的各部件提供先验——因为自然语言的演化反映了人类对物体组成结构的感知,而这种结构对物体的识别和使用至关重要。在训练中,我们使用ShapeGlot工作中为其参考游戏收集的成对几何/语言数据:演讲者创建一句话语以区分目标形状和两个干扰项,听者必须根据该话语找到目标。我们的网络就是为了解决这个目标辨别问题而设计的,它精心结合了一个基于Transformer的注意模块,使输出的注意能够精确突出语言中描述的语义部件。此外,网络运行时不需要对3D几何本身的任何直接监督。令人惊讶的是,我们进一步证明,学习到的部件信息可以泛化到训练期间未见过的形状类别。我们的方法开启了仅从语言学习三维形状部件的可能性,而不需要大规模的部件几何注释,从而降低注释获取成本。 摘要:We introduce PartGlot, a neural framework and associated architectures for learning semantic part segmentation of 3D shape geometry, based solely on part referential language. We exploit the fact that linguistic descriptions of a shape can provide priors on the shape's parts -- as natural language has evolved to reflect human perception of the compositional structure of objects, essential to their recognition and use. For training, we use the paired geometry / language data collected in the ShapeGlot work for their reference game, where a speaker creates an utterance to differentiate a target shape from two distractors and the listener has to find the target based on this utterance. Our network is designed to solve this target discrimination problem, carefully incorporating a Transformer-based attention module so that the output attention can precisely highlight the semantic part or parts described in the language. Furthermore, the network operates without any direct supervision on the 3D geometry itself. Surprisingly, we further demonstrate that the learned part information is generalizable to shape classes unseen during training. Our approach opens the possibility of learning 3D shape parts from language alone, without the need for large-scale part geometry annotations, thus facilitating annotation acquisition.

【5】 Multimodal-based Scene-Aware Framework for Aquatic Animal Segmentation 标题:基于多模态的水生动物场景感知分割框架 链接:https://arxiv.org/abs/2112.06193

作者:Minh-Quan Le,Trung-Nghia Le,Tam V. Nguyen,Isao Echizen,Minh-Triet Tran 机构:University of Science, Ho Chi Minh City, Vietnam, National Institute of Informatics, Japan, University of Dayton, US, John von Neumann Institute, VNU-HCM, Vietnam, Vietnam National University, Ho Chi Minh City, Vietnam 摘要:近年来,目标分割研究取得了很大进展。除一般对象外,水生动物也引起了研究关注。基于深度学习的方法被广泛用于水生动物分割,并取得了良好的效果。然而,该领域缺乏具有挑战性的基准数据集。因此,我们创建了一个名为“水生动物物种”的新数据集。此外,我们设计了一种新的基于多模态的场景感知分割框架,该框架利用多个视图分割模型的优势来有效分割水生动物图像。为了提高训练性能,我们开发了一种引导式mixup增强方法。大量实验将该框架与最先进的实例分割方法进行了性能比较,结果表明该方法有效,并显著优于现有方法。 摘要:Recent years have witnessed great advances in object segmentation research. In addition to generic objects, aquatic animals have attracted research attention. Deep learning-based methods are widely used for aquatic animal segmentation and have achieved promising performance. However, there is a lack of challenging datasets for benchmarking. Therefore, we have created a new dataset dubbed "Aquatic Animal Species." Furthermore, we devised a novel multimodal-based scene-aware segmentation framework that leverages the advantages of multiple view segmentation models to segment images of aquatic animals effectively. To improve training performance, we developed a guided mixup augmentation method. Extensive experiments comparing the performance of the proposed framework with state-of-the-art instance segmentation methods demonstrated that our method is effective and that it significantly outperforms existing methods.
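原文提出的"引导式mixup增强"以标准mixup为基础;下面给出普通mixup的最小PyTorch示意,其中Beta分布参数alpha取常用默认值,引导信号部分属论文细节,此处不做臆测:

```python
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """标准 mixup:对一个 batch 的图像及其 one-hot 标签做凸组合。"""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # 混合系数
    idx = torch.randperm(x.size(0))                               # 随机配对
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return x_mix, y_mix

# 用法示意:8 张 3x256x256 图像与 5 类 one-hot 标签
imgs = torch.randn(8, 3, 256, 256)
labels = torch.eye(5)[torch.randint(0, 5, (8,))]
imgs_mix, labels_mix = mixup(imgs, labels)
```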

【6】 CPRAL: Collaborative Panoptic-Regional Active Learning for Semantic Segmentation 标题:CPRAL:面向语义分割的协作式全景-区域主动学习 链接:https://arxiv.org/abs/2112.05975

作者:Yu Qiao,Jincheng Zhu,Chengjiang Long,Zeyao Zhang,Yuxin Wang,Zhenjun Du,Xin Yang 机构:Dalian University of Technology,JD Finance America Corporation,SIASUN Robot & Automation CO.,Ltd. 备注:9 pages, 9 figures, accepted by AAAI 2022 摘要:通过主动学习(AL)获取最具代表性的样本,可以最大限度地减少图像级或像素级注释的工作量,从而使许多依赖数据的计算机视觉任务受益。在本文中,我们提出了一个新的协作式全景-区域主动学习框架(CPRAL)来解决语义分割任务。对于最初以像素级注释采样的一小批图像,我们利用全景信息来初始选择未标记样本。考虑到分割数据集中的类不平衡,我们引入了区域高斯注意模块(RGA)来实现语义偏向的选择。该子集先通过投票熵突出显示,再经高斯核加权关注,以最大化有偏区域。我们还提出了上下文标签扩展(CLE),在上下文注意的指导下增强区域注释。通过语义不可知的全景匹配与区域偏向的选择和扩展的协作,我们的CPRAL可以在标注工作量和性能之间取得平衡,并兼顾语义分布。我们在Cityscapes和BDD10K数据集上进行了广泛的实验,结果表明CPRAL以更少的标注比例获得了令人印象深刻的结果,优于最前沿的方法。 摘要:Acquiring the most representative examples via active learning (AL) can benefit many data-dependent computer vision tasks by minimizing efforts of image-level or pixel-wise annotations. In this paper, we propose a novel Collaborative Panoptic-Regional Active Learning framework (CPRAL) to address the semantic segmentation task. For a small batch of images initially sampled with pixel-wise annotations, we employ panoptic information to initially select unlabeled samples. Considering the class imbalance in the segmentation dataset, we import a Regional Gaussian Attention module (RGA) to achieve semantics-biased selection. The subset is highlighted by vote entropy and then attended by Gaussian kernels to maximize the biased regions. We also propose a Contextual Labels Extension (CLE) to boost regional annotations with contextual attention guidance. With the collaboration of semantics-agnostic panoptic matching and region-biased selection and extension, our CPRAL can strike a balance between labeling efforts and performance and compromise the semantics distribution. We perform extensive experiments on Cityscapes and BDD10K datasets and show that CPRAL outperforms the cutting-edge methods with impressive results and less labeling proportion.

【7】 Attacking Point Cloud Segmentation with Color-only Perturbation 标题:基于纯颜色扰动的点云分割攻击 链接:https://arxiv.org/abs/2112.05871

作者:Jiacen Xu,Zhe Zhou,Boyuan Feng Yufeng Ding,Zhou Li 机构:University of California, Irvine, Fudan University, University of California, Santa Barbara 摘要:最近在三维点云语义分割方面的研究工作通过采用深度CNN(卷积神经网络)和GCN(图卷积网络)取得了优异的性能。然而,这些复杂模型的鲁棒性还没有得到系统的分析。鉴于语义分割已应用于许多安全关键应用(例如,自动驾驶、地质传感),填补这一知识空白非常重要,特别是在对抗性样本下这些模型如何受到影响。在研究针对点云的对抗性攻击时,我们发现所有攻击都是针对单目标识别的,并且扰动是在点坐标上进行的。我们认为,在物理世界的约束下,基于坐标的微扰不太可能实现。因此,我们提出了一种新的颜色扰动方法COLPER,并将其用于语义分割。通过对室内数据集(S3DIS)和室外数据集(Semantic3D)上的COLPER与三种点云分割模型(PointNet++、DeepGCNs和RandLA Net)进行对比评估,我们发现在目标和非目标攻击设置下,仅颜色扰动足以显著降低分割精度和aIoU。 摘要:Recent research efforts on 3D point-cloud semantic segmentation have achieved outstanding performance by adopting deep CNN (convolutional neural networks) and GCN (graph convolutional networks). However, the robustness of these complex models has not been systematically analyzed. Given that semantic segmentation has been applied in many safety-critical applications (e.g., autonomous driving, geological sensing), it is important to fill this knowledge gap, in particular, how these models are affected under adversarial samples. While adversarial attacks against point cloud have been studied, we found all of them were targeting single-object recognition, and the perturbation is done on the point coordinates. We argue that the coordinate-based perturbation is unlikely to realize under the physical-world constraints. Hence, we propose a new color-only perturbation method named COLPER, and tailor it to semantic segmentation. By evaluating COLPER on an indoor dataset (S3DIS) and an outdoor dataset (Semantic3D) against three point cloud segmentation models (PointNet++, DeepGCNs, and RandLA-Net), we found color-only perturbation is sufficient to significantly drop the segmentation accuracy and aIoU, under both targeted and non-targeted attack settings.

【8】 A Novel Gaussian Process Based Ground Segmentation Algorithm with Local-Smoothness Estimation 标题:一种新的基于高斯过程的局部光滑度估计地面分割算法 链接:https://arxiv.org/abs/2112.05847

作者:Pouria Mehrabi,Hamid D. Taghirad 机构:Toosi University of Technology 备注:arXiv admin note: substantial text overlap with arXiv:2111.10638 摘要:自主地面车辆(ALV)应能在未知环境中有效识别地面。本文针对粗糙驾驶场景下的地面分割任务,提出了一种新的基于$\mathcal{GP}$的方法。该方法使用非平稳协方差函数作为$\mathcal{GP}$的核,并假定地面表面行为仅表现出局部平滑性,据此获得核长度尺度的点估计。为此,引入两个高斯过程分别对数据的观测特征和局部特征建模:观测过程(observation process)用于对地面建模,而潜在过程(latent process)作用于长度尺度值,以估计每个输入位置的长度尺度点值。该潜在过程的输入位置通过一个具有物理动机的流程选取,以体现对地面状况的直觉。此外,通过假设环境中存在若干假想曲面,并认为每一簇数据点都来自对这些曲面的测量,可以表达对长度尺度取值的直观猜测。贝叶斯推理采用最大后验(maximum a posteriori)准则实现。对数边际似然函数被视为多任务目标函数,以表达每一帧地面的全帧无偏视图。仿真结果表明,即使在不均匀、粗糙的场景中,该方法也优于同类基于高斯过程的地面分割方法。在不均匀场景中,相邻分段的地面结构并不相似,而该方法基于全帧视角给出有效的地面估计,而不是仅逐段估计可能的地面。 摘要:Autonomous Land Vehicles (ALV) shall efficiently recognize the ground in unknown environments. A novel $\mathcal{GP}$-based method is proposed for the ground segmentation task in rough driving scenarios. A non-stationary covariance function is utilized as the kernel for the $\mathcal{GP}$. The ground surface behavior is assumed to only demonstrate local-smoothness. Thus, point estimates of the kernel's length-scales are obtained. Thus, two Gaussian processes are introduced to separately model the observation and local characteristics of the data. While the observation process is used to model the ground, the latent process is put on length-scale values to estimate point values of length-scales at each input location. Input locations for this latent process are chosen in a physically-motivated procedure to represent an intuition about ground condition. Furthermore, an intuitive guess of length-scale value is represented by assuming the existence of hypothetical surfaces in the environment that every bunch of data points may be assumed to be resulted from measurements from these surfaces. Bayesian inference is implemented using maximum a posteriori criterion. The log-marginal likelihood function is assumed to be a multi-task objective function, to represent a whole-frame unbiased view of the ground at each frame. Simulation results show the effectiveness of the proposed method even in an uneven, rough scene which outperforms similar Gaussian process based ground segmentation methods. While adjacent segments do not have similar ground structure in an uneven scene, the proposed method gives an efficient ground estimation based on a whole-frame viewpoint instead of just estimating segment-wise probable ground surfaces.
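非平稳协方差函数的一个常见选择是Gibbs核,它允许每个输入位置拥有自己的长度尺度 l(x);以下numpy示意展示了这一思想。论文所用核函数的确切形式以原文为准,此处仅作说明:

```python
import numpy as np

def gibbs_kernel(x1: np.ndarray, x2: np.ndarray,
                 l1: np.ndarray, l2: np.ndarray) -> np.ndarray:
    """非平稳 Gibbs 核:k(x, x') 随两点处的长度尺度 l(x)、l(x') 变化。

    x1:(N,)、x2:(M,) 为一维输入;l1、l2 为对应位置的长度尺度点估计。
    """
    dx2 = (x1[:, None] - x2[None, :]) ** 2                  # 平方距离矩阵
    ls2 = l1[:, None] ** 2 + l2[None, :] ** 2               # 长度尺度平方和
    prefac = np.sqrt(2.0 * l1[:, None] * l2[None, :] / ls2)
    return prefac * np.exp(-dx2 / ls2)                      # (N, M) 协方差矩阵

# 用法示意:平坦区域长度尺度大(更平滑),起伏区域长度尺度小
x = np.linspace(0.0, 10.0, 50)
lengthscales = 0.5 + 0.2 * x                                # 假设的长度尺度点估计
K = gibbs_kernel(x, x, lengthscales, lengthscales)
```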

【9】 Semantic Interaction in Augmented Reality Environments for Microsoft HoloLens 标题:Microsoft HoloLens增强现实环境中的语义交互 链接:https://arxiv.org/abs/2112.05846

作者:Peer Schüett,Max Schwarz,Sven Behnke 机构:University of Bonn 备注:None 摘要:增强现实是一种很有前景的人机交互技术。机器人技术总是将系统置于其所处环境中考虑,因此能在该环境中直接显示可视化内容并接收用户输入尤为有益。我们使用Microsoft HoloLens探索这一想法:借助它捕捉室内环境,并针对已知对象类别显示交互提示。在用户移动时,HoloLens记录的3D网格通过投影方法在线标注语义类别,这使我们能够采用最先进的2D语义分割方法。标注结果被融合到网格上;突出的对象分段被识别出来并以3D形式显示给用户。最后,用户可以通过对目标对象做手势来触发动作。我们给出了定性结果,并在一个室内数据集上详细分析了我们方法的准确性和性能。 摘要:Augmented Reality is a promising technique for human-machine interaction. Especially in robotics, which always considers systems in their environment, it is highly beneficial to display visualizations and receive user input directly in exactly that environment. We explore this idea using the Microsoft HoloLens, with which we capture indoor environments and display interaction cues with known object classes. The 3D mesh recorded by the HoloLens is annotated on-line, as the user moves, with semantic classes using a projective approach, which allows us to use a state-of-the-art 2D semantic segmentation method. The results are fused onto the mesh; prominent object segments are identified and displayed in 3D to the user. Finally, the user can trigger actions by gesturing at the object. We both present qualitative results and analyze the accuracy and performance of our method in detail on an indoor dataset.

【10】 Hypernet-Ensemble Learning of Segmentation Probability for Medical Image Segmentation with Ambiguous Labels 标题:模糊标签医学图像分割概率的超网集成学习 链接:https://arxiv.org/abs/2112.06693

作者:Sungmin Hong,Anna K. Bonkhoff,Andrew Hoopes,Martin Bretzner,Markus D. Schirmer,Anne-Katrin Giese,Adrian V. Dalca,Polina Golland,Natalia S. Rost 机构:Anna Bonkhoff, Markus Schirmer, UKE, Adrian Dalca, MGH, HMS & MIT, MIT CSAIL, Natalia Rost 摘要:尽管深度学习(DL)在许多分割任务上表现优异,但基于DL的方法往往以高度极化的标签概率对其预测过于自信。对于许多存在固有标签模糊性的应用来说,这通常是不可取的,即使在人工注释中也是如此。这一挑战此前通过利用每幅图像的多个注释和分割不确定性来应对。然而,在现实应用中,每幅图像的多个注释通常不可得,且这种不确定性无法让用户完全控制分割结果。在本文中,我们针对每幅图像只有一个模糊注释的现实场景,提出了在不牺牲性能的前提下改进分割概率估计的新方法。我们通过变化的Tversky损失鼓励一组网络分别进行欠分割/过分割(同时不惩罚均衡分割),并对它们估计的分割概率图进行边缘化。此外,我们还提出了一种统一的超网络集成方法,以减轻训练多个网络的计算负担。我们的方法成功估计了反映底层结构的分割概率图,并为具有挑战性的三维医学图像分割提供了直观的分割控制。虽然我们方法的主要目标并非提升二值分割性能,但其表现仍略优于现有最先进方法。代码可从 https://github.com/sh4174/HypernetEnsemble 获得。 摘要:Despite the superior performance of Deep Learning (DL) on numerous segmentation tasks, the DL-based approaches are notoriously overconfident about their prediction with highly polarized label probability. This is often not desirable for many applications with the inherent label ambiguity even in human annotations. This challenge has been addressed by leveraging multiple annotations per image and the segmentation uncertainty. However, multiple per-image annotations are often not available in a real-world application and the uncertainty does not provide full control on segmentation results to users. In this paper, we propose novel methods to improve the segmentation probability estimation without sacrificing performance in a real-world scenario that we have only one ambiguous annotation per image. We marginalize the estimated segmentation probability maps of networks that are encouraged to under-/over-segment with the varying Tversky loss without penalizing balanced segmentation. Moreover, we propose a unified hypernetwork ensemble method to alleviate the computational burden of training multiple networks. Our approaches successfully estimated the segmentation probability maps that reflected the underlying structures and provided the intuitive control on segmentation for the challenging 3D medical image segmentation. Although the main focus of our proposed methods is not to improve the binary segmentation performance, our approaches marginally outperformed the state-of-the-arts. The codes are available at https://github.com/sh4174/HypernetEnsemble.
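原文通过变化的Tversky损失使各网络分别偏向欠分割或过分割;下面是Tversky损失的标准PyTorch示意,alpha、beta的具体取值与概率图的边缘化方式见论文:

```python
import torch

def tversky_loss(prob: torch.Tensor, target: torch.Tensor,
                 alpha: float, beta: float, eps: float = 1e-6) -> torch.Tensor:
    """Tversky 损失:alpha 加权假阳性、beta 加权假阴性。

    alpha 较大时惩罚假阳性、鼓励欠分割;beta 较大时相反;
    alpha = beta = 0.5 时退化为 Dice 损失。
    """
    prob, target = prob.flatten(), target.float().flatten()
    tp = (prob * target).sum()                 # 真阳性(软计数)
    fp = (prob * (1 - target)).sum()           # 假阳性
    fn = ((1 - prob) * target).sum()           # 假阴性
    return 1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# 用法示意:一组不同的 (alpha, beta) 训练出偏向各异的网络,再做边缘化
loss = tversky_loss(torch.rand(2, 1, 64, 64),
                    torch.randint(0, 2, (2, 1, 64, 64)), alpha=0.7, beta=0.3)
```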

【11】 gACSON software for automated segmentation and morphology analyses of myelinated axons in 3D electron microscopy 标题:三维电镜下有髓轴突自动分割和形态学分析的gACSON软件 链接:https://arxiv.org/abs/2112.06476

作者:Andrea Behanova,Ali Abdollahzadeh,Ilya Belevich,Eija Jokitalo,Alejandra Sierra,Jussi Tohka 机构:Department of Information Technology, Uppsala University, Uppsala, Sweden., Biomedical Imaging Unit, A.I.Virtanen Institute for Molecular Sciences, University of Eastern Finland, Kuopio, Finland. 摘要:背景与目的:电子显微镜(EM)的进步现在允许以纳米级分辨率对数百微米的组织进行三维成像,为研究大脑的超微结构提供了新的机会。在这项工作中,我们介绍了免费提供的gACSON软件,用于脑组织样本3D-EM体积中有髓轴突的可视化、分割、评估和形态学分析。方法:gACSON软件配备图形用户界面(GUI)。它自动分割有髓轴突的轴突内空间及其相应的髓鞘,并支持对分割出的组件进行手动分割、校对和交互式校正。gACSON可分析有髓轴突的形态,如轴突直径、轴突偏心率、髓鞘厚度或g比率。结果:我们通过分割和分析假手术或创伤性脑损伤(TBI)后大鼠体感皮层六个3D-EM体积中的有髓轴突,说明gACSON的用途。我们的结果表明,TBI动物在损伤五个月后,体感皮层有髓轴突的等效直径减小。结论:我们的结果表明,gACSON是对3D-EM体积中有髓轴突进行可视化、分割、评估和形态学分析的有价值工具。gACSON在MIT许可证下免费提供,网址为 https://github.com/AndreaBehan/g-ACSON 。 摘要:Background and Objective: Advances in electron microscopy (EM) now allow three-dimensional (3D) imaging of hundreds of micrometers of tissue with nanometer-scale resolution, providing new opportunities to study the ultrastructure of the brain. In this work, we introduce a freely available gACSON software for visualization, segmentation, assessment, and morphology analysis of myelinated axons in 3D-EM volumes of brain tissue samples. Methods: The gACSON software is equipped with a graphical user interface (GUI). It automatically segments the intra-axonal space of myelinated axons and their corresponding myelin sheaths and allows manual segmentation, proofreading, and interactive correction of the segmented components. gACSON analyzes the morphology of myelinated axons, such as axonal diameter, axonal eccentricity, myelin thickness, or g-ratio. Results: We illustrate the use of gACSON by segmenting and analyzing myelinated axons in six 3D-EM volumes of rat somatosensory cortex after sham surgery or traumatic brain injury (TBI). Our results suggest that the equivalent diameter of myelinated axons in somatosensory cortex was decreased in TBI animals five months after the injury. Conclusions: Our results indicate that gACSON is a valuable tool for visualization, segmentation, assessment, and morphology analysis of myelinated axons in 3D-EM volumes. gACSON is freely available at https://github.com/AndreaBehan/g-ACSON under the MIT license.

【12】 PyTorch Connectomics: A Scalable and Flexible Segmentation Framework for EM Connectomics 标题:PyTorch Connectomics:一种可扩展、灵活的EM Connectomics分割框架 链接:https://arxiv.org/abs/2112.05754

作者:Zudi Lin,Donglai Wei,Jeff Lichtman,Hanspeter Pfister 机构:Harvard University, Boston College 备注:Technical report 摘要:我们介绍了PyTorch Connectomics(PyTC),这是一个基于PyTorch构建的开源深度学习框架,用于体积显微镜图像的语义分割和实例分割。我们展示了PyTC在连接组学领域的有效性,该领域旨在以纳米分辨率分割和重建神经元、突触以及线粒体等其他细胞器,以理解动物大脑中的神经元通讯、代谢和发育。PyTC是一个可扩展且灵活的工具箱,可处理不同规模的数据集,并支持多任务和半监督学习,以便在训练期间更好地利用昂贵的专家注释和大量未标记数据。这些功能只需改变配置选项而无需编写代码即可在PyTC中轻松实现,并可适配到面向不同组织和成像模态的其他2D和3D分割任务。从数量上看,我们的框架在CREMI挑战的突触间隙分割任务中取得了最佳性能(相对超出现有最佳结果6.1%),并在线粒体和神经元细胞核分割上具有竞争力。代码和教程公开于 https://connectomics.readthedocs.io 。 摘要:We present PyTorch Connectomics (PyTC), an open-source deep-learning framework for the semantic and instance segmentation of volumetric microscopy images, built upon PyTorch. We demonstrate the effectiveness of PyTC in the field of connectomics, which aims to segment and reconstruct neurons, synapses, and other organelles like mitochondria at nanometer resolution for understanding neuronal communication, metabolism, and development in animal brains. PyTC is a scalable and flexible toolbox that tackles datasets at different scales and supports multi-task and semi-supervised learning to better exploit expensive expert annotations and the vast amount of unlabeled data during training. Those functionalities can be easily realized in PyTC by changing the configuration options without coding and adapted to other 2D and 3D segmentation tasks for different tissues and imaging modalities. Quantitatively, our framework achieves the best performance in the CREMI challenge for synaptic cleft segmentation (outperforms existing best result by relatively 6.1$\%$) and competitive performance on mitochondria and neuronal nuclei segmentation. Code and tutorials are publicly available at https://connectomics.readthedocs.io.

Zero/Few Shot|迁移|域适配|自适应(6篇)

【1】 VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks 标题:VL-Adapter:视觉和语言任务的参数高效迁移学习 链接:https://arxiv.org/abs/2112.06825

作者:Yi-Lin Sung,Jaemin Cho,Mohit Bansal 机构:UNC Chapel Hill 备注:13 pages 摘要:最近,对在大型文本语料库上预训练的语言模型进行微调,已为视觉与语言(V&L)任务以及纯语言任务带来巨大改进。然而,由于模型规模快速增长,微调预训练模型的全部参数变得不切实际。因此,在本文中,我们将基于适配器的参数高效迁移学习技术引入VL-BART和VL-T5等V&L模型。我们在统一的多任务设置中,在四个不同的V&L任务(VQAv2、GQA、NLVR2和MSCOCO图像字幕)上评估我们的方法。通过细致的训练和彻底的实验,我们将三种流行的基于适配器的方法(Adapter、Hyperformer、Compacter)与标准的完全微调以及最近提出的提示调优(prompt-tuning)方法进行了基准对比。我们还通过共享适配器权重以跨任务获取知识,从而提高适配器的效率和性能。我们的结果表明,使用权重共享技术训练适配器(仅占总参数的4.4%)可以匹敌微调整个模型的性能。最后,我们给出了全面的分析,包括适配器与任务特定提示的组合以及V&L预训练对适配器的影响。我们的代码可从以下网址获得:https://github.com/ylsung/VL_adapter. 摘要:Recently, fine-tuning language models pre-trained on large text corpora have provided huge improvements on vision-and-language (V&L) tasks as well as on pure language tasks. However, fine-tuning the entire parameter set of pre-trained models becomes impractical since the model size is growing rapidly. Hence, in this paper, we introduce adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5. We evaluate our methods in a unified multi-task setup on four diverse V&L tasks: VQAv2, GQA, NLVR2, and MSCOCO image captioning. With careful training and thorough experiments, we benchmark three popular adapter-based methods (Adapter, Hyperformer, Compacter) against the standard full fine-tuning and the recently proposed prompt-tuning approach. We also enhance the efficiency and performance of adapters by sharing their weights to attain knowledge across tasks. Our results demonstrate that training the adapter with the weight-sharing technique (4.4% of total parameters) can match the performance of fine-tuning the entire model. Lastly, we present a comprehensive analysis including the combination of adapter and task-specific prompts and the impact of V&L pre-training on adapters. Our code is available at: https://github.com/ylsung/VL_adapter.
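基于适配器的参数高效微调的核心是在Transformer层中插入小的瓶颈模块,冻结主干、只训练适配器。下面是Houlsby式瓶颈适配器的通用PyTorch示意;VL-Adapter中的权重共享与具体插入位置以原文为准:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """瓶颈适配器:降维 -> 非线性 -> 升维,外加残差连接。"""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # 降维投影
        self.up = nn.Linear(bottleneck, dim)     # 升维投影
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # 残差使初始行为接近恒等映射

# 微调时冻结主干、只更新适配器(及任务头)的参数:
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = Adapter(dim=768)
out = adapter(backbone(torch.randn(2, 16, 768)))
```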

【2】 Hybrid Graph Neural Networks for Few-Shot Learning 标题:面向少样本学习的混合图神经网络 链接:https://arxiv.org/abs/2112.06538

作者:Tianyuan Yu,Sen He,Yi-Zhe Song,Tao Xiang 机构:Center for Vision, Speech and Signal Processing, University of Surrey, National University of Defense Technology, FlyTek-Surrey Joint Research Centre on Artificial Intelligence 备注:To appear in AAAI 2022 摘要:图神经网络(GNN)已被用于解决少样本学习(FSL)问题,并在直推(transductive)设置下显示出巨大潜力。然而,在归纳设置下,现有的基于GNN的方法竞争力较弱。这是因为它们将实例GNN用作标签传播/分类模块,并与特征嵌入网络联合进行元学习。这种设计是有问题的,因为分类器需要快速适应新任务,而嵌入则不需要。为了克服这个问题,本文提出了一种新的混合GNN(HGNN)模型,由实例GNN和原型GNN两个GNN组成。它们不做标签传播,而是充当特征嵌入自适应模块,使元学习到的特征嵌入快速适应新任务。重要的是,它们被设计用于应对FSL中一个基本却常被忽视的挑战:每个类只有少量样本时,任何少样本分类器都会对采样不佳的样本敏感,这些样本要么是离群点,要么会造成类间分布重叠。我们设计的两个GNN分别用于处理这两类采样不佳的少量样本,并在混合GNN模型中利用了二者的互补性。大量实验表明,我们的HGNN在三个FSL基准上取得了新的最先进水平。 摘要:Graph neural networks (GNNs) have been used to tackle the few-shot learning (FSL) problem and shown great potentials under the transductive setting. However under the inductive setting, existing GNN based methods are less competitive. This is because they use an instance GNN as a label propagation/classification module, which is jointly meta-learned with a feature embedding network. This design is problematic because the classifier needs to adapt quickly to new tasks while the embedding does not. To overcome this problem, in this paper we propose a novel hybrid GNN (HGNN) model consisting of two GNNs, an instance GNN and a prototype GNN. Instead of label propagation, they act as feature embedding adaptation modules for quick adaptation of the meta-learned feature embedding to new tasks. Importantly they are designed to deal with a fundamental yet often neglected challenge in FSL, that is, with only a handful of shots per class, any few-shot classifier would be sensitive to badly sampled shots which are either outliers or can cause inter-class distribution overlapping. Our two GNNs are designed to address these two types of poorly sampled few-shots respectively and their complementarity is exploited in the hybrid GNN model. Extensive experiments show that our HGNN obtains new state-of-the-art on three FSL benchmarks.

【3】 Shaping Visual Representations with Attributes for Few-Shot Learning 标题:基于属性的视觉表征塑造方法及其在Few-Shot学习中的应用 链接:https://arxiv.org/abs/2112.06398

作者:Haoxing Chen,Huaxiong Li,Yaohui Li,Chunlin Chen 机构:Nanjing University, Nanjing, China 摘要:少样本识别旨在低数据量条件下识别新的类别。由于图像稀缺,机器无法获得足够的有效信息,模型的泛化能力极弱。通过使用辅助语义模态,最近基于度量学习的少样本学习方法取得了令人满意的效果。然而,这些方法只增强了支持类的表示,而查询图像没有语义模态信息来增强表示。为此,我们提出了属性塑造学习(ASL),它可以归一化视觉表示以预测查询图像的属性。我们还设计了一个属性-视觉注意模块(AVAM),利用属性生成更具判别力的特征。我们的方法使视觉表示能够在属性引导下聚焦于重要区域。实验表明,我们的方法可以在CUB和SUN基准上获得有竞争力的结果。我们的代码可在 https://github.com/chenhaoxing/ASL 获取。 摘要:Few-shot recognition aims to recognize novel categories under low-data regimes. Due to the scarcity of images, machines cannot obtain enough effective information, and the generalization ability of the model is extremely weak. By using auxiliary semantic modalities, recent metric-learning based few-shot learning methods have achieved promising performances. However, these methods only augment the representations of support classes, while query images have no semantic modalities information to enhance representations. Instead, we propose attribute-shaped learning (ASL), which can normalize visual representations to predict attributes for query images. And we further devise an attribute-visual attention module (AVAM), which utilizes attributes to generate more discriminative features. Our method enables visual representations to focus on important regions with attributes guidance. Experiments demonstrate that our method can achieve competitive results on CUB and SUN benchmarks. Our code is available at https://github.com/chenhaoxing/ASL.

【4】 Image-to-Height Domain Translation for Synthetic Aperture Sonar 标题:合成孔径声纳的图像到高度域转换 链接:https://arxiv.org/abs/2112.06307

作者:Dylan Stewart,Shawn Johnson,Alina Zare 摘要:合成孔径声纳对海底纹理的观测取决于多种因素。在这项工作中,我们重点关注采集几何与各向同性和各向异性纹理的关系。采集几何的低掠射角,加上声纳路径相对于各向异性纹理的方向,给图像对齐和其他多视图场景理解框架带来了重大挑战。我们先前提出利用从估计的海底地形中提取的特征来改进场景理解。虽然已有多种通过强度估计海底起伏的方法,但文献中尚无大规模研究;此外,目前还没有共配准的海底地形图与声纳图像数据集可用于学习这种域转换。我们通过使用两种独特的声纳数据模拟技术生成一个大型模拟数据集来解决这些问题,该数据集包含共配准的海底地形图与强度图对。我们采用三种复杂度不同的模型将强度图像转换为海底地形:高斯马尔可夫随机场方法(GMRF)、条件生成对抗网络(cGAN)和UNet体系结构。各方法在共配准的模拟数据集上以L1误差进行比较。此外,还展示了在模拟和真实SAS图像上的预测结果。最后,在两个人工对齐的SAS图像数据集上对各模型进行比较,并以多个方面的L1误差评估其相对于直接使用强度的优势。我们的综合实验表明,所提出的UNet体系结构在模拟和真实SAS图像的海底地形估计上优于GMRF和pix2pix cGAN模型。 摘要:Observations of seabed texture with synthetic aperture sonar are dependent upon several factors. In this work, we focus on collection geometry with respect to isotropic and anisotropic textures. The low grazing angle of the collection geometry, combined with orientation of the sonar path relative to anisotropic texture, poses a significant challenge for image-alignment and other multi-view scene understanding frameworks. We previously proposed using features captured from estimated seabed relief to improve scene understanding. While several methods have been developed to estimate seabed relief via intensity, no large-scale study exists in the literature. Furthermore, a dataset of coregistered seabed relief maps and sonar imagery is nonexistent to learn this domain translation. We address these problems by producing a large simulated dataset containing coregistered pairs of seabed relief and intensity maps from two unique sonar data simulation techniques. We apply three types of models, with varying complexity, to translate intensity imagery to seabed relief: a Gaussian Markov Random Field approach (GMRF), a conditional Generative Adversarial Network (cGAN), and UNet architectures. Methods are compared in reference to the coregistered simulated datasets using L1 error. Additionally, predictions on simulated and real SAS imagery are shown. Finally, models are compared on two datasets of hand-aligned SAS imagery and evaluated in terms of L1 error across multiple aspects in comparison to using intensity. Our comprehensive experiments show that the proposed UNet architectures outperform the GMRF and pix2pix cGAN models on seabed relief estimation for simulated and real SAS imagery.

【5】 Unsupervised Domain-Specific Deblurring using Scale-Specific Attention 标题:使用特定尺度注意力的无监督特定领域去模糊 链接:https://arxiv.org/abs/2112.06175

作者:Praveen Kandula,Rajagopalan. A. N 机构:IIT Madras 摘要:在文献中,从粗到细或尺度循环方法(即从低分辨率版本逐步恢复干净图像)已成功用于单图像去模糊。然而,现有方法的一个主要缺点是需要成对数据,即同一场景的清晰-模糊图像对,而其采集过程复杂繁琐。此外,由于损失函数上的强监督,此类网络的预训练模型会强烈偏向训练期间见过的模糊,在推理期间遇到新的模糊核时往往表现欠佳。为了解决上述问题,我们提出了使用尺度自适应注意模块(SAAM)的无监督特定领域去模糊方法。我们的网络不需要成对监督数据进行训练,去模糊机制主要由对抗性损失引导,因此适用于一个模糊函数的分布。给定一张模糊输入图像,我们的模型在训练期间使用同一图像的不同分辨率,而SAAM允许信息在不同分辨率之间有效流动。在特定尺度的网络训练中,SAAM以当前尺度为条件关注较低尺度的特征。不同的消融研究表明,我们的从粗到细机制优于端到端的无监督模型,且与文献中使用的注意模型相比,SAAM的注意效果更好。定性和定量比较(基于无参考指标)表明,我们的方法优于先前的无监督方法。 摘要:In the literature, coarse-to-fine or scale-recurrent approach i.e. progressively restoring a clean image from its low-resolution versions has been successfully employed for single image deblurring. However, a major disadvantage of existing methods is the need for paired data; i.e. sharp-blur image pairs of the same scene, which is a complicated and cumbersome acquisition procedure. Additionally, due to strong supervision on loss functions, pre-trained models of such networks are strongly biased towards the blur experienced during training and tend to give sub-optimal performance when confronted by new blur kernels during inference time. To address the above issues, we propose unsupervised domain-specific deblurring using a scale-adaptive attention module (SAAM). Our network does not require supervised pairs for training, and the deblurring mechanism is primarily guided by adversarial loss, thus making our network suitable for a distribution of blur functions. Given a blurred input image, different resolutions of the same image are used in our model during training and SAAM allows for effective flow of information across the resolutions. For network training at a specific scale, SAAM attends to lower scale features as a function of the current scale. Different ablation studies show that our coarse-to-fine mechanism outperforms end-to-end unsupervised models and SAAM is able to attend better compared to attention models used in literature. Qualitative and quantitative comparisons (on no-reference metrics) show that our method outperforms prior unsupervised methods.

【6】 Semi-supervised Domain Adaptive Structure Learning 标题:半监督领域自适应结构学习 链接:https://arxiv.org/abs/2112.06161

作者:Can Qin,Lichen Wang,Qianqian Ma,Yu Yin,Huan Wang,Yun Fu 机构:Boston University 摘要:半监督域自适应(SSDA)是一个相当具有挑战性的问题,要求方法同时克服1)对注释不足数据的过拟合和2)跨域的分布偏移。不幸的是,由于训练数据偏向标记样本,域自适应(DA)和半监督学习(SSL)方法的简单组合往往无法同时实现这两个目标。在本文中,我们引入一种自适应结构学习方法来规范SSL与DA的协作。受多视图学习的启发,我们提出的框架由一个共享特征编码器网络和两个目的相互对立的分类器网络组成。其中,一个分类器用于对目标特征进行聚类,以提高类内密度、扩大类别簇之间的间隔,实现鲁棒的表示学习。同时,另一个分类器作为正则化器,试图分散源特征以增强决策边界的平滑性。目标聚类和源扩散的迭代使目标特征被很好地包裹在相应源点的扩展边界内。为了同时处理跨域特征对齐和部分标记数据学习,我们应用最大平均差异(MMD)距离最小化和自训练(ST),将相互对立的结构投影到共享视图中,以做出可靠的最终决策。在DomainNet和Office-home等标准SSDA基准上的实验结果表明,与最先进的方法相比,我们的方法兼具准确性和鲁棒性。 摘要:Semi-supervised domain adaptation (SSDA) is quite a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains. Unfortunately, a simple combination of domain adaptation (DA) and semi-supervised learning (SSL) methods often fails to address these two objectives because of training data bias towards labeled samples. In this paper, we introduce an adaptive structure learning method to regularize the cooperation of SSL and DA. Inspired by the multi-views learning, our proposed framework is composed of a shared feature encoder network and two classifier networks, trained for contradictory purposes. Among them, one of the classifiers is applied to group target features to improve intra-class density, enlarging the gap of categorical clusters for robust representation learning. Meanwhile, the other classifier, serviced as a regularizer, attempts to scatter the source features to enhance the smoothness of the decision boundary. The iterations of target clustering and source expansion make the target features being well-enclosed inside the dilated boundary of the corresponding source points. For the joint address of cross-domain features alignment and partially labeled data learning, we apply the maximum mean discrepancy (MMD) distance minimization and self-training (ST) to project the contradictory structures into a shared view to make the reliable final decision. The experimental results over the standard SSDA benchmarks, including DomainNet and Office-home, demonstrate both the accuracy and robustness of our method over the state-of-the-art approaches.
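摘要中用于跨域特征对齐的MMD距离可如下示意(RBF核的有偏估计)。核带宽sigma为假设值,原文中的多核设置与整体训练流程以论文为准:

```python
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """基于 RBF 核的最大平均差异(MMD^2)的有偏估计。

    x:(N, D) 源域特征,y:(M, D) 目标域特征;返回标量,可直接作为损失最小化。
    """
    def kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        d2 = torch.cdist(a, b) ** 2                     # 成对平方欧氏距离
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# 用法示意:把源/目标域的特征分布拉近
loss_align = mmd_rbf(torch.randn(32, 128), torch.randn(40, 128))
```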

半弱无监督|主动学习|不确定性(8篇)

【1】 GCNDepth: Self-supervised Monocular Depth Estimation based on Graph Convolutional Network 标题:GCNDepth:基于图卷积网络的自监督单目深度估计 链接:https://arxiv.org/abs/2112.06782

作者:Armin Masoumian,Hatem A. Rashwan,Saddam Abdulwahab,Julian Cristiano,Domenec Puig 机构:Rovira i Virgili University 备注:10 pages, Submitted to IEEE transactions on intelligent transportation systems 摘要:深度估计是三维重建中一项具有挑战性的任务,旨在提高环境感知的准确性。这项工作带来了一个包含一系列改进的新解决方案,与现有方法相比,它增进了对深度图的定量和定性理解。最近,卷积神经网络(CNN)在从单目视频估计深度图方面展现了非凡能力。然而,传统CNN不支持拓扑结构,只能处理具有确定大小和权重的规则图像区域。另一方面,图卷积网络(GCN)可以处理非欧几里德数据上的卷积,并可应用于拓扑结构内的不规则图像区域。因此,为了保持目标的几何外观和分布,我们在这项工作中旨在将GCN用于自监督深度估计模型。我们的模型由两个并行的自动编码器网络组成:第一个自动编码器依靠ResNet-50从输入图像中提取特征,并基于多尺度GCN估计深度图;第二个网络则基于ResNet-18估计两个连续帧之间的自运动矢量(即3D位姿)。估计的3D位姿和深度图都将用于重建目标图像。与光度、投影和平滑度相关的损失函数组合用于处理糟糕的深度预测并保留对象的不连续性。特别是,我们的方法在公开的KITTI和Make3D数据集上取得了具有可比性且有前景的结果,预测精度高达89%,同时与最先进的方案相比,可训练参数数量减少了40%。源代码公开于 https://github.com/ArminMasoumian/GCNDepth.git 摘要:Depth estimation is a challenging task of 3D reconstruction to enhance the accuracy sensing of environment awareness. This work brings a new solution with a set of improvements, which increase the quantitative and qualitative understanding of depth maps compared to existing methods. Recently, a convolutional neural network (CNN) has demonstrated its extraordinary ability in estimating depth maps from monocular videos. However, traditional CNN does not support topological structure and they can work only on regular image regions with determined size and weights. On the other hand, graph convolutional networks (GCN) can handle the convolution on non-Euclidean data and it can be applied to irregular image regions within a topological structure. Therefore, in this work in order to preserve object geometric appearances and distributions, we aim at exploiting GCN for a self-supervised depth estimation model. Our model consists of two parallel auto-encoder networks: the first is an auto-encoder that will depend on ResNet-50 and extract the feature from the input image and on multi-scale GCN to estimate the depth map. In turn, the second network will be used to estimate the ego-motion vector (i.e., 3D pose) between two consecutive frames based on ResNet-18. Both the estimated 3D pose and depth map will be used for constructing a target image. A combination of loss functions related to photometric, projection, and smoothness is used to cope with bad depth prediction and preserve the discontinuities of the objects. In particular, our method provided comparable and promising results with a high prediction accuracy of 89% on the publicly KITTI and Make3D datasets along with a reduction of 40% in the number of trainable parameters compared to the state of the art solutions. The source code is publicly available at https://github.com/ArminMasoumian/GCNDepth.git
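自监督深度估计中常见的光度重建损失是SSIM与L1的加权组合(如Monodepth系列);以下PyTorch示意给出这一常见形式,GCNDepth各损失项的确切权重与平滑项请以原文为准:

```python
import torch
import torch.nn.functional as F

def photometric_loss(pred: torch.Tensor, target: torch.Tensor,
                     alpha: float = 0.85) -> torch.Tensor:
    """光度损失:alpha * SSIM 项 + (1 - alpha) * L1,输入为 (N, 3, H, W) 图像。"""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    mu_p, mu_t = F.avg_pool2d(pred, 3, 1, 1), F.avg_pool2d(target, 3, 1, 1)
    var_p = F.avg_pool2d(pred * pred, 3, 1, 1) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, 3, 1, 1) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, 3, 1, 1) - mu_p * mu_t
    c1, c2 = 0.01 ** 2, 0.03 ** 2                    # SSIM 稳定常数
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    ssim_term = ((1 - ssim) / 2).clamp(0, 1).mean(1, keepdim=True)
    return (alpha * ssim_term + (1 - alpha) * l1).mean()

# 用法示意:pred 为按估计深度与位姿从相邻帧重建的目标帧
loss = photometric_loss(torch.rand(2, 3, 192, 640), torch.rand(2, 3, 192, 640))
```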

【2】 Active learning with MaskAL reduces annotation effort for training Mask R-CNN 标题:使用Maskal的主动学习减少了训练面具R-CNN的注释工作 链接:https://arxiv.org/abs/2112.06586

作者:Pieter M. Blok,Gert Kootstra,Hakim Elchaoui Elghor,Boubacar Diallo,Frits K. van Evert,Eldert J. van Henten 机构:Wageningen University & Research, Wageningen, The Netherlands, Exxact Robotics, Epernay, France 备注:32 pages, 9 figures, 4 tables 摘要:卷积神经网络(CNN)的泛化性能受训练图像的数量、质量和多样性影响。训练图像必须进行注释,而这既耗时又昂贵。我们工作的目标是在保持性能的同时,减少训练CNN所需的注释图像数量。我们假设,通过确保训练图像集中包含大量难以分类的图像,可以更快地提高CNN的性能。本研究的目的是用一种能自动选择难分类图像的主动学习方法来检验这一假设。我们为基于掩模区域的CNN(Mask R-CNN)开发了一种主动学习方法,并将其命名为MaskAL。MaskAL对Mask R-CNN进行迭代训练,随后用训练好的模型选择一组模型对其不确定的未标记图像;这些被选中的图像经注释后用于重新训练Mask R-CNN,如此重复若干次采样迭代。在我们的研究中,Mask R-CNN在2500张西兰花图像上进行训练,这些图像是通过12次采样迭代,由MaskAL或随机抽样方法从包含14000张西兰花图像的训练集中选出的。在所有采样迭代中,MaskAL的表现都明显优于随机采样。此外,MaskAL在采样900张图像后达到了随机采样需2300张图像才能达到的性能。与在整个训练集(14000张图像)上训练的Mask R-CNN模型相比,MaskAL仅用17.9%的训练数据就达到了其93.9%的性能,而随机抽样用16.4%的训练数据只达到其81.9%的性能。我们的结论是,使用MaskAL可以减少在西兰花数据集上训练Mask R-CNN的注释工作量。我们的软件发布于 https://github.com/pieterblok/maskal 。 摘要:The generalisation performance of a convolutional neural network (CNN) is influenced by the quantity, quality, and variety of the training images. Training images must be annotated, and this is time consuming and expensive. The goal of our work was to reduce the number of annotated images needed to train a CNN while maintaining its performance. We hypothesised that the performance of a CNN can be improved faster by ensuring that the set of training images contains a large fraction of hard-to-classify images. The objective of our study was to test this hypothesis with an active learning method that can automatically select the hard-to-classify images. We developed an active learning method for Mask Region-based CNN (Mask R-CNN) and named this method MaskAL. MaskAL involved the iterative training of Mask R-CNN, after which the trained model was used to select a set of unlabelled images about which the model was uncertain. The selected images were then annotated and used to retrain Mask R-CNN, and this was repeated for a number of sampling iterations. In our study, Mask R-CNN was trained on 2500 broccoli images that were selected through 12 sampling iterations by either MaskAL or a random sampling method from a training set of 14,000 broccoli images. For all sampling iterations, MaskAL performed significantly better than the random sampling. Furthermore, MaskAL had the same performance after sampling 900 images as the random sampling had after 2300 images. Compared to a Mask R-CNN model that was trained on the entire training set (14,000 images), MaskAL achieved 93.9% of its performance with 17.9% of its training data. The random sampling achieved 81.9% of its performance with 16.4% of its training data. We conclude that by using MaskAL, the annotation effort can be reduced for training Mask R-CNN on a broccoli dataset. Our software is available on https://github.com/pieterblok/maskal.
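主动学习采样循环的骨架可如下示意:每轮用当前模型在未标注池上推理,按不确定性排序选出若干图像交给标注者。这里以"平均置信度最低"作为不确定性度量,MaskAL实际使用的度量与超参数以原文为准:

```python
import torch

def select_uncertain(conf_per_image, k: int):
    """从未标注池中挑出模型最不确定的 k 张图像的下标。

    conf_per_image[i]:第 i 张图像中各检测实例的置信度张量(可能为空)。
    """
    def uncertainty(c: torch.Tensor) -> float:
        return 1.0 if c.numel() == 0 else 1.0 - c.mean().item()  # 无检出视为最不确定
    order = sorted(range(len(conf_per_image)),
                   key=lambda i: uncertainty(conf_per_image[i]), reverse=True)
    return order[:k]

# 采样循环示意:训练 -> 在未标注池上推理 -> 选样 -> 人工标注 -> 并入训练集,重复若干轮
pool_scores = [torch.rand(n) for n in (3, 0, 5, 2)]
to_annotate = select_uncertain(pool_scores, k=2)
```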

【3】 Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing 标题:多模态互信息最大化:一种新的无监督深度跨模态散列方法 链接:https://arxiv.org/abs/2112.06489

作者:Tuan Hoang,Thanh-Toan Do,Tam V. Nguyen,Ngai-Man Cheung 摘要:在本文中,我们采用最大互信息(MI)方法来解决二进制哈希码的无监督学习问题,以实现高效的跨模式检索。我们提出了一种新的方法,称为交叉模态信息最大哈希(CMIMH)。首先,为了学习既能保持模态内相似性又能保持模态间相似性的信息表示,我们利用估计MI变化下限的最新进展来最大化二进制表示和输入特征之间以及不同模态的二进制表示之间的MI。在假设二进制表示由多元贝努利分布建模的情况下,通过联合最大化这些MIs,我们可以学习二进制表示,它可以通过梯度下降以小批量方式有效地保持模态内和模态间的相似性。此外,我们还发现,通过从不同的模态中学习相同实例的相似二进制表示来最小化模态间隙可能会导致信息量较少的表示。因此,在减少模态间隙和丢失模态私有信息之间取得平衡对于跨模态检索任务非常重要。对标准基准数据集的定量评估表明,该方法始终优于其他最先进的跨模态检索方法。 摘要:In this paper, we adopt the maximizing mutual information (MI) approach to tackle the problem of unsupervised learning of binary hash codes for efficient cross-modal retrieval. We proposed a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH). First, to learn informative representations that can preserve both intra- and inter-modal similarities, we leverage the recent advances in estimating variational lower-bound of MI to maximize the MI between the binary representations and input features and between binary representations of different modalities. By jointly maximizing these MIs under the assumption that the binary representations are modelled by multivariate Bernoulli distributions, we can learn binary representations, which can preserve both intra- and inter-modal similarities, effectively in a mini-batch manner with gradient descent. Furthermore, we find out that trying to minimize the modality gap by learning similar binary representations for the same instance from different modalities could result in less informative representations. Hence, balancing between reducing the modality gap and losing modality-private information is important for the cross-modal retrieval tasks. Quantitative evaluations on standard benchmark datasets demonstrate that the proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
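最大化两模态表示间互信息的一个常用变分下界是InfoNCE;下面给出其PyTorch示意(该损失即下界的相反数)。原文采用的MI下界估计器与多元伯努利建模细节以论文为准,此处仅说明"配对样本相似度应高于非配对样本"这一核心约束:

```python
import torch
import torch.nn.functional as F

def infonce_loss(z_img: torch.Tensor, z_txt: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE:以 batch 内其余样本为负例,最大化配对样本间互信息的下界。"""
    z_img = F.normalize(z_img, dim=1)          # (N, D) 图像模态表示
    z_txt = F.normalize(z_txt, dim=1)          # (N, D) 文本/另一模态表示
    logits = z_img @ z_txt.t() / tau           # (N, N) 相似度,温度缩放
    labels = torch.arange(z_img.size(0), device=z_img.device)  # 对角线为正例
    return F.cross_entropy(logits, labels)

# 用法示意:两个模态各自编码后的 64 位(连续松弛)哈希表示
loss_mi = infonce_loss(torch.randn(32, 64), torch.randn(32, 64))
```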

【4】 Semi-Supervised Contrastive Learning for Remote Sensing: Identifying Ancient Urbanization in the South Central Andes 标题:遥感的半监督对比学习:识别中南部安第斯山脉的古代城市化 链接:https://arxiv.org/abs/2112.06437

作者:Jiachen Xu,James Zimmer-Dauphinee,Quan Liu,Yuxuan Shi,Steven Wernke,Yuankai Huo 机构: School of Engineering, Vanderbilt University, Department of Anthropology, Vanderbilt University, Department of Electrical Engineering and Computer Science, Vanderbilt University 摘要:古聚落的探测是景观考古学的一个重点。传统上,定居点是通过行人调查来确定的,因为研究人员通过实地考察景观并记录定居点位置。最近,在卫星图像中对古代遗骸进行人工识别和标记增加了考古数据收集的规模,但这一过程仍然非常耗时和艰巨。自我监督学习(如对比学习)的发展为利用未标记卫星和历史航空图像定位考古遗址提供了一种可扩展的学习方案。然而,考古遗址仅占整个景观的很小比例,而现代对比监督学习方法在高度平衡的数据集上的表现通常较差,例如使用卫星图像在大面积上识别局部稀疏的古代城市化。在这项工作中,我们提出了一个框架来解决这个长尾问题。与现有的对比学习方法不同,该方法通常分别处理标记和未标记的数据,该方法在半监督环境下改革了学习范式,以充分利用宝贵的注释数据(在我们的环境中<7%)。具体地说,数据的高度不平衡性质被用作先验知识,通过对未注释图像块和注释锚图像之间的相似性进行排序来形成伪负对。在本研究中,我们使用95358张未标记图像和5830张标记图像来解决从长尾卫星图像数据集中检测古建筑的问题。从结果来看,我们的半监督对比学习模型达到了79.0%的测试平衡准确率,比最先进的方法提高了3.8%。 摘要:The detection of ancient settlements is a key focus in landscape archaeology. Traditionally, settlements were identified through pedestrian survey, as researchers physically traversed the landscape and recorded settlement locations. Recently the manual identification and labeling of ancient remains in satellite imagery have increased the scale of archaeological data collection, but the process remains tremendously time-consuming and arduous. The development of self-supervised learning (e.g., contrastive learning) offers a scalable learning scheme in locating archaeological sites using unlabeled satellite and historical aerial images. However, archaeology sites are only present in a very small proportion of the whole landscape, while the modern contrastive-supervised learning approach typically yield inferior performance on the highly balanced dataset, such as identifying sparsely localized ancient urbanization on a large area using satellite images. In this work, we propose a framework to solve this long-tail problem. As opposed to the existing contrastive learning approaches that typically treat the labeled and unlabeled data separately, the proposed method reforms the learning paradigm under a semi-supervised setting to fully utilize the precious annotated data (<7% in our setting). Specifically, the highly unbalanced nature of the data is employed as the prior knowledge to form pseudo negative pairs by ranking the similarities between unannotated image patches and annotated anchor images. In this study, we used 95,358 unlabeled images and 5,830 labeled images to solve the problem of detecting ancient buildings from a long-tailed satellite image dataset. From the results, our semi-supervised contrastive learning model achieved a promising testing balanced accuracy of 79.0%, which is 3.8% improvement over state-of-the-art approaches.

【5】 Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity 标题:利用视频连续性的自监督时空表示学习 链接:https://arxiv.org/abs/2112.05883

作者:Hanwen Liang,Niamul Quader,Zhixiang Chi,Lizhe Chen,Peng Dai,Juwei Lu,Yang Wang 机构:Huawei Noah’s Ark Lab, University of Manitoba, Canada 备注:Accepted at AAAI2022 摘要:最近的自监督视频表示学习方法通过探索视频的基本属性(例如速度、时间顺序等)取得了显著成功。这项工作利用了视频的一个基本但尚未被充分挖掘的属性——视频连续性(video continuity)——来获得自监督表示学习的监督信号。具体而言,我们提出了三个与连续性相关的新借口任务,即连续性证明、不连续性定位和缺失片段近似,它们共同监督一个共享主干进行视频表示学习。这种被称为连续性感知网络(CPNet)的自监督方法同时求解这三项任务,并鼓励主干网络学习局部和长程的运动与上下文表示。它在动作识别、视频检索和动作定位等多个下游任务上优于现有技术。此外,视频连续性可以与其他粗粒度视频属性在表示学习上互补,将所提出的借口任务集成到现有方法中可以带来很大的性能提升。 摘要:Recent self-supervised video representation learning methods have found significant success by exploring essential properties of videos, e.g. speed, temporal order, etc. This work exploits an essential yet under-explored property of videos, the video continuity, to obtain supervision signals for self-supervised representation learning. Specifically, we formulate three novel continuity-related pretext tasks, i.e. continuity justification, discontinuity localization, and missing section approximation, that jointly supervise a shared backbone for video representation learning. This self-supervision approach, termed as Continuity Perception Network (CPNet), solves the three tasks altogether and encourages the backbone network to learn local and long-ranged motion and context representations. It outperforms prior arts on multiple downstream tasks, such as action recognition, video retrieval, and action localization. Additionally, the video continuity can be complementary to other coarse-grained video properties for representation learning, and integrating the proposed pretext task to prior arts can yield much performance gains.

【6】 Revisiting Consistency Regularization for Semi-Supervised Learning 标题:半监督学习的重访一致性正则化方法 链接:https://arxiv.org/abs/2112.05825

作者:Yue Fan,Anna Kukleva,Bernt Schiele 机构:Max Planck Institute for Informatics, Saarbr¨ucken, Germany, Saarland Informatics Campus 备注:Published at GCPR2021 as a conference paper 摘要:一致性正则化是半监督学习(SSL)中应用最广泛的技术之一。通常,目标是训练一个对各种数据扩充保持不变的模型。在本文中,我们重新审视这一观点,发现通过减少不同增强图像的特征之间的距离来增强不变性可以提高性能。但是,通过增加特征距离,鼓励等变,可以进一步提高性能。为此,我们提出了一种改进的一致性正则化框架,该框架采用了一种简单而有效的技术FeatDistLoss,该技术分别在分类器和特征层施加一致性和等变性。实验结果表明,我们的模型为各种数据集和设置定义了一种新的技术状态,并且在很大程度上优于以前的工作,特别是在低数据状态下。我们进行了大量的实验来分析该方法,代码将发布。 摘要:Consistency regularization is one of the most widely-used techniques for semi-supervised learning (SSL). Generally, the aim is to train a model that is invariant to various data augmentations. In this paper, we revisit this idea and find that enforcing invariance by decreasing distances between features from differently augmented images leads to improved performance. However, encouraging equivariance instead, by increasing the feature distance, further improves performance. To this end, we propose an improved consistency regularization framework by a simple yet effective technique, FeatDistLoss, that imposes consistency and equivariance on the classifier and the feature level, respectively. Experimental results show that our model defines a new state of the art for various datasets and settings and outperforms previous work by a significant margin, particularly in low data regimes. Extensive experiments are conducted to analyze the method, and the code will be published.
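下面用PyTorch粗略示意"分类器层面施加一致性、特征层面鼓励等变性"的组合损失:前者拉近同一图像两种增强视图的预测,后者推远其特征。具体的距离度量与权重为本文的假设,FeatDistLoss的确切形式以原文为准:

```python
import torch
import torch.nn.functional as F

def consistency_equivariance_loss(feat1: torch.Tensor, feat2: torch.Tensor,
                                  logits1: torch.Tensor, logits2: torch.Tensor,
                                  w: float = 0.1) -> torch.Tensor:
    """同一图像的两种增强:预测保持一致(不变性),特征保持差异(等变性)。"""
    consistency = F.mse_loss(logits1.softmax(dim=1), logits2.softmax(dim=1))
    # 最小化余弦相似度 => 增大特征间距离,鼓励对增强的等变表示
    equivariance = F.cosine_similarity(feat1, feat2, dim=1).mean()
    return consistency + w * equivariance

# 用法示意:feat 为特征层输出,logits 为分类器输出
loss = consistency_equivariance_loss(torch.randn(16, 512), torch.randn(16, 512),
                                     torch.randn(16, 10), torch.randn(16, 10))
```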

【7】 Unsupervised Image to Image Translation for Multiple Retinal Pathology Synthesis in Optical Coherence Tomography Scans 标题:光学相干层析扫描中多视网膜病变综合的无监督图像到图像转换 链接:https://arxiv.org/abs/2112.06031

作者:Hemanth Pasupuleti,G. N. Girish 机构:Department of Computer Science and Engineering, Indian Institute of Information Technology, Sri City, India. 摘要:图像到图像的翻译(I2I)是一个具有挑战性的计算机视觉问题,在许多领域中用于多种任务。近年来,眼科成为I2I应用迅速增长的主要领域之一。其中一个应用是生成合成视网膜光学相干断层扫描(OCT)。现有的I2I方法需要对多个模型进行训练,以将正常扫描的图像转换为特定的病理学:由于这些模型的复杂性,限制了这些模型的使用。为了解决这个问题,我们提出了一种无监督的多域I2I网络,该网络具有预先训练的编码器,可以将一个域中的视网膜OCT图像转换为多个域。我们假设图像分解为域不变的内容和域特定的样式代码,并对这些样式代码进行预训练。进行的实验表明,该模型优于MUNIT和CycleGAN等综合多种病理扫描的最新模型。 摘要:Image to Image Translation (I2I) is a challenging computer vision problem used in numerous domains for multiple tasks. Recently, ophthalmology became one of the major fields where the application of I2I is increasing rapidly. One such application is the generation of synthetic retinal optical coherence tomographic (OCT) scans. Existing I2I methods require training of multiple models to translate images from normal scans to a specific pathology: limiting the use of these models due to their complexity. To address this issue, we propose an unsupervised multi-domain I2I network with pre-trained style encoder that translates retinal OCT images in one domain to multiple domains. We assume that the image splits into domain-invariant content and domain-specific style codes, and pre-train these style codes. The performed experiments show that the proposed model outperforms state-of-the-art models like MUNIT and CycleGAN synthesizing diverse pathological scans.

【8】 Learning Representations with Contrastive Self-Supervised Learning for Histopathology Applications 标题:基于对比自监督学习的学习表征在组织病理学中的应用 链接:https://arxiv.org/abs/2112.05760

作者:Karin Stacke,Jonas Unger,Claes Lundström,Gabriel Eilertsen 机构:Department of Science and Technology (ITN), Link¨oping University, Sweden, Sectra AB, Link¨oping, Center for Medical Image Science and Visualization (CMIV), Link¨oping University, Sweden, Claes Lundstr¨om 备注:Pre-print, in submission Journal of Machine Learning for Biomedical Imaging 摘要:在过去的几年中,无监督学习取得了长足的进步,尤其是通过对比自我监督学习。基准自监督学习的主要数据集是ImageNet,最近的方法接近于通过完全监督训练获得的性能。然而,ImageNet数据集在很大程度上是以对象为中心的,目前尚不清楚这些方法在非以对象为中心的广泛不同的数据集和任务(如数字病理学)上有何潜力。虽然自我监督学习已开始在这一领域进行探索,并取得了令人鼓舞的成果,但有理由更仔细地研究这种设置与自然图像和ImageNet的区别。在本文中,我们对组织病理学的对比学习进行了深入的分析,指出了由于组织病理学数据的特点,对比目标将如何表现出不同的行为。我们提出了一些考虑因素,例如用于对比目标的视图生成和超参数调整。在一系列大型实验中,我们分析了这些因素对组织分类下游性能的影响。结果表明,对比学习可以减少数字病理学中的注释工作,但需要考虑特定的数据集特征。为了充分利用对比学习目标,需要对视图生成和超参数进行不同的校准。我们的结果为实现组织病理学应用中自我监督学习的全部潜力铺平了道路。 摘要:Unsupervised learning has made substantial progress over the last few years, especially by means of contrastive self-supervised learning. The dominating dataset for benchmarking self-supervised learning has been ImageNet, for which recent methods are approaching the performance achieved by fully supervised training. The ImageNet dataset is however largely object-centric, and it is not clear yet what potential those methods have on widely different datasets and tasks that are not object-centric, such as in digital pathology. While self-supervised learning has started to be explored within this area with encouraging results, there is reason to look closer at how this setting differs from natural images and ImageNet. In this paper we make an in-depth analysis of contrastive learning for histopathology, pin-pointing how the contrastive objective will behave differently due to the characteristics of histopathology data. We bring forward a number of considerations, such as view generation for the contrastive objective and hyper-parameter tuning. In a large battery of experiments, we analyze how the downstream performance in tissue classification will be affected by these considerations. The results point to how contrastive learning can reduce the annotation effort within digital pathology, but that the specific dataset characteristics need to be considered. To take full advantage of the contrastive learning objective, different calibrations of view generation and hyper-parameters are required. Our results pave the way for realizing the full potential of self-supervised learning for histopathology applications.

时序|行为识别|姿态|视频|运动估计(7篇)

【1】 hARMS: A Hardware Acceleration Architecture for Real-Time Event-Based Optical Flow 标题:hARMS:一种面向实时基于事件光流的硬件加速体系结构 链接:https://arxiv.org/abs/2112.06772

作者:Daniel C. Stumpp,Himanshu Akolkar,Alan D. George,Ryad B. Benosman 机构:Biomedical Science Tower, University of Pittsburgh, Pittsburgh, PA, USA; CNRS, UMRS, UMR, INSERM UMRI S, Institut de la Vision, Sorbonne Université (UPMC University Paris), Paris, France 备注:18 pages, 16 figures, 4 tables 摘要:基于事件的视觉传感器根据视觉场景的变化生成具有高时间分辨率的异步事件流。这些传感器的特性允许在事件生成的同时准确快速地计算光流。现有的从事件数据计算光流的解决方案,要么由于孔径问题无法捕捉运动的真实方向,要么没有利用传感器的高时间分辨率,要么计算成本太高,无法在嵌入式平台上实时运行。在这项研究中,我们首先提出了先前算法ARMS(孔径鲁棒多尺度流)的更快版本。新的优化软件版本(fARMS)显著提高了传统CPU上的吞吐量。此外,我们还介绍了hARMS,这是fARMS算法的硬件实现,允许在低功耗嵌入式平台上实时计算真实光流。提出的hARMS体系结构以混合片上系统(SoC)设备为目标,旨在最大限度地提高可配置性和吞吐量。硬件体系结构和fARMS算法的开发考虑到了异步神经形态处理,放弃了事件帧的常见用法,转而仅使用相关事件的小段历史进行运算,使延迟可以独立于传感器分辨率进行扩展。与现有方法相比,处理范式的这一变化将流方向估计最多改善了73%,并在所选基准配置上实测得到高达1.21 Mevent/s的hARMS吞吐量。这一吞吐量实现了实时性能,并使其成为迄今已知最快的孔径鲁棒的基于事件光流实现。 摘要:Event-based vision sensors produce asynchronous event streams with high temporal resolution based on changes in the visual scene. The properties of these sensors allow for accurate and fast calculation of optical flow as events are generated. Existing solutions for calculating optical flow from event data either fail to capture the true direction of motion due to the aperture problem, do not use the high temporal resolution of the sensor, or are too computationally expensive to be run in real time on embedded platforms. In this research, we first present a faster version of our previous algorithm, ARMS (Aperture Robust Multi-Scale flow). The new optimized software version (fARMS) significantly improves throughput on a traditional CPU. Further, we present hARMS, a hardware realization of the fARMS algorithm allowing for real-time computation of true flow on low-power, embedded platforms. The proposed hARMS architecture targets hybrid system-on-chip devices and was designed to maximize configurability and throughput. The hardware architecture and fARMS algorithm were developed with asynchronous neuromorphic processing in mind, abandoning the common use of an event frame and instead operating using only a small history of relevant events, allowing latency to scale independently of the sensor resolution. This change in processing paradigm improved the estimation of flow directions by up to 73% compared to the existing method and yielded a demonstrated hARMS throughput of up to 1.21 Mevent/s on the benchmark configuration selected. This throughput enables real-time performance and makes it the fastest known realization of aperture-robust, event-based optical flow to date.

【2】 VirtualCube: An Immersive 3D Video Communication System 标题:VirtualCube:一种沉浸式3D视频通信系统 链接:https://arxiv.org/abs/2112.06730

作者:Yizhong Zhang,Jiaolong Yang,Zhen Liu,Ruicheng Wang,Guojun Chen,Xin Tong,Baining Guo 机构:Fellow, IEEE 摘要:VirtualCube系统是一种3D视频会议系统,它试图克服传统技术的一些限制。其关键要素是VirtualCube,即真实世界隔间的抽象表示,其中装有RGBD摄像头,用于捕捉用户的3D几何体和纹理。我们设计VirtualCube的目标是使数据捕获任务标准化并大大简化,并且一切都可以使用现成的硬件构建。我们将VirtualCube用作虚拟会议环境的基本构建块,并为每个VirtualCube用户提供一个显示远程参与者真人大小视频的环绕显示器。为了实现远程参与者的实时渲染,我们开发了V-Cube View算法,该算法使用多视图立体进行更精确的深度估计,并使用Lumi-Net渲染以获得更好的渲染质量。VirtualCube系统正确地保留了参与者之间的相互注视,使他们能够建立目光接触,并意识到谁在视觉上关注他们。该系统还允许参与者与远程参与者进行边侧讨论,就像他们在同一个房间一样。最后,该系统为如何支持工作项(如文档和应用程序)的共享空间以及如何跟踪参与者对工作项的视觉注意提供了启示。 摘要:The VirtualCube system is a 3D video conference system that attempts to overcome some limitations of conventional technologies. The key ingredient is VirtualCube, an abstract representation of a real-world cubicle instrumented with RGBD cameras for capturing the 3D geometry and texture of a user. We design VirtualCube so that the task of data capturing is standardized and significantly simplified, and everything can be built using off-the-shelf hardware. We use VirtualCubes as the basic building blocks of a virtual conferencing environment, and we provide each VirtualCube user with a surrounding display showing life-size videos of remote participants. To achieve real-time rendering of remote participants, we develop the V-Cube View algorithm, which uses multi-view stereo for more accurate depth estimation and Lumi-Net rendering for better rendering quality. The VirtualCube system correctly preserves the mutual eye gaze between participants, allowing them to establish eye contact and be aware of who is visually paying attention to them. The system also allows a participant to have side discussions with remote participants as if they were in the same room. Finally, the system sheds light on how to support the shared space of work items (e.g., documents and applications) and track the visual attention of participants to work items.

【3】 SVIP: Sequence VerIfication for Procedures in Videos 标题:SVIP:视频程序序列校验 链接:https://arxiv.org/abs/2112.06447

作者:Yicheng Qian,Weixin Luo,Dongze Lian,Xu Tang,Peilin Zhao,Shenghua Gao 机构:ShanghaiTech University,Meituan,Xiaohongshu Inc.,Tencent AI Lab 摘要:在本文中,我们提出了一种新的序列验证任务,旨在区分执行相同动作序列的正视频对,与经过步骤级变换但仍完成相同任务的负视频对。这一具有挑战性的任务处于开放集设置下,无需事先进行动作检测或分割,因而不需要事件级甚至帧级的注释。为此,我们仔细地重新组织了两个公开可用、具有"步骤-流程-任务"结构的动作相关数据集。为了充分检验各种方法的有效性,我们收集了一个脚本化视频数据集,枚举了化学实验中的各类步骤级变换。此外,还引入了一种新的评估指标"加权距离比"(Weighted Distance Ratio),以确保评估过程中不同步骤级变换的等价性。最后,我们引入了一种基于Transformer、带有新型序列对齐损失的简单而有效的基线,能够更好地刻画步骤之间的长期依赖性,其性能优于其他动作识别方法。代码和数据将会公开。 摘要:In this paper, we propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations but still conducting the same task. Such a challenging task resides in an open-set setting without prior action detection or segmentation that requires event-level or even frame-level annotations. To that end, we carefully reorganize two publicly available action-related datasets with step-procedure-task structure. To fully investigate the effectiveness of any method, we collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments. Besides, a novel evaluation metric Weighted Distance Ratio is introduced to ensure equivalence for different step-level transformations during evaluation. In the end, a simple but effective baseline based on the transformer with a novel sequence alignment loss is introduced to better characterize long-term dependency between steps, which outperforms other action recognition methods. Codes and data will be released.

【4】 Local and Global Point Cloud Reconstruction for 3D Hand Pose Estimation 标题:用于三维手势估计的局部和全局点云重建 链接:https://arxiv.org/abs/2112.06389

作者:Ziwei Yu,Linlin Yang,Shicheng Chen,Angela Yao 机构: National University of Singapore, University of Bonn, Germany 备注:The British Machine Vision Conference (BMVC) 摘要:本文讨论了从单个RGB图像重建人手的三维点云和三维姿势估计。为此,我们提出了一种新的管道,用于局部和全局点云重建,使用三维手模板,同时学习潜在的姿势估计表示。为了演示我们的方法,我们引入了一个新的多视图手姿势数据集,以获得真实世界中完整的手的三维点云。在我们新提出的数据集和四个公共基准上的实验证明了该模型的优势。在重建逼真的完整三维手点云时,我们的方法在三维姿态估计方面优于竞争对手。 摘要:This paper addresses the 3D point cloud reconstruction and 3D pose estimation of the human hand from a single RGB image. To that end, we present a novel pipeline for local and global point cloud reconstruction using a 3D hand template while learning a latent representation for pose estimation. To demonstrate our method, we introduce a new multi-view hand posture dataset to obtain complete 3D point clouds of the hand in the real world. Experiments on our newly proposed dataset and four public benchmarks demonstrate the model's strengths. Our method outperforms competitors in 3D pose estimation while reconstructing realistic-looking complete 3D hand point clouds.

【5】 Video as Conditional Graph Hierarchy for Multi-Granular Question Answering 标题:用于多粒度问答的视频作为条件图层次结构 链接:https://arxiv.org/abs/2112.06197

作者:Junbin Xiao,Angela Yao,Zhiyuan Liu,Yicong Li,Wei Ji,Tat-Seng Chua 机构:Department of Computer Science, National University of Singapore 备注:Accepted to prepresent at AAAI'22 摘要:视频问答要求模型理解并推理复杂的视频和语言数据,以正确得出答案。现有的工作集中在设计复杂的跨模态交互,以融合来自两种模态的信息,同时将视频和问题整体编码为帧和单词序列。尽管这些方法取得了成功,但它们基本上都是围绕着视频和问题内容的顺序性展开的,对问题的回答缺乏洞察,也缺乏可解释性。在这项工作中,我们认为,虽然视频是以帧顺序呈现的,但视觉元素(例如,对象、动作、活动和事件)在语义空间中不是顺序的,而是分层的。为了与语言查询中语言概念的多粒度本质保持一致,我们建议将视频建模为一个条件图层次结构,在相应文本线索的指导下,以层次方式将不同粒度的视觉事实编织在一起。尽管简单,我们的大量实验证明了这种条件层次图结构的优越性,与以前的方法相比,性能有了明显的改进,并且在不同类型的问题上也有了更好的泛化。进一步的分析也巩固了模型的可靠性,因为它为预测的答案提供了有意义的视觉文本证据。 摘要:Video question answering requires models to understand and reason about both complex video and language data to correctly derive answers. Existing efforts focus on designing sophisticated cross-modal interactions to fuse the information from two modalities, while encoding the video and question holistically as frame and word sequences. Despite their success, these methods are essentially revolving around the sequential nature of video- and question-contents, providing little insight to the problem of question-answering and lacking interpretability as well. In this work, we argue that while video is presented in frame sequence, the visual elements (eg, objects, actions, activities and events) are not sequential but rather hierarchical in semantic space. To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner, with the guidance of corresponding textual cues. Despite the simplicity, our extensive experiments demonstrate the superiority of such conditional hierarchical graph architecture, with clear performance improvements over prior methods and also better generalization across different type of questions. Further analyses also consolidate the model's reliability as it shows meaningful visual-textual evidences for the predicted answers.

【6】 COMPOSER: Compositional Learning of Group Activity in Videos 标题:Composer:视频中小组活动的构图学习 链接:https://arxiv.org/abs/2112.05892

作者:Honglu Zhou,Asim Kadav,Aviv Shamsian,Shijie Geng,Farley Lai,Long Zhao,Ting Liu,Mubbasir Kapadia,Hans Peter Graf 机构: Department of Computer Science, Rutgers University, Piscataway, NJ, USA, NEC Laboratories America, Inc., San Jose, CA, USA, Bar-Ilan University, Israel, Google LLC, Menlo Park, CA, USA 摘要:组活动识别(GAR)检测短视频剪辑中一组演员执行的活动。该任务需要对场景实体进行组合式理解,并对它们之间的关系进行推理。我们通过将视频建模为一系列表示视频中多尺度语义概念的标记来实现GAR。我们提出COMPOSER,这是一种基于多尺度Transformer的体系结构,它在每个尺度上对标记执行基于注意的推理,并以组合方式学习组活动。此外,我们只使用关键点模态,减少了场景偏差,提高了模型的泛化能力。我们通过对中间尺度表示进行聚类来改进COMPOSER中的多尺度表示,同时在尺度之间保持一致的聚类分配。最后,我们使用辅助预测和新的数据增强(如Actor Dropout,演员丢弃)等技术来辅助模型训练(示意实现见下方代码)。我们在具有挑战性的排球数据集上展示了该模型的能力和可解释性。仅使用关键点模态,COMPOSER就取得了94.5%的最新最优准确率。COMPOSER的性能优于最新的基于RGB信号的GAR方法,并且与利用多种模态的方法相比也表现出色。我们的代码将会公开。 摘要:Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip. The task requires the compositional understanding of scene entities and relational reasoning between them. We approach GAR by modeling the video as a series of tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, we only use the keypoint modality which reduces scene biases and improves the generalization ability of the model. We improve the multi-scale representations in COMPOSER by clustering the intermediate scale representations, while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and novel data augmentations (e.g., Actor Dropout) to aid model training. We demonstrate the model's strength and interpretability on the challenging Volleyball dataset. COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality. COMPOSER outperforms the latest GAR methods that rely on RGB signals, and performs favorably compared against methods that exploit multiple modalities. Our code will be available.
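文中提到的Actor Dropout是一种作用于关键点输入的简单数据增强。下面给出一个假设性的示意实现(张量布局与丢弃粒度均为本文假设,细节以原文为准):

```python
import torch

def actor_dropout(keypoints, p=0.2):
    """示意:以概率p将整个演员的关键点置零。
    keypoints: (B, T, A, K, 2),分别为批次、帧、演员、关键点、坐标。"""
    B, T, A = keypoints.shape[:3]
    keep = (torch.rand(B, 1, A, 1, 1) > p).float()  # 每个演员整段序列统一保留或丢弃
    return keypoints * keep

kp = torch.randn(2, 16, 12, 17, 2)   # 排球场景示例:12名球员、17个COCO关键点
print(actor_dropout(kp).shape)       # 形状不变,部分演员的关键点被整体置零
```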

【7】 Information Prebuilt Recurrent Reconstruction Network for Video Super-Resolution 标题:用于视频超分辨率的信息预建递归重建网络 链接:https://arxiv.org/abs/2112.05755

作者:Ming Yu,Shuyun Wang,Cuihong Xue,Yingchun Guo,Gang Yan 机构:Tianjin University ofTechnology 备注:11 pages,9 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 摘要:基于递归卷积网络的视频超分辨率(VSR)方法对视频序列具有很强的时间建模能力。然而,单向递归卷积网络中不同递归单元接收到的输入信息是不平衡的。早期重建帧接收的时间信息较少,导致结果模糊或伪影。虽然双向递归卷积网络可以缓解这一问题,但它大大增加了重建时间和计算复杂度。它也不适用于许多应用场景,例如在线超分辨率。为了解决上述问题,我们提出了一种端到端的信息预建递归重建网络(IPRRN),由信息预建网络(IPNet)和递归重建网络(RRNet)组成。通过整合来自视频前端的足够信息来构建初始重现单元所需的隐藏状态,以帮助恢复早期帧,信息预构建网络在没有反向传播的情况下平衡前后的输入信息差异。此外,我们还演示了一个紧凑的循环重建网络,该网络在恢复质量和时间效率方面有显著改进。大量实验验证了我们提出的网络的有效性,与现有的先进方法相比,我们的方法可以有效地实现更高的定量和定性评价性能。 摘要:The video super-resolution (VSR) method based on the recurrent convolutional network has strong temporal modeling capability for video sequences. However, the input information received by different recurrent units in the unidirectional recurrent convolutional network is unbalanced. Early reconstruction frames receive less temporal information, resulting in fuzzy or artifact results. Although the bidirectional recurrent convolution network can alleviate this problem, it greatly increases reconstruction time and computational complexity. It is also not suitable for many application scenarios, such as online super-resolution. To solve the above problems, we propose an end-to-end information prebuilt recurrent reconstruction network (IPRRN), consisting of an information prebuilt network (IPNet) and a recurrent reconstruction network (RRNet). By integrating sufficient information from the front of the video to build the hidden state needed for the initially recurrent unit to help restore the earlier frames, the information prebuilt network balances the input information difference before and after without backward propagation. In addition, we demonstrate a compact recurrent reconstruction network, which has significant improvements in recovery quality and time efficiency. Many experiments have verified the effectiveness of our proposed network, and compared with the existing state-of-the-art methods, our method can effectively achieve higher quantitative and qualitative evaluation performance.
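下面用一个玩具模型示意"信息预建"的思路(结构纯属假设,仅说明用视频前若干帧聚合出初始隐藏状态、再做单向循环重建的流程;上采样等超分细节从略):

```python
import torch
import torch.nn as nn

class TinyRecurrentVSR(nn.Module):
    """示意:IPNet的极简替身聚合前5帧得到初始隐藏状态,RRNet做单向循环重建。"""
    def __init__(self, ch=16):
        super().__init__()
        self.prebuilt = nn.Conv2d(3 * 5, ch, 3, padding=1)   # 聚合视频前5帧
        self.cell = nn.Conv2d(3 + ch, ch, 3, padding=1)      # 循环单元
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, frames):               # frames: (B, T, 3, H, W),T >= 5
        B, T, C, H, W = frames.shape
        h = torch.relu(self.prebuilt(frames[:, :5].reshape(B, -1, H, W)))
        outs = []
        for t in range(T):
            h = torch.relu(self.cell(torch.cat([frames[:, t], h], dim=1)))
            outs.append(self.out(h))          # 真实模型在此处还需上采样,这里省略
        return torch.stack(outs, dim=1)

y = TinyRecurrentVSR()(torch.randn(1, 8, 3, 32, 32))
print(y.shape)  # torch.Size([1, 8, 3, 32, 32])
```

这样,最早的几帧也能拿到由后续帧"预建"的时间信息,而无需双向网络的反向传播开销。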

医学相关(3篇)

【1】 Improving Performance of Federated Learning based Medical Image Analysis in Non-IID Settings using Image Augmentation 标题:利用图像增强提高非IID环境下基于联邦学习的医学图像分析性能 链接:https://arxiv.org/abs/2112.06194

作者:Alper Emin Cetinkaya,Dr. Murat Akin,Prof. Dr. Seref Sagiroglu 机构:Information Security Program, Ankara, Turkey, Murat Akin, Gazi AI Center of Gazi University, Basarsoft Information Systems Inc., Seref Sagiroglu, Computer Engineering Dept., Gazi AI Center, Gazi University 摘要:联邦学习(FL)是一种适合利用患者、个人、公司或行业的敏感数据的解决方案,这些数据必须在严格的隐私约束下使用。FL主要或部分解决了数据隐私和安全问题,并提供了一种替代的建模方案,使多个边缘设备或组织能够利用各自的本地数据共同训练一个全局模型,而无需集中这些数据。FL的分布式特性所导致的非IID数据会带来显著的性能退化和不稳定的偏斜。本文提出了一种通过图像增强动态平衡客户端数据分布的新方法(示意见下方代码),以解决FL的非IID数据问题。在高度非IID的FL设置下,该方法显著稳定了模型训练,并将胸部X光图像多种胸部疾病检测的测试精度从83.22%提高到89.43%。IID、非IID以及采用所提方法的非IID联邦训练结果表明,该方法有望鼓励组织或研究人员开发更好的系统,在尊重数据隐私的前提下从数据中获取价值,这不仅适用于医疗保健领域,也适用于其他领域。 摘要:Federated Learning (FL) is a suitable solution for making use of sensitive data belonging to patients, people, companies, or industries that are obligatory to work under rigid privacy constraints. FL mainly or partially supports data privacy and security issues and provides an alternative to model problems facilitating multiple edge devices or organizations to contribute a training of a global model using a number of local data without having them. Non-IID data of FL caused from its distributed nature presents a significant performance degradation and stabilization skews. This paper introduces a novel method dynamically balancing the data distributions of clients by augmenting images to address the non-IID data problem of FL. The introduced method remarkably stabilizes the model training and improves the model's test accuracy from 83.22% to 89.43% for multi-chest diseases detection of chest X-ray images in highly non-IID FL setting. The results of IID, non-IID and non-IID with proposed method federated trainings demonstrated that the proposed method might help to encourage organizations or researchers in developing better systems to get values from data with respect to data privacy not only for healthcare but also other fields.
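该方法的核心是让每个客户端用增强样本补齐本地类别分布。下面是一个示意计算(假设性的简化:以本地样本最多的类别为目标,统计每类需要通过旋转、翻转等增强生成的数量):

```python
import numpy as np

def augmentation_plan(class_counts):
    """示意:计算某客户端各类别需增强生成的图像数,使本地分布趋于均衡。
    class_counts: 该客户端每个类别的样本数。"""
    counts = np.asarray(class_counts)
    target = counts.max()           # 以样本最多的类别为对齐目标(假设性选择)
    return target - counts          # 每类需要补足的增强样本数

# 某客户端4个类别的本地样本数
print(augmentation_plan([120, 30, 5, 60]))  # [  0  90 115  60]
```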

【2】 Automated assessment of disease severity of COVID-19 using artificial intelligence with synthetic chest CT 标题:应用人工智能和合成胸部CT自动评估冠状病毒的病情严重程度 链接:https://arxiv.org/abs/2112.05900

作者:Mengqiu Liu,Ying Liu,Yidong Yang,Aiping Liu,Shana Li,Changbing Qu,Xiaohui Qiu,Yang Li,Weifu Lv,Peng Zhang,Jie Wen 机构:Department of Radiology, The First Affiliated Hospital of USTC, Hefei, China; School of Physical Sciences, University of Science and Technology of China, Hefei; Electronic Science and Technology, University of Science and Technology, Hefei 摘要:背景:患者分诊对于控制2019年冠状病毒病(COVID-19)的流行非常重要,尤其是在临床资源极其有限的大流行高峰期。目的:建立一种利用合成胸部CT自动分割和量化肺部和肺炎病灶,并评估COVID-19患者疾病严重程度的方法。材料和方法:在本研究中,我们利用公开数据集(来自"Lung Nodule Analysis 2016"的285个数据集)结合数据增强生成合成胸部CT图像。合成图像和掩模被用于训练2D U-net神经网络,并在203个COVID-19数据集上进行测试,以生成肺和病变的分割。根据分割结果计算疾病严重程度评分(DL:损伤负荷;DS:损伤评分),示意计算见下方代码。使用Pearson方法评估DL/DS和临床实验室检查之间的相关性,p值<0.05被认为具有统计学意义。结果:将自动肺和病变分割与手动注释进行了比较。对于肺分割,骰子相似系数、Jaccard指数和平均表面距离的中值分别为98.56%、97.15%和0.49mm。病灶分割的相同指标分别为76.95%、62.54%和2.36mm。DL/DS和淋巴细胞百分比检查之间存在显著(p<0.05)相关性,r值分别为-0.561和-0.501。结论:提出了一种基于胸部影像学和数据增强的AI系统,用于COVID-19患者肺和病变的分割。成像结果与临床实验室检查之间的相关性表明,该系统有潜力作为评估COVID-19疾病严重程度的工具。 摘要:Background: Triage of patients is important to control the pandemic of coronavirus disease 2019 (COVID-19), especially during the peak of the pandemic when clinical resources become extremely limited. Purpose: To develop a method that automatically segments and quantifies lung and pneumonia lesions with synthetic chest CT and assess disease severity in COVID-19 patients. Materials and Methods: In this study, we incorporated data augmentation to generate synthetic chest CT images using public available datasets (285 datasets from "Lung Nodule Analysis 2016"). The synthetic images and masks were used to train a 2D U-net neural network and tested on 203 COVID-19 datasets to generate lung and lesion segmentations. Disease severity scores (DL: damage load; DS: damage score) were calculated based on the segmentations. Correlations between DL/DS and clinical lab tests were evaluated using Pearson's method. A p-value < 0.05 was considered statistically significant. Results: Automatic lung and lesion segmentations were compared with manual annotations. For lung segmentation, the median values of dice similarity coefficient, Jaccard index and average surface distance, were 98.56%, 97.15% and 0.49 mm, respectively. The same metrics for lesion segmentation were 76.95%, 62.54% and 2.36 mm, respectively. Significant (p < 0.05) correlations were found between DL/DS and percentage lymphocytes tests, with r-values of -0.561 and -0.501, respectively. Conclusion: An AI system based on thoracic radiographs and data augmentation was proposed to segment lung and lesions in COVID-19 patients. Correlations between imaging findings and clinical lab tests suggested the value of this system as a potential tool to assess disease severity of COVID-19.
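疾病严重程度评分可以直接由分割掩模计算。下面是一个示意(DL按"病灶体素数/肺体素数"近似,具体定义与DS的计算以原文为准),并附上与化验指标做Pearson相关分析的写法:

```python
import numpy as np
from scipy.stats import pearsonr

def damage_load(lesion_mask, lung_mask):
    """示意:由分割掩模计算损伤负荷,近似为病灶体素数与肺体素数之比。"""
    return lesion_mask.sum() / max(lung_mask.sum(), 1)

# 构造玩具三维掩模
lung = np.zeros((64, 64, 32), dtype=np.uint8); lung[8:56, 8:56, :] = 1
lesion = np.zeros_like(lung); lesion[20:30, 20:30, 5:10] = 1
print(f"DL = {damage_load(lesion, lung):.4f}")

# 与淋巴细胞百分比等化验指标的Pearson相关(此处用模拟数据演示接口)
dls = np.random.rand(20)
lympho = 40 - 25 * dls + np.random.randn(20)
r, p = pearsonr(dls, lympho)
print(r, p)
```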

【3】 Edge-Enhanced Dual Discriminator Generative Adversarial Network for Fast MRI with Parallel Imaging Using Multi-view Information 标题:边缘增强型双区分器生成对抗网络多视角并行成像快速磁共振成像 链接:https://arxiv.org/abs/2112.05758

作者:Jiahao Huang,Weiping Ding,Jun Lv,Jingwen Yang,Hao Dong,Javier Del Ser,Jun Xia,Tiaojuan Ren,Stephen Wong,Guang Yang 机构:Received: date Accepted: date, College of Information Science and Technology, Zhejiang Shuren University, Hangzhou, National Heart and Lung Institute, Imperial College London, London, United Kingdom 备注:33 pages, 13 figures, Applied Intelligence 摘要:在临床医学中,磁共振成像(MRI)是诊断、分类、预后和治疗计划的最重要工具之一。然而,由于数据是在k空间中顺序采集的,因此MRI存在固有的缓慢数据采集过程。近年来,文献中提出的大多数MRI重建方法侧重于整体图像重建,而不是增强边缘信息。这项工作通过详细阐述边缘信息的增强来避开这一总体趋势。具体地说,我们介绍了一种新的并行成像耦合双鉴别器生成对抗网络(PIDD-GAN),用于通过合并多视图信息进行快速多通道MRI重建。双鉴别器设计旨在改善MRI重建中的边缘信息。一个鉴别器用于整体图像重建,而另一个用于增强边缘信息。提出了一种改进的局部和全局残差学习U网络。频率通道注意块(FCA块)嵌入在发生器中,用于合并注意机制。引入内容丢失来训练生成器以获得更好的重建质量。我们在卡尔加里·坎皮纳斯公共脑MR数据集上进行了综合实验,并将我们的方法与最先进的MRI重建方法进行了比较。在MICCAI13数据集上进行剩余学习的消融研究,以验证所提出的模块。结果表明,我们的PIDD-GAN提供了高质量的重建MR图像,并保留了良好的边缘信息。单幅图像重建时间小于5ms,满足快速处理的要求。 摘要:In clinical medicine, magnetic resonance imaging (MRI) is one of the most important tools for diagnosis, triage, prognosis, and treatment planning. However, MRI suffers from an inherent slow data acquisition process because data is collected sequentially in k-space. In recent years, most MRI reconstruction methods proposed in the literature focus on holistic image reconstruction rather than enhancing the edge information. This work steps aside this general trend by elaborating on the enhancement of edge information. Specifically, we introduce a novel parallel imaging coupled dual discriminator generative adversarial network (PIDD-GAN) for fast multi-channel MRI reconstruction by incorporating multi-view information. The dual discriminator design aims to improve the edge information in MRI reconstruction. One discriminator is used for holistic image reconstruction, whereas the other one is responsible for enhancing edge information. An improved U-Net with local and global residual learning is proposed for the generator. Frequency channel attention blocks (FCA Blocks) are embedded in the generator for incorporating attention mechanisms. Content loss is introduced to train the generator for better reconstruction quality. We performed comprehensive experiments on Calgary-Campinas public brain MR dataset and compared our method with state-of-the-art MRI reconstruction methods. Ablation studies of residual learning were conducted on the MICCAI13 dataset to validate the proposed modules. Results show that our PIDD-GAN provides high-quality reconstructed MR images, with well-preserved edge information. The time of single-image reconstruction is below 5ms, which meets the demand of faster processing.

GAN|对抗|攻击|生成相关(9篇)

【1】 Learning to Learn Transferable Attack 标题:学会学习可转移攻击 链接:https://arxiv.org/abs/2112.06658

作者:Shuman Fang,Jie Li,Xianming Lin,Rongrong Ji 机构:†Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University, China, ‡Peng Cheng Laboratory, Shenzhen, China, ∗Contributed Equally, ♮Corresponding Author 备注:AAAI 2022 摘要:迁移对抗攻击是一种非平凡的黑盒对抗攻击,其目的是在代理模型上制造对抗扰动,然后将此类扰动应用于受害者模型。然而,现有方法中扰动的可迁移性仍然有限,因为对抗扰动容易对单一代理模型和特定数据模式过拟合。在本文中,我们提出了一种"学会学习"的可转移攻击(LLTA)方法,该方法通过从数据和模型两方面进行扩充学习,使对抗扰动更具泛化性。对于数据扩充,我们采用简单的随机缩放和填充。对于模型扩充,我们随机改变反向传播而不是正向传播,以消除对模型预测的影响。通过将针对特定数据和修改后模型的攻击视为一项任务,我们期望对抗扰动借助足够多的任务获得泛化能力。为此,在扰动生成的迭代过程中,进一步引入了元学习算法。在广泛使用的数据集上的实验结果证明了我们的攻击方法的有效性,与最先进的方法相比,迁移攻击的成功率高12.85%。我们还在真实的在线系统(即Google Cloud Vision API)上评估了我们的方法,以进一步展示其实用潜力。 摘要:Transfer adversarial attack is a non-trivial black-box adversarial attack that aims to craft adversarial perturbations on the surrogate model and then apply such perturbations to the victim model. However, the transferability of perturbations from existing methods is still limited, since the adversarial perturbations are easily overfitting with a single surrogate model and specific data pattern. In this paper, we propose a Learning to Learn Transferable Attack (LLTA) method, which makes the adversarial perturbations more generalized via learning from both data and model augmentation. For data augmentation, we adopt simple random resizing and padding. For model augmentation, we randomly alter the back propagation instead of the forward propagation to eliminate the effect on the model prediction. By treating the attack of both specific data and a modified model as a task, we expect the adversarial perturbations to adopt enough tasks for generalization. To this end, the meta-learning algorithm is further introduced during the iteration of perturbation generation. Empirical results on the widely-used dataset demonstrate the effectiveness of our attack method with a 12.85% higher success rate of transfer attack compared with the state-of-the-art methods. We also evaluate our method on the real-world online system, i.e., Google Cloud Vision API, to further show the practical potentials of our method.
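文中的数据扩充采用简单的随机缩放与填充。这类变换常见的写法如下(示意实现,输出尺寸与缩放范围为假设):

```python
import torch
import torch.nn.functional as F

def random_resize_pad(x, out_size=224):
    """示意:随机缩小图像后随机填充回原尺寸,用于攻击时的输入多样化。"""
    rnd = torch.randint(out_size - 32, out_size, (1,)).item()   # 随机目标边长
    x = F.interpolate(x, size=(rnd, rnd), mode='bilinear', align_corners=False)
    pad = out_size - rnd
    left = torch.randint(0, pad + 1, (1,)).item()
    top = torch.randint(0, pad + 1, (1,)).item()
    # F.pad的4元组按(左, 右, 上, 下)作用于最后两维
    return F.pad(x, (left, pad - left, top, pad - top), value=0.0)

x = torch.randn(4, 3, 224, 224)
print(random_resize_pad(x).shape)  # torch.Size([4, 3, 224, 224])
```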

【2】 SAC-GAN: Structure-Aware Image-to-Image Composition for Self-Driving 标题:SAC-GAN:用于自动驾驶的结构感知图像到图像合成 链接:https://arxiv.org/abs/2112.06596

作者:Hang Zhou,Ali Mahdavi-Amiri,Rui Ma,Hao Zhang 机构:Simon Fraser University, Jilin University 摘要:我们提出了一种用于自动驾驶应用的图像增强合成方法。它是一种端到端的神经网络,经过训练后可以将以裁剪图像块表示的对象(例如车辆或行人)无缝地合成到背景场景图像中。由于我们的方法更强调合成图像的语义和结构一致性,而不是像素级的RGB精度,因此我们使用结构感知特征定制网络的输入和输出,并相应地设计网络损失。具体来说,我们的网络以从输入场景图像中提取的语义布局特征、从输入对象图像块的边缘和轮廓编码的特征以及潜在码作为输入,生成定义对象图像块平移和缩放的二维空间仿射变换。所学习的参数被进一步送入可微空间变换网络,以将对象图像块变换到目标图像中,其中我们的模型使用仿射变换鉴别器和布局鉴别器进行对抗训练。我们从合成图像的质量、可组合性和泛化性方面,在知名的自动驾驶数据集上评估了我们的网络,称为SAC-GAN(结构感知合成)。与最先进替代方案的比较证实了我们方法的优越性。 摘要:We present a compositional approach to image augmentation for self-driving applications. It is an end-to-end neural network that is trained to seamlessly compose an object (e.g., a vehicle or pedestrian) represented as a cropped patch from an object image, into a background scene image. As our approach emphasizes more on semantic and structural coherence of the composed images, rather than their pixel-level RGB accuracies, we tailor the input and output of our network with structure-aware features and design our network losses accordingly. Specifically, our network takes the semantic layout features from the input scene image, features encoded from the edges and silhouette in the input object patch, as well as a latent code as inputs, and generates a 2D spatial affine transform defining the translation and scaling of the object patch. The learned parameters are further fed into a differentiable spatial transformer network to transform the object patch into the target image, where our model is trained adversarially using an affine transform discriminator and a layout discriminator. We evaluate our network, coined SAC-GAN for structure-aware composition, on prominent self-driving datasets in terms of quality, composability, and generalizability of the composite images. Comparisons are made to state-of-the-art alternatives, confirming superiority of our method.
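该论文的关键部件是把预测出的平移/缩放参数送入可微空间变换网络。下面用PyTorch的affine_grid/grid_sample给出一个示意(预测网络省略,变换参数手工给定;有效区域掩码的处理为简化假设):

```python
import torch
import torch.nn.functional as F

def paste_patch(scene, patch, tx, ty, scale):
    """示意:按平移(tx, ty)与缩放scale,用可微空间变换把图像块贴入场景。"""
    B = scene.size(0)
    # grid_sample按输出坐标采样输入,因此theta写成逆变换
    theta = torch.zeros(B, 2, 3)
    theta[:, 0, 0] = 1.0 / scale
    theta[:, 1, 1] = 1.0 / scale
    theta[:, 0, 2] = -tx / scale
    theta[:, 1, 2] = -ty / scale
    grid = F.affine_grid(theta, scene.size(), align_corners=False)
    warped = F.grid_sample(patch, grid, align_corners=False, padding_mode='zeros')
    mask = (warped.abs().sum(1, keepdim=True) > 0).float()   # 简化的有效区域掩码
    return scene * (1 - mask) + warped * mask

scene = torch.rand(1, 3, 128, 128)
patch = torch.rand(1, 3, 128, 128)
out = paste_patch(scene, patch, tx=0.3, ty=-0.2, scale=0.5)
print(out.shape)  # torch.Size([1, 3, 128, 128])
```

整个贴合过程对(tx, ty, scale)可导,因此仿射参数可以由网络端到端地学习。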

【3】 Triangle Attack: A Query-efficient Decision-based Adversarial Attack 标题:三角攻击:一种查询高效的基于决策的对抗性攻击 链接:https://arxiv.org/abs/2112.06569

作者:Xiaosen Wang,Zeliang Zhang,Kangheng Tong,Dihong Gong,Kun He,Zhifeng Li,Wei Liu 机构:School of Computer Science and Technology, Huazhong University of Science and Technology, Data Platform, Tencent 备注:10 pages 摘要:基于决策的攻击将目标模型视为一个黑匣子,只访问硬预测标签,因此对实际应用构成严重威胁。最近已作出巨大努力,以减少查询数量;然而,现有的基于决策的攻击仍然需要数千个查询才能生成高质量的对抗性示例。在这项工作中,我们发现一个良性样本、当前和下一个敌对样本可以自然地在子空间中为任何迭代攻击构造一个三角形。基于正弦定律,我们提出了一种新的三角形攻击(TA),利用三角形中长边总是与大角度相对的几何信息来优化扰动。然而,直接将这些信息应用于输入图像是无效的,因为它无法在高维空间中彻底探索输入样本的邻域。为了解决这个问题,TA优化了低频空间中的扰动,以有效地降低维数,因为这种几何特性具有普遍性。对ImageNet数据集的广泛评估表明,TA在1000个查询中实现了更高的攻击成功率,并且与现有基于决策的攻击相比,在各种扰动预算下实现相同的攻击成功率所需的查询数量要少得多。通过如此高的效率,我们进一步证明了TA在现实世界API,即腾讯云API上的适用性。 摘要:Decision-based attack poses a severe threat to real-world applications since it regards the target model as a black box and only accesses the hard prediction label. Great efforts have been made recently to decrease the number of queries; however, existing decision-based attacks still require thousands of queries in order to generate good quality adversarial examples. In this work, we find that a benign sample, the current and the next adversarial examples could naturally construct a triangle in a subspace for any iterative attacks. Based on the law of sines, we propose a novel Triangle Attack (TA) to optimize the perturbation by utilizing the geometric information that the longer side is always opposite the larger angle in any triangle. However, directly applying such information on the input image is ineffective because it cannot thoroughly explore the neighborhood of the input sample in the high dimensional space. To address this issue, TA optimizes the perturbation in the low frequency space for effective dimensionality reduction owing to the generality of such geometric property. Extensive evaluations on the ImageNet dataset demonstrate that TA achieves a much higher attack success rate within 1,000 queries and needs a much less number of queries to achieve the same attack success rate under various perturbation budgets than existing decision-based attacks. With such high efficiency, we further demonstrate the applicability of TA on real-world API, i.e., Tencent Cloud API.
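正弦定理在此的用法可以用如下示意关系说明(记号为本文示意,非原文符号):在良性样本$x$、当前对抗样本$x_t$与候选样本$x_{t+1}$构成的三角形中,

```latex
\[
\frac{\|x_{t+1}-x\|}{\sin\angle x_t} \;=\; \frac{\|x_t-x\|}{\sin\angle x_{t+1}},
\]
```

其中$\angle x_t$、$\angle x_{t+1}$分别是顶点$x_t$、$x_{t+1}$处的内角。由"长边对大角",只要让候选点处的对角$\angle x_{t+1}$大于当前点处的对角$\angle x_t$,就有$\|x_{t+1}-x\|<\|x_t-x\|$,即新的对抗样本离良性样本更近、扰动更小。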

【4】 MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning 标题:MAGIC:针对不同和不成对的基于文本的图像字幕的多模态关系图对抗性推理 链接:https://arxiv.org/abs/2112.06558

作者:Wenqiao Zhang,Haochen Shi,Jiannan Guo,Shengyu Zhang,Qingpeng Cai,Juncheng Li,Sihui Luo,Yueting Zhuang 机构: Zhejiang University, Universit´e de Montr´eal, National University of Singapore, Ningbo University 摘要:基于文本的图像字幕(TextCap)要求同时理解视觉内容和阅读图像中的文本,以生成自然语言描述。尽管文本在我们的日常环境中无处不在,该任务能教会机器进一步理解复杂的人类环境,但与常规字幕任务相比,它带来了额外的挑战。基于文本的图像直观上包含丰富而复杂的多模态关系内容,也就是说,图像细节可以从多个视角进行多样化描述,而非局限于单一标题。诚然,我们可以引入额外的成对训练数据来体现图像描述的多样性,但为TextCap标注带有额外文本的配对既费时又费力。基于上述观点,我们研究了如何使用一种不成对的训练范式来生成关注不同图像部分的多样化字幕。我们提出了多模态关系图对抗推理(MAGIC)框架,用于多样化、不成对的TextCap。该框架可以自适应地构造图像的多个多模态关系图,并对图之间的复杂关系进行建模,以表示描述的多样性。此外,从建模的图出发构建了一个级联生成对抗网络,在图像-句子特征对齐和语言连贯两个层面上推理不成对的字幕生成。我们验证了MAGIC从图像的不同关系信息项生成多样化字幕的有效性。实验结果表明,MAGIC可以在不使用任何图像-字幕训练对的情况下产生非常有希望的结果。 摘要:Text-based image captioning (TextCap) requires simultaneous comprehension of visual content and reading the text of images to generate a natural language description. Although a task can teach machines to understand the complex human environment further given that text is omnipresent in our daily surroundings, it poses additional challenges in normal captioning. A text-based image intuitively contains abundant and complex multimodal relational content, that is, image details can be described diversely from multiview rather than a single caption. Certainly, we can introduce additional paired training data to show the diversity of images' descriptions, but this process is labor-intensive and time-consuming for TextCap pair annotations with extra texts. Based on the insight mentioned above, we investigate how to generate diverse captions that focus on different image parts using an unpaired training paradigm. We propose the Multimodal relAtional Graph adversarIal inferenCe (MAGIC) framework for diverse and unpaired TextCap. This framework can adaptively construct multiple multimodal relational graphs of images and model complex relationships among graphs to represent descriptive diversity. Moreover, a cascaded generative adversarial network is developed from modeled graphs to infer the unpaired caption generation in image-sentence feature alignment and linguistic coherence levels. We validate the effectiveness of MAGIC in generating diverse captions from different relational information items of an image. Experimental results show that MAGIC can generate very promising outcomes without using any image-caption training pairs.

【5】 DGL-GAN: Discriminator Guided Learning for GAN Compression 标题:DGL-GAN:用于GAN压缩的鉴别器引导学习 链接:https://arxiv.org/abs/2112.06502

作者:Yuesong Tian,Li Shen,Dacheng Tao,Zhifeng Li,Wei Liu 机构: Zhejiang University 摘要:具有高计算成本的生成对抗网络(GAN),例如BigGAN和StyleGAN2,在从随机噪声合成高保真、高分辨率且多样化的图像方面取得了显著成果。为了使GAN能在计算资源受限的设备上广泛应用,在保持生成照片级真实感图像的同时降低其计算成本,是一个迫切而具有挑战性的方向。在这项工作中,我们提出了一种新颖而简单的鉴别器引导学习(Discriminator Guided Learning)方法来压缩原始GAN,称为DGL-GAN。基于教师鉴别器可能包含有意义信息这一现象,我们仅通过对抗函数从教师鉴别器迁移知识。我们证明DGL-GAN是有效的,因为从经验上看,向教师鉴别器学习可以提升学生GAN的表现,这一点得到了大量实验结果的验证。此外,我们还提出了一种用于训练DGL-GAN的两阶段训练策略,当我们将DGL-GAN应用于压缩两个最具代表性的大规模原始GAN,即StyleGAN2和BigGAN时,该策略可以在很大程度上稳定训练过程并获得优异的性能。实验表明,DGL-GAN在StyleGAN2(FFHQ上FID 2.92,参数约为StyleGAN2的$1/3$)和BigGAN(ImageNet上IS 93.29、FID 9.92,参数约为BigGAN的$1/4$)上都达到了最先进(SOTA)的结果,并且优于几种现有的GAN压缩技术。此外,DGL-GAN还可以有效提升原始未压缩GAN的性能:使用DGL-GAN增强的未压缩StyleGAN2在FFHQ上取得FID 2.65,刷新了最先进性能。代码和模型可从 https://github.com/yuesongtian/DGL-GAN 获得。 摘要:Generative Adversarial Networks (GANs) with high computation costs, e.g., BigGAN and StyleGAN2, have achieved remarkable results in synthesizing high resolution and diverse images with high fidelity from random noises. Reducing the computation cost of GANs while keeping generating photo-realistic images is an urgent and challenging field for their broad applications on computational resource-limited devices. In this work, we propose a novel yet simple {\bf D}iscriminator {\bf G}uided {\bf L}earning approach for compressing vanilla {\bf GAN}, dubbed {\bf DGL-GAN}. Motivated by the phenomenon that the teacher discriminator may contain some meaningful information, we transfer the knowledge merely from the teacher discriminator via the adversarial function. We show DGL-GAN is valid since empirically, learning from the teacher discriminator could facilitate the performance of student GANs, verified by extensive experimental findings. Furthermore, we propose a two-stage training strategy for training DGL-GAN, which can largely stabilize its training process and achieve superior performance when we apply DGL-GAN to compress the two most representative large-scale vanilla GANs, i.e., StyleGAN2 and BigGAN. Experiments show that DGL-GAN achieves state-of-the-art (SOTA) results on both StyleGAN2 (FID 2.92 on FFHQ with nearly $1/3$ parameters of StyleGAN2) and BigGAN (IS 93.29 and FID 9.92 on ImageNet with nearly $1/4$ parameters of BigGAN) and also outperforms several existing vanilla GAN compression techniques. Moreover, DGL-GAN is also effective in boosting the performance of original uncompressed GANs, original uncompressed StyleGAN2 boosted with DGL-GAN achieves FID 2.65 on FFHQ, which achieves a new state-of-the-art performance. Code and models are available at \url{https://github.com/yuesongtian/DGL-GAN}.
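按摘要所述,知识迁移仅通过对抗函数完成:学生生成器除对抗学生鉴别器外,还要"骗过"冻结的教师鉴别器。下面是一个玩具级示意(网络结构、损失形式与权重系数均为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))   # 学生生成器(玩具)
D_s = nn.Sequential(nn.Linear(32, 1))                                 # 学生鉴别器
D_t = nn.Sequential(nn.Linear(32, 1))                                 # 教师鉴别器(冻结)
for p in D_t.parameters():
    p.requires_grad_(False)   # 教师不更新,但梯度仍可经由它流向生成样本

z = torch.randn(8, 16)
fake = G(z)
# 非饱和GAN损失:学生鉴别器项 + 教师鉴别器引导项(0.1为假设的权重)
loss_g = F.softplus(-D_s(fake)).mean() + 0.1 * F.softplus(-D_t(fake)).mean()
loss_g.backward()
print(loss_g.item())
```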

【6】 Interpolated Joint Space Adversarial Training for Robust and Generalizable Defenses 标题:插值联合空间对抗训练的健壮性和泛化防御 链接:https://arxiv.org/abs/2112.06323

作者:Chun Pong Lau,Jiang Liu,Hossein Souri,Wei-An Lin,Soheil Feizi,Rama Chellappa 机构:Johns Hopkins University,Adobe,University of Maryland, College Park 备注:Under submission 摘要:对抗训练(AT)被认为是对抗攻击最可靠的防御手段之一。然而,经AT训练的模型会牺牲标准精度,且不能很好地推广到新的攻击。最近的研究表明,在新的威胁模型(如流形威胁模型或神经感知威胁模型)下,对抗样本的泛化能力有所提高。然而,前者需要精确的流形信息,而后者需要算法松弛。基于这些考虑,我们利用标准化流(Normalizing Flow)挖掘底层流形信息,确保精确流形假设成立。此外,我们还提出了一种新的威胁模型,称为联合空间威胁模型(JSTM),它可以作为神经感知威胁模型的一个特例,不需要额外的松弛即可设计相应的对抗攻击。在JSTM下,我们开发了新的对抗攻击和防御。混合(mixup)策略提高了神经网络的标准精度,但与AT结合时会牺牲鲁棒性。为了解决这个问题,我们提出了鲁棒混合(Robust Mixup)策略,在该策略中,我们最大化插值图像的对抗性,从而获得鲁棒性并防止过拟合。我们的实验表明,插值联合空间对抗训练(IJSAT)在CIFAR-10/100、OM-ImageNet和CIFAR-10-C数据集的标准精度、鲁棒性和泛化性方面取得了良好的性能。IJSAT也很灵活,可以用作数据增强方法以提高标准精度,并可与许多现有AT方法相结合以提高鲁棒性。 摘要:Adversarial training (AT) is considered to be one of the most reliable defenses against adversarial attacks. However, models trained with AT sacrifice standard accuracy and do not generalize well to novel attacks. Recent works show generalization improvement with adversarial samples under novel threat models such as on-manifold threat model or neural perceptual threat model. However, the former requires exact manifold information while the latter requires algorithm relaxation. Motivated by these considerations, we exploit the underlying manifold information with Normalizing Flow, ensuring that exact manifold assumption holds. Moreover, we propose a novel threat model called Joint Space Threat Model (JSTM), which can serve as a special case of the neural perceptual threat model that does not require additional relaxation to craft the corresponding adversarial attacks. Under JSTM, we develop novel adversarial attacks and defenses. The mixup strategy improves the standard accuracy of neural networks but sacrifices robustness when combined with AT. To tackle this issue, we propose the Robust Mixup strategy in which we maximize the adversity of the interpolated images and gain robustness and prevent overfitting. Our experiments show that Interpolated Joint Space Adversarial Training (IJSAT) achieves good performance in standard accuracy, robustness, and generalization in CIFAR-10/100, OM-ImageNet, and CIFAR-10-C datasets. IJSAT is also flexible and can be used as a data augmentation method to improve standard accuracy and combine with many existing AT approaches to improve robustness.

【7】 BIPS: Bi-modal Indoor Panorama Synthesis via Residual Depth-aided Adversarial Learning 标题:BIPS:基于残差深度辅助对抗性学习的双模态室内全景图合成 链接:https://arxiv.org/abs/2112.06179

作者:Changgyoon Oh,Wonjune Cho,Daehee Park,Yujeong Chae,Lin Wang,Kuk-Jin Yoon 机构:Visual Intelligence Lab., KAIST, Korea 摘要:提供全向深度和RGB信息对于许多应用(例如VR/AR)非常重要。但是,由于全向RGB-D数据并不总是可用,因此从场景的有限信息合成RGB-D全景数据可能很有用。为此,以前的一些工作尝试从透视RGB图像合成RGB全景图像;然而,它们的图像质量有限,且不能直接扩展用于RGB-D全景图合成。在本文中,我们研究了一个新问题:任意相机和深度传感器配置下的RGB-D全景图合成。据此,我们提出了一种新的双模态(RGB-D)全景合成(BIPS)框架。我们尤其关注室内环境,其中RGB-D全景可以为许多应用提供完整的3D模型。我们设计了一个融合双模态信息的生成器,并用残差辅助对抗学习(RDAL)对其进行训练。RDAL通过联合推断RGB全景、布局深度和残差深度,合成逼真的室内布局结构和内饰。此外,由于RGB-D全景图合成没有专门的评价指标,我们提出了一种新的指标来有效地评价其感知质量。大量实验表明,我们的方法合成了高质量的室内RGB-D全景图,并比以前的方法提供了更逼真的三维室内模型。代码将在论文录用后发布。 摘要:Providing omnidirectional depth along with RGB information is important for numerous applications, e.g., VR/AR. However, as omnidirectional RGB-D data is not always available, synthesizing RGB-D panorama data from limited information of a scene can be useful. Therefore, some prior works tried to synthesize RGB panorama images from perspective RGB images; however, they suffer from limited image quality and can not be directly extended for RGB-D panorama synthesis. In this paper, we study a new problem: RGB-D panorama synthesis under the arbitrary configurations of cameras and depth sensors. Accordingly, we propose a novel bi-modal (RGB-D) panorama synthesis (BIPS) framework. Especially, we focus on indoor environments where the RGB-D panorama can provide a complete 3D model for many applications. We design a generator that fuses the bi-modal information and train it with residual-aided adversarial learning (RDAL). RDAL allows to synthesize realistic indoor layout structures and interiors by jointly inferring RGB panorama, layout depth, and residual depth. In addition, as there is no tailored evaluation metric for RGB-D panorama synthesis, we propose a novel metric to effectively evaluate its perceptual quality. Extensive experiments show that our method synthesizes high-quality indoor RGB-D panoramas and provides more realistic 3D indoor models than prior methods. Code will be released upon acceptance.

【8】 Improving the Transferability of Adversarial Examples with Resized-Diverse-Inputs, Diversity-Ensemble and Region Fitting 标题:利用调整大小的多元输入、多元集成和区域拟合提高对抗性范例的可转移性 链接:https://arxiv.org/abs/2112.06011

作者:Junhua Zou,Zhisong Pan,Junyang Qiu,Xin Liu,Ting Rui,Wei Li 机构: Army Engineering University of PLA, China, Jiangnan Institute of Computing Technology, China 备注:Accepted to ECCV2020 摘要:我们介绍了一个三阶段的流程:调整大小的多样性输入(RDIM)、多样性集成(DEM)和区域拟合,它们共同生成可转移的对抗性示例。我们首先探讨现有攻击之间的内在关系,并提出能够利用这种关系的RDIM。然后,我们提出了DEM(RDIM的多尺度版本)来生成多尺度梯度。在前两步之后,我们将值拟合转换为跨迭代的区域拟合。RDIM和区域拟合不需要额外的运行时间,这三个步骤可以很好地集成到其他攻击中。我们最好的攻击愚弄了六个黑匣子防御,平均成功率为93%,高于最先进的基于梯度的攻击。此外,我们重新考虑现有的攻击,而不是简单地在旧方法上叠加新方法以获得更好的性能。预计我们的发现将作为探索攻击方法之间内在关系的开始。代码可在https://github.com/278287847/DEM. 摘要:We introduce a three stage pipeline: resized-diverse-inputs (RDIM), diversity-ensemble (DEM) and region fitting, that work together to generate transferable adversarial examples. We first explore the internal relationship between existing attacks, and propose RDIM that is capable of exploiting this relationship. Then we propose DEM, the multi-scale version of RDIM, to generate multi-scale gradients. After the first two steps we transform value fitting into region fitting across iterations. RDIM and region fitting do not require extra running time and these three steps can be well integrated into other attacks. Our best attack fools six black-box defenses with a 93% success rate on average, which is higher than the state-of-the-art gradient-based attacks. Besides, we rethink existing attacks rather than simply stacking new methods on the old ones to get better performance. It is expected that our findings will serve as the beginning of exploring the internal relationship between attack methods. Codes are available at https://github.com/278287847/DEM.

【9】 AvatarMe++: Facial Shape and BRDF Inference with Photorealistic Rendering-Aware GANs 标题:AvatarMe++:使用照片级真实感渲染感知Gans进行脸型和BRDF推断 链接:https://arxiv.org/abs/2112.05957

作者:Alexandros Lattas,Stylianos Moschoglou,Stylianos Ploumpis,Baris Gecer,Abhijeet Ghosh,Stefanos Zafeiriou 机构: theAll authors are with the Department of Computing 备注:Project and Dataset page: ( this https URL ). 20 pages, including supplemental materials. Accepted for publishing at IEEE Transactions on Pattern Analysis and Machine Intelligence on 13 November 2021. Copyright 2021 IEEE. Personal use of this material is permitted 摘要:在过去的几年中,许多人脸分析任务都取得了惊人的成绩,其应用包括人脸生成和从单幅"野外"(in-the-wild)图像进行三维人脸重建。然而,据我们所知,没有一种方法可以从"野外"图像生成可直接渲染的高分辨率3D人脸,这可归因于:(a)缺乏可用的训练数据,以及(b)缺乏可成功应用于高分辨率数据的稳健方法。在这项工作中,我们介绍了第一种能够从单幅"野外"图像重建照片级真实感、可直接渲染的三维人脸几何体和BRDF的方法。我们采集了一个大规模的面部形状和反射率数据集,并已将其公开。我们定义了一种快速的面部照片级真实感可微渲染方法,具有精确的面部皮肤漫反射和镜面反射、自遮挡以及次表面散射近似。在此基础上,我们训练了一个网络,将面部漫反射和镜面反射BRDF分量,从用最先进3DMM拟合方法重建的、带烘焙光照的形状和纹理中解耦出来。我们的方法大幅优于现有方法,可从单幅低分辨率图像重建高分辨率的3D人脸,这些人脸可以在各种应用中渲染,并跨越恐怖谷(uncanny valley)。 摘要:Over the last years, many face analysis tasks have accomplished astounding performance, with applications including face generation and 3D face reconstruction from a single "in-the-wild" image. Nevertheless, to the best of our knowledge, there is no method which can produce render-ready high-resolution 3D faces from "in-the-wild" images and this can be attributed to the: (a) scarcity of available data for training, and (b) lack of robust methodologies that can successfully be applied on very high-resolution data. In this work, we introduce the first method that is able to reconstruct photorealistic render-ready 3D facial geometry and BRDF from a single "in-the-wild" image. We capture a large dataset of facial shape and reflectance, which we have made public. We define a fast facial photorealistic differentiable rendering methodology with accurate facial skin diffuse and specular reflection, self-occlusion and subsurface scattering approximation. With this, we train a network that disentangles the facial diffuse and specular BRDF components from a shape and texture with baked illumination, reconstructed with a state-of-the-art 3DMM fitting method. Our method outperforms the existing arts by a significant margin and reconstructs high-resolution 3D faces from a single low-resolution image, that can be rendered in various applications, and bridge the uncanny valley.

Attention注意力(1篇)

【1】 Attention based Broadly Self-guided Network for Low light Image Enhancement 标题:基于注意力的广义自引导网络在微光图像增强中的应用 链接:https://arxiv.org/abs/2112.06226

作者:Zilong Chen,Yaling Liang,Minghui Du 备注:10 Pages,8 Figures,4 Tables 摘要:在过去的几年中,深度卷积神经网络在微光图像增强方面取得了令人瞩目的成功。现有的深度学习方法大多通过堆叠网络结构和加深网络深度来增强特征提取能力,这会增加单幅图像的运行时成本。为了在减少推理时间的同时充分提取局部特征和全局特征,受SGN的启发,我们提出了一种基于注意力的广义自引导网络(ABSGN),用于现实世界中的微光图像增强。这种广义策略能够处理不同曝光下的噪声。该网络在多个主流基准上得到验证。实验结果表明,所提出的网络优于大多数最先进的微光图像增强方案。 摘要:During the past years, deep convolutional neural networks have achieved impressive success in low-light image enhancement. Existing deep learning methods mostly enhance the ability of feature extraction by stacking network structures and deepening the depth of the network, which causes more runtime cost on a single image. In order to reduce inference time while fully extracting local and global features, inspired by SGN, we propose an Attention based Broadly self-guided network (ABSGN) for real-world low-light image enhancement. Such a broad strategy is able to handle the noise at different exposures. The proposed network is validated on many mainstream benchmarks. Additional experimental results show that the proposed network outperforms most state-of-the-art low-light image enhancement solutions.

人脸|人群计数(4篇)

【1】 CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability 标题:CR-FIQA:基于样本相对可分类性学习的人脸图像质量评价 链接:https://arxiv.org/abs/2112.06592

作者:Fadi Boutros,Meiling Fang,Marcel Klemt,Biying Fu,Naser Damer 机构:Fraunhofer Institute for Computer Graphics Research IGD, Darmstadt, Germany, Department of Computer Science, TU Darmstadt, Darmstadt, Germany 摘要:人脸图像的质量会显著影响底层人脸识别算法的性能。人脸图像质量评估(FIQA)评估捕获图像在实现可靠和准确的识别性能方面的效用。在这项工作中,我们提出了一种新的学习范式,在训练过程中学习内部网络观察。基于此,我们提出的CR-FIQA使用该范式通过预测样本的相对可分类性来估计样本的人脸图像质量。这种可分类性是基于训练样本特征表示相对于其类中心和最近的负类中心在角度空间中的分配来度量的。我们通过实验说明了人脸图像质量和样本相对可分类性之间的相关性。由于该属性仅对训练数据集可见,我们建议从训练数据集中学习该属性,并利用该属性预测未知样本的质量度量。该训练同时执行,同时通过用于人脸识别模型训练的基于角度裕度惩罚的softmax损失优化类中心。通过在八个基准测试和四个人脸识别模型上的广泛评估实验,我们证明了我们提出的CR-FIQA算法优于最先进的(SOTA)FIQA算法。 摘要:The quality of face images significantly influences the performance of underlying face recognition algorithms. Face image quality assessment (FIQA) estimates the utility of the captured image in achieving reliable and accurate recognition performance. In this work, we propose a novel learning paradigm that learns internal network observations during the training process. Based on that, our proposed CR-FIQA uses this paradigm to estimate the face image quality of a sample by predicting its relative classifiability. This classifiability is measured based on the allocation of the training sample feature representation in angular space with respect to its class center and the nearest negative class center. We experimentally illustrate the correlation between the face image quality and the sample relative classifiability. As such property is only observable for the training dataset, we propose to learn this property from the training dataset and utilize it to predict the quality measure on unseen samples. This training is performed simultaneously while optimizing the class centers by an angular margin penalty-based softmax loss used for face recognition model training. Through extensive evaluation experiments on eight benchmarks and four face recognition models, we demonstrate the superiority of our proposed CR-FIQA over state-of-the-art (SOTA) FIQA algorithms.
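可分类性度量的一个示意实现如下(按摘要思路,用样本特征与本类中心的角相似度CCS,相对于最近负类中心相似度NNCCS的比值来刻画;公式细节与归一化方式以原文为准,分母中+1为避免负值的假设性处理):

```python
import torch
import torch.nn.functional as F

def relative_classifiability(feat, centers, label):
    """示意:CCS / (NNCCS + 1) 形式的相对可分类性打分。
    feat: (D,) 样本特征;centers: (C, D) 各类中心;label: 样本的真实类别。"""
    f = F.normalize(feat, dim=0)
    c = F.normalize(centers, dim=1)
    cos = c @ f                                    # 与所有类中心的余弦相似度
    ccs = cos[label]                               # 与本类中心的相似度
    neg = torch.cat([cos[:label], cos[label + 1:]])
    nnccs = neg.max()                              # 最近负类中心的相似度
    return ccs / (nnccs + 1.0 + 1e-9)

feat = torch.randn(512)
centers = torch.randn(10, 512)
print(relative_classifiability(feat, centers, label=3).item())
```

训练时类中心随基于角裕度的softmax损失一同优化;推理时质量回归头直接预测这一分数。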

【2】 Anatomizing Bias in Facial Analysis 标题:面部分析中的解剖偏差 链接:https://arxiv.org/abs/2112.06522

作者:Richa Singh,Puspita Majumdar,Surbhi Mittal,Mayank Vatsa 机构: IIT Jodhpur, IIIT-Delhi 备注:Accepted in AAAI 2022 摘要:现有的面部分析系统已被证明会产生对某些人口亚组有偏见的结果。由于其对社会的影响,必须确保这些制度不存在基于性别、身份或个人肤色的歧视。这导致了人工智能系统中偏差识别和缓解的研究。在本文中,我们封装了用于人脸分析的偏差检测/估计和缓解算法。我们的主要贡献包括对为理解偏差而提出的算法的系统回顾,以及对现有偏差缓解算法的分类和广泛概述。我们还讨论了偏见面部分析领域的开放性挑战。 摘要:Existing facial analysis systems have been shown to yield biased results against certain demographic subgroups. Due to its impact on society, it has become imperative to ensure that these systems do not discriminate based on gender, identity, or skin tone of individuals. This has led to research in the identification and mitigation of bias in AI systems. In this paper, we encapsulate bias detection/estimation and mitigation algorithms for facial analysis. Our main contributions include a systematic review of algorithms proposed for understanding bias, along with a taxonomy and extensive overview of existing bias mitigation algorithms. We also discuss open challenges in the field of biased facial analysis.

【3】 Smooth-Swap: A Simple Enhancement for Face-Swapping with Smoothness 标题:平滑交换:一种简单的平滑人脸交换增强方法 链接:https://arxiv.org/abs/2112.05907

作者:Jiseob Kim,Jihoon Lee,Byoung-Tak Zhang 机构:Seoul National Univ., Kakao Brain 摘要:近年来,人脸交换模型在生成质量方面取得了进展,并因其在隐私保护和娱乐方面的应用而受到关注。然而,它们复杂的体系结构和损失函数通常需要仔细调整才能训练成功。在本文中,我们提出了一种新的人脸交换模型,称为"Smooth-Swap"(平滑交换),该模型侧重于实现身份嵌入的平滑性,而不是采用复杂的手工设计。我们假设人脸交换的难点在于不稳定的梯度,而这可以通过平滑的身份嵌入器来解决。Smooth-Swap采用了使用监督对比学习训练的嵌入器,我们发现其更好的平滑性使得训练更快、更稳定,即使只使用简单的基于U-Net的生成器和三个基本损失函数。在人脸交换基准(FFHQ、FaceForensics++)和野外人脸图像上进行的大量实验表明,我们的模型在身份变化方面,在数量和质量上与现有方法相当,甚至更优。 摘要:In recent years, face-swapping models have progressed in generation quality and drawn attention for their applications in privacy protection and entertainment. However, their complex architectures and loss functions often require careful tuning for successful training. In this paper, we propose a new face-swapping model called `Smooth-Swap', which focuses on deriving the smoothness of the identity embedding instead of employing complex handcrafted designs. We postulate that the gist of the difficulty in face-swapping is unstable gradients and it can be resolved by a smooth identity embedder. Smooth-swap adopts an embedder trained using supervised contrastive learning, where we find its improved smoothness allows faster and stable training even with a simple U-Net-based generator and three basic loss functions. Extensive experiments on face-swapping benchmarks (FFHQ, FaceForensics++) and face images in the wild show that our model is also quantitatively and qualitatively comparable or even superior to existing methods in terms of identity change.

【4】 Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets 标题:自然场景中人类视觉搜索计算模型的基准比较:模型比较和参考数据集 链接:https://arxiv.org/abs/2112.05808

作者:F. Travi,G. Ruarte,G. Bujia,J. E. Kamienkowski 机构:Universidad de Buenos Aires - CONICET 摘要:视觉搜索几乎是任何日常人类与环境的目标导向交互的重要组成部分。目前,有几种算法能够预测简单观察时的注视位置,但很少有模型试图模拟自然场景中视觉搜索时的人类行为。此外,这些模型在设计上差异很大,在评估它们时所用的数据集和指标上也存在差异。因此,需要一个参考点,在该点上可以测试每个模型,并从中得出潜在的改进。在目前的工作中,我们在自然场景中选择公开可用的最先进的视觉搜索模型,并在不同的数据集上对其进行评估,使用相同的度量来估计其效率和与人类受试者的相似性。特别是,我们通过结合基于神经网络的可视化搜索模型,对理想的贝叶斯搜索器进行了改进,使其能够推广到其他数据集。目前的工作揭示了当前模型的局限性,以及如何通过组合方法实现潜在的改进。此外,它还提供了一个解决方案,以满足对基准数据和度量的迫切需要,从而支持开发更通用的人类视觉搜索计算模型。 摘要:Visual search is an essential part of almost any everyday human goal-directed interaction with the environment. Nowadays, several algorithms are able to predict gaze positions during simple observation, but few models attempt to simulate human behavior during visual search in natural scenes. Furthermore, these models vary widely in their design and exhibit differences in the datasets and metrics with which they were evaluated. Thus, there is a need for a reference point, on which each model can be tested and from where potential improvements can be derived. In the present work, we select publicly available state-of-the-art visual search models in natural scenes and evaluate them on different datasets, employing the same metrics to estimate their efficiency and similarity with human subjects. In particular, we propose an improvement to the Ideal Bayesian Searcher through a combination with a neural network-based visual search model, enabling it to generalize to other datasets. The present work sheds light on the limitations of current models and how potential improvements can be accomplished by combining approaches. Moreover, it moves forward on providing a solution for the urgent need for benchmarking data and metrics to support the development of more general human visual search computational models.

跟踪(1篇)

【1】 An Informative Tracking Benchmark 标题:信息丰富的跟踪基准 链接:https://arxiv.org/abs/2112.06467

作者:Xin Li,Qiao Liu,Wenjie Pei,Qiuhong Shen,Yaowei Wang,Huchuan Lu,Ming-Hsuan Yang 机构:Peng Cheng Laboratory, Chongqing Normal University, Harbin Institute of Technology, Shenzhen, Dalian University of Technology, Google Research, University of California, Merced, Yonsei University 备注:10 pages, 6 figures 摘要:随着视觉跟踪的快速发展,由于样本冗余和当前跟踪器之间的弱区分,现有基准变得信息较少,使得对所有数据集的评估非常耗时。因此,一个小型且信息丰富的基准测试非常有意义,它涵盖了所有典型的具有挑战性的场景,以便于评估跟踪器的性能。在这项工作中,我们开发了一种原则性的方法来构建一个小型且信息丰富的跟踪基准(ITB),在现有和新收集的数据集的120万帧中占7%,这使得能够在确保有效性的同时进行有效评估。具体来说,我们首先设计一种质量评估机制,从现有基准中选择信息量最大的序列,同时考虑1)挑战性水平、2)辨别力、3)外观变化密度。此外,我们收集额外的序列以确保跟踪场景的多样性和平衡性,导致每个场景总共有20个序列。通过分析15名最先进的跟踪器在相同数据上重新训练的结果,我们确定了每种情况下鲁棒跟踪的有效方法,并展示了该领域未来研究方向的新挑战。 摘要:Along with the rapid progress of visual tracking, existing benchmarks become less informative due to redundancy of samples and weak discrimination between current trackers, making evaluations on all datasets extremely time-consuming. Thus, a small and informative benchmark, which covers all typical challenging scenarios to facilitate assessing the tracker performance, is of great interest. In this work, we develop a principled way to construct a small and informative tracking benchmark (ITB) with 7% out of 1.2 M frames of existing and newly collected datasets, which enables efficient evaluation while ensuring effectiveness. Specifically, we first design a quality assessment mechanism to select the most informative sequences from existing benchmarks taking into account 1) challenging level, 2) discriminative strength, 3) and density of appearance variations. Furthermore, we collect additional sequences to ensure the diversity and balance of tracking scenarios, leading to a total of 20 sequences for each scenario. By analyzing the results of 15 state-of-the-art trackers re-trained on the same data, we determine the effective methods for robust tracking under each scenario and demonstrate new challenges for future research direction in this field.

表征学习(1篇)

【1】 HVH: Learning a Hybrid Neural Volumetric Representation for Dynamic Hair Performance Capture 标题:HVH:学习用于动态头发表演捕捉的混合神经体积表示 链接:https://arxiv.org/abs/2112.06904

作者:Ziyan Wang,Giljoo Nam,Tuur Stuyck,Stephen Lombardi,Michael Zollhoefer,Jessica Hodgins,Christoph Lassner 机构:Michael Zollh¨ofer, Carnegie Mellon University, Meta AI, Reality Labs Research 摘要:由于头发具有精细的几何结构、复杂的物理交互以及非同寻常的视觉外观,捕捉和渲染栩栩如生的头发尤其具有挑战性。然而,头发是可信化身的关键组成部分。在本文中,我们解决了上述问题:1)我们使用了一种新颖的体积头发表示,它由数千个基本体组成。基于神经渲染的最新进展,每个基本体都可以高效而逼真地渲染。2)为了获得可靠的控制信号,我们提出了一种在发丝级别跟踪头发的新方法。为了使计算量可控,我们使用导向发丝和经典技术将其扩展为稠密的头发。3)为了更好地增强模型的时间一致性和泛化能力,我们利用体积光线步进,使用多视图光流进一步优化我们表示的三维场景流。我们的方法不仅可以对已录制的多视图序列进行逼真渲染,还可以通过提供新的控制信号,为新的头发造型生成渲染。我们在视点合成和可驱动动画方面与现有工作进行了比较,并取得了最先进的结果。 摘要:Capturing and rendering life-like hair is particularly challenging due to its fine geometric structure, the complex physical interaction and its non-trivial visual appearance. Yet, hair is a critical component for believable avatars. In this paper, we address the aforementioned problems: 1) we use a novel, volumetric hair representation that is composed of thousands of primitives. Each primitive can be rendered efficiently, yet realistically, by building on the latest advances in neural rendering. 2) To have a reliable control signal, we present a novel way of tracking hair on the strand level. To keep the computational effort manageable, we use guide hairs and classic techniques to expand those into a dense hood of hair. 3) To better enforce temporal consistency and generalization ability of our model, we further optimize the 3D scene flow of our representation with multi-view optical flow, using volumetric ray marching. Our method can not only create realistic renders of recorded multi-view sequences, but also create renderings for new hair configurations by providing new control signals. We compare our method with existing work on viewpoint synthesis and drivable animation and achieve state-of-the-art results.

蒸馏|知识提取(1篇)

【1】 Up to 100x Faster Data-free Knowledge Distillation 标题:无数据知识提炼速度最高可提高100倍 链接:https://arxiv.org/abs/2112.06253

作者:Gongfan Fang,Kanya Mo,Xinchao Wang,Jie Song,Shitao Bei,Haofei Zhang,Mingli Song 机构:National University of Singapore, Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies 摘要:无数据知识蒸馏(DFKD)由于其仅使用合成数据即可压缩模型的能力,近年来越来越受到研究界的关注。尽管取得了令人鼓舞的成果,但最先进的DFKD方法仍然存在数据合成效率低下的问题,使得无数据训练过程非常耗时,因此不适用于大规模任务。在这项工作中,我们介绍了一种有效的方案,称为FastDFKD,它可以将DFKD加速一个数量级。我们方法的核心是一种重用训练数据中共享公共特征的新策略,从而合成不同的数据实例。与以前独立优化一组数据的方法不同,我们建议学习一个元合成器,它以公共特征为初始化来进行快速数据合成。因此,FastDFKD只需几个步骤即可完成数据合成,显著提高了无数据训练的效率。在CIFAR、NYUv2和ImageNet上的实验表明,所提出的FastDFKD在保持与现有技术相当性能的同时,实现了10$\times$甚至100$\times$的加速。 摘要:Data-free knowledge distillation (DFKD) has recently been attracting increasing attention from research communities, attributed to its capability to compress a model only using synthetic data. Despite the encouraging results achieved, state-of-the-art DFKD methods still suffer from the inefficiency of data synthesis, making the data-free training process extremely time-consuming and thus inapplicable for large-scale tasks. In this work, we introduce an efficacious scheme, termed as FastDFKD, that allows us to accelerate DFKD by a factor of orders of magnitude. At the heart of our approach is a novel strategy to reuse the shared common features in training data so as to synthesize different data instances. Unlike prior methods that optimize a set of data independently, we propose to learn a meta-synthesizer that seeks common features as the initialization for the fast data synthesis. As a result, FastDFKD achieves data synthesis within only a few steps, significantly enhancing the efficiency of data-free training. Experiments over CIFAR, NYUv2, and ImageNet demonstrate that the proposed FastDFKD achieves 10$\times$ and even 100$\times$ acceleration while preserving performances on par with state of the art.

视觉解释|视频理解VQA|caption等(1篇)

【1】 Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection 标题:基于实体增强型知识注入的知识视觉答疑改进与诊断 链接:https://arxiv.org/abs/2112.06888

作者:Diego Garcia-Olano,Yasumasa Onoe,Joydeep Ghosh 机构:University of Texas At Austin 摘要:基于知识的视觉问答(KBVQA)是一项双模任务,需要外部世界的知识才能正确回答文本问题和相关图像。最近的单模态文本工作表明,将知识注入预先训练的语言模型,特别是实体增强的知识图嵌入,可以提高下游以实体为中心的任务的性能。在这项工作中,我们实证研究了在双模环境中应用这些方法如何以及是否能够提高现有VQA系统在KBVQA任务上的性能。我们使用两个大型的公开VQA数据集进行实验,(1)KVQA,其中包含大多数罕见的Wikipedia实体,(2)OKVQA,它不太以实体为中心,更符合常识推理。两者都缺乏明确的实体跨度,我们研究了不同的弱监督和手动方法对获取它们的影响。此外,我们还分析了最近提出的双模态和单模态注意解释是如何受到这种实体增强表征的影响的。我们的结果表明,KBVQA任务的性能显著提高,无需额外昂贵的预训练,我们还提供了实体知识注入何时有助于提高模型的理解的见解。我们提供代码和增强的数据集,以确保再现性。 摘要:Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question and associated image. Recent single modality text work has shown knowledge injection into pre-trained language models, specifically entity enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks. In this work, we empirically study how and whether such methods, applied in a bi-modal setting, can improve an existing VQA system's performance on the KBVQA task. We experiment with two large publicly available VQA datasets, (1) KVQA which contains mostly rare Wikipedia entities and (2) OKVQA which is less entity-centric and more aligned with common sense reasoning. Both lack explicit entity spans and we study the effect of different weakly supervised and manual methods for obtaining them. Additionally we analyze how recently proposed bi-modal and single modal attention explanations are affected by the incorporation of such entity enhanced representations. Our results show substantial improved performance on the KBVQA task without the need for additional costly pre-training and we provide insights for when entity knowledge injection helps improve a model's understanding. We provide code and enhanced datasets for reproducibility.

超分辨率|去噪|去模糊|去雾(1篇)

【1】 Enhancing Multi-Scale Implicit Learning in Image Super-Resolution with Integrated Positional Encoding 标题:综合位置编码增强图像超分辨率多尺度内隐学习 链接:https://arxiv.org/abs/2112.05756

作者:Ying-Tian Liu,Yuan-Chen Guo,Song-Hai Zhang 机构:BNRist, Department of Computer Science and Technology, Tsinghua University 备注:10 pages, 5 figures 摘要:中心位置是否完全能够表示像素?在离散图像表示中用中心点表示像素并没有错误,但在图像超分辨率(SR)的背景下,将每个像素视为来自图像局部区域的信号的聚合更为合理。尽管基于坐标的隐式表示在任意尺度图像SR领域有很强的表现能力,但该领域并未充分考虑像素的这一本质。为此,我们提出了集成位置编码(IPE),通过在像素区域上聚合频率信息来扩展传统位置编码。我们将IPE应用于最新的任意尺度图像超分辨率方法:局部隐式图像函数(LIIF),提出了IPE-LIIF。我们通过定量和定性评估证明了IPE-LIIF的有效性,并进一步证明了IPE对更大图像尺度和多种基于隐式表示的方法的泛化能力。代码将被发布。 摘要:Is the center position fully capable of representing a pixel? There is nothing wrong to represent pixels with their centers in a discrete image representation, but it makes more sense to consider each pixel as the aggregation of signals from a local area in an image super-resolution (SR) context. Despite the great capability of coordinate-based implicit representation in the field of arbitrary-scale image SR, this area's nature of pixels is not fully considered. To this end, we propose integrated positional encoding (IPE), extending traditional positional encoding by aggregating frequency information over the pixel area. We apply IPE to the state-of-the-art arbitrary-scale image super-resolution method: local implicit image function (LIIF), presenting IPE-LIIF. We show the effectiveness of IPE-LIIF by quantitative and qualitative evaluations, and further demonstrate the generalization ability of IPE to larger image scales and multiple implicit-based methods. Code will be released.
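The core idea of integrated positional encoding can be sketched in one dimension: instead of encoding the pixel's center point, take the expected Fourier features over the pixel's footprint, which analytically attenuates frequencies finer than the pixel. A minimal sketch assuming a uniform distribution over the pixel interval (the paper's exact multi-dimensional formulation may differ):

```python
import numpy as np

def integrated_pos_enc(center, radius, num_freqs=8):
    """Expected Fourier features of a coordinate distributed uniformly over
    [center - radius, center + radius]. For x ~ U[c - r, c + r]:
        E[sin(w x)] = sin(w c) * sin(w r) / (w r),
    so components whose period is smaller than the pixel footprint are
    smoothly damped instead of aliasing."""
    feats = []
    for k in range(num_freqs):
        w = (2.0 ** k) * np.pi
        atten = np.sinc(w * radius / np.pi)   # np.sinc(t) = sin(pi t)/(pi t)
        feats += [np.sin(w * center) * atten, np.cos(w * center) * atten]
    return np.asarray(feats)

# A pixel covering [0.245, 0.255]: low frequencies survive, high ones shrink.
print(integrated_pos_enc(center=0.25, radius=0.005))
```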

点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】 Generate Point Clouds with Multiscale Details from Graph-Represented Structures 标题:由图表示的结构生成具有多尺度细节的点云 链接:https://arxiv.org/abs/2112.06433

作者:Ximing Yang,Cheng Jin 机构:Fudan University, China, Shanghai 备注:8 pages, 6 figures 摘要:从结构生成点云是控制点云生成的一种非常有价值的方法。基于结构的可控点云生成的主要问题之一是缺乏对细节的可控性,因为在大多数现有的结构表示中缺少细节。可以看出,细节和结构的定义是主观的。细节可以被视为小规模的结构。为了同时表示不同尺度的结构,我们提出了一种基于图的结构表示方法,称为多尺度结构图(MSG)。通过将细节视为小尺度结构,可以在不同的尺度、位置、密度和角度上找到类似的局部结构模式。从模式中学习到的知识可以转移到其他尺度的类似模式中。提出了一种基于多尺度结构的点云生成器(MSPCG)的编码和生成机制,用于从MSG生成密集点云,该机制可以同时学习具有各种空间特性的局部模式。我们的MSPCG还具有很强的泛化能力和可扩展性。在ShapeNet数据集上训练的MSPCG可以在点云上启用多尺度编辑,为看不见的类别生成点云,并从给定结构生成室内场景。实验结果表明,我们的方法明显优于基线方法。 摘要:Generating point clouds from structures is a highly valued method to control the generation of point clouds. One of the major problems in structure-based controllable point cloud generation is the lack of controllability to details, as details are missing in most existing representations of structures. It can be observed that definitions of details and structures are subjective. Details can be treated as structures on small scale. To represent structures in different scales at the same time, we present a graph-based representation of structures called the Multiscale Structure Graph (MSG). By treating details as small-scale structures, similar patterns of local structures can be found at different scales, places, densities, and angles. The knowledge learned from a pattern can be transferred to similar patterns in other scales. An encoding and generation mechanism, namely the Multiscale Structure-based Point Cloud Generator (MSPCG), for generating dense point clouds from the MSG is proposed, which can simultaneously learn local patterns with miscellaneous spatial properties. Our MSPCG also has great generalization ability and scalability. An MSPCG trained on the ShapeNet dataset can enable multi-scale editing on point clouds, generate point clouds for unseen categories, and generate indoor scenes from a given structure. The experimental results show that our method significantly outperforms baseline methods.

3D|3D重建等相关(1篇)

【1】 MVLayoutNet:3D layout reconstruction with multi-view panoramas 标题:MVLayoutNet:基于多视图全景图的三维布局重建 链接:https://arxiv.org/abs/2112.06133

作者:Zhihua Hu,Bo Duan,Yanfeng Zhang,Mingwei Sun,Jingwei Huang 机构:Riemann Lab, Huawei Technologies, Wuhan University 摘要:我们介绍了MVLayoutNet,一种端到端网络,用于从多视图全景图进行整体三维重建。我们的核心贡献是将学习到的单目布局估计和多视图立体(MVS)无缝结合起来,在3D和图像空间进行精确的布局重建。我们联合训练一个版图模块生成初始版图,并训练一个新颖的MVS模块获得精确的版图几何图形。与标准MVSNet[33]不同,我们的MVS模块采用新提出的布局成本量,将同一深度层的多视图成本聚合到相应的布局元素中。此外,我们还提供了一个基于注意的方案,引导MVS模块关注结构区域。这种设计既考虑了局部像素级成本,又考虑了全局整体信息,以实现更好的重建。实验表明,在2D-3D-S[1]和ZInD[5]数据集上,我们的方法在深度rmse方面的性能分别比现有的方法高21.7%和20.6%。最后,我们的方法产生了一致的布局几何体,可以重建整个场景。 摘要:We present MVLayoutNet, an end-to-end network for holistic 3D reconstruction from multi-view panoramas. Our core contribution is to seamlessly combine learned monocular layout estimation and multi-view stereo (MVS) for accurate layout reconstruction in both 3D and image space. We jointly train a layout module to produce an initial layout and a novel MVS module to obtain accurate layout geometry. Unlike standard MVSNet [33], our MVS module takes a newly-proposed layout cost volume, which aggregates multi-view costs at the same depth layer into corresponding layout elements. We additionally provide an attention-based scheme that guides the MVS module to focus on structural regions. Such a design considers both local pixel-level costs and global holistic information for better reconstruction. Experiments show that our method outperforms state-of-the-arts in terms of depth rmse by 21.7% and 20.6% on the 2D-3D-S [1] and ZInD [5] datasets. Finally, our method leads to coherent layout geometry that enables the reconstruction of an entire scene.
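The layout cost volume can be illustrated with a small aggregation routine: per-pixel multi-view matching costs at each depth hypothesis are pooled into their layout elements (wall/floor/ceiling instances), so each element gets a single cost curve over depth. A rough sketch with assumed tensor shapes, not the authors' module:

```python
import torch

def layout_cost_volume(pixel_costs, element_ids, num_elements):
    """Aggregate per-pixel matching costs of the same depth layer into their
    layout elements. pixel_costs: (D, H, W) float costs over D depth
    hypotheses; element_ids: (H, W) int64 layout-element labels."""
    D, H, W = pixel_costs.shape
    flat_costs = pixel_costs.reshape(D, -1)                  # (D, H*W)
    ids = element_ids.reshape(-1)                            # (H*W,)
    sums = torch.zeros(D, num_elements).index_add_(1, ids, flat_costs)
    counts = torch.zeros(num_elements).index_add_(
        0, ids, torch.ones_like(ids, dtype=torch.float))
    return sums / counts.clamp(min=1.0)                      # (D, num_elements)
```

Taking the argmin of each element's cost curve then assigns one depth per layout element, which is what makes the result a coherent layout rather than a free-form depth map.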

其他神经网络|深度学习|模型|建模(14篇)

【1】 DenseGAP: Graph-Structured Dense Correspondence Learning with Anchor Points 标题:DenseGAP:带锚点的图结构密集对应学习 链接:https://arxiv.org/abs/2112.06910

作者:Zhengfei Kuang,Jiaman Li,Mingming He,Tong Wang,Yajie Zhao 机构:University of Southern California, USC Institute for Creative Technologies, Stanford University 摘要:在两幅图像之间建立稠密的对应关系是一个基本的计算机视觉问题,通常通过匹配局部特征描述符来解决。然而,如果缺乏全局感知,这种局部特征往往不足以消除相似区域的歧义。而计算图像间的两两特征相关性不仅计算量大,而且占用大量内存。为了使局部特征能够感知全局环境并提高其匹配精度,我们引入了一种新的解决方案DenseGAP——基于锚点条件的图结构神经网络的高效稠密对应学习。具体地说,我们首先提出了一种图结构,该结构利用锚点在图像间和图像内上下文上提供稀疏但可靠的先验信息,并通过有向边将它们传播到所有图像点。我们还设计了一个图结构网络,通过轻量级的消息传递层来广播多级上下文,并以低内存成本生成高分辨率的特征映射。最后,基于预测的特征映射,我们引入了一个从粗到精的框架,利用循环一致性进行精确的对应预测。我们的特征描述符捕获局部和全局信息,从而支持连续特征场以高分辨率查询任意点。通过全面的消融实验和对大规模室内外数据集的评估,我们证明了我们的方法在大多数基准测试中推进了对应学习的最新水平。 摘要:Establishing dense correspondence between two images is a fundamental computer vision problem, which is typically tackled by matching local feature descriptors. However, without global awareness, such local features are often insufficient for disambiguating similar regions. And computing the pairwise feature correlation across images is both computation-expensive and memory-intensive. To make the local features aware of the global context and improve their matching accuracy, we introduce DenseGAP, a new solution for efficient Dense correspondence learning with a Graph-structured neural network conditioned on Anchor Points. Specifically, we first propose a graph structure that utilizes anchor points to provide sparse but reliable prior on inter- and intra-image context and propagates them to all image points via directed edges. We also design a graph-structured network to broadcast multi-level contexts via light-weighted message-passing layers and generate high-resolution feature maps at low memory cost. Finally, based on the predicted feature maps, we introduce a coarse-to-fine framework for accurate correspondence prediction using cycle consistency. Our feature descriptors capture both local and global information, thus enabling a continuous feature field for querying arbitrary points at high resolution. Through comprehensive ablative experiments and evaluations on large-scale indoor and outdoor datasets, we demonstrate that our method advances the state-of-the-art of correspondence learning on most benchmarks.

【2】 Quaternion-Valued Convolutional Neural Network Applied for Acute Lymphoblastic Leukemia Diagnosis 标题:四元数值卷积神经网络在急性淋巴细胞白血病诊断中的应用 链接:https://arxiv.org/abs/2112.06685

作者:Marco Aurélio Granero,Cristhian Xavier Hernández,Marcos Eduardo Valle 机构:Universidade Estadual de Campinas 备注:None 摘要:近年来,随着深度和卷积神经网络的发展,神经网络领域取得了重大进展。尽管当前的许多工作都涉及实值模型,但最近的研究表明,具有超复数值参数的神经网络能够更好地捕获、推广和表示多维数据的复杂性。本文探讨了四元数卷积神经网络在医学模式识别任务中的应用,即急性淋巴细胞白血病的诊断。准确地说,我们比较了实数和四元数卷积神经网络从外周血涂片显微图像中分类淋巴母细胞的性能。四元数值卷积神经网络比其相应的实值网络具有更好或类似的性能,但仅使用其34%的参数。这一结果证实了四元数代数允许用较少的参数从彩色图像中捕获和提取信息。 摘要:The field of neural networks has seen significant advances in recent years with the development of deep and convolutional neural networks. Although many of the current works address real-valued models, recent studies reveal that neural networks with hypercomplex-valued parameters can better capture, generalize, and represent the complexity of multidimensional data. This paper explores the quaternion-valued convolutional neural network application for a pattern recognition task from medicine, namely, the diagnosis of acute lymphoblastic leukemia. Precisely, we compare the performance of real-valued and quaternion-valued convolutional neural networks to classify lymphoblasts from the peripheral blood smear microscopic images. The quaternion-valued convolutional neural network achieved better or similar performance than its corresponding real-valued network but using only 34% of its parameters. This result confirms that quaternion algebra allows capturing and extracting information from a color image with fewer parameters.
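The parameter saving comes from the Hamilton product: one quaternion weight (four real kernels) is shared across all four input components instead of learning sixteen independent kernels. A generic quaternion convolution layer in this spirit (layer hyperparameters and channel packing are illustrative; the paper's architecture details may differ):

```python
import torch
import torch.nn as nn

class QuaternionConv2d(nn.Module):
    """Quaternion convolution via the Hamilton product. Mapping 4*in_q to
    4*out_q channels costs 4*in_q*out_q*k^2 weights, i.e. roughly 1/4 of an
    equally wide real convolution's 16*in_q*out_q*k^2."""
    def __init__(self, in_q, out_q, kernel_size, padding=0):
        super().__init__()
        conv = lambda: nn.Conv2d(in_q, out_q, kernel_size, padding=padding, bias=False)
        self.r, self.i, self.j, self.k = conv(), conv(), conv(), conv()

    def forward(self, x):
        # Input channels are assumed packed as [r, i, j, k] components.
        r, i, j, k = torch.chunk(x, 4, dim=1)
        out_r = self.r(r) - self.i(i) - self.j(j) - self.k(k)
        out_i = self.r(i) + self.i(r) + self.j(k) - self.k(j)
        out_j = self.r(j) - self.i(k) + self.j(r) + self.k(i)
        out_k = self.r(k) + self.i(j) - self.j(i) + self.k(r)
        return torch.cat([out_r, out_i, out_j, out_k], dim=1)
```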

【3】 Ex-Model: Continual Learning from a Stream of Trained Models 标题:EX-Model:从训练有素的模型流中持续学习 链接:https://arxiv.org/abs/2112.06511

作者:Antonio Carta,Andrea Cossu,Vincenzo Lomonaco,Davide Bacciu 机构:University of Pisa, Scuola Normale Superiore 摘要:从非平稳数据流中不断学习是近几年来日益流行的一个具有挑战性的研究课题。能够以高效、有效和可扩展的方式不断学习、适应和推广是人工智能系统可持续发展的基础。然而,以代理为中心的持续学习观点要求直接从原始数据中学习,这限制了独立代理之间的交互、当前方法的效率和隐私。相反,我们认为,持续学习系统应该以经过训练的模型的形式利用压缩信息的可用性。在本文中,我们介绍并形式化了一种新的范例“Ex-Model连续学习”(ExML),其中agent从一系列先前训练的模型中学习,而不是从原始数据中学习。我们还提供了三种ex-model连续学习算法和一个由三个数据集(MNIST、CIFAR-10和CORe50)组成的经验设置,以及八个场景,其中对提出的算法进行了广泛测试。最后,我们强调了前模型范式的特点,并指出了有趣的未来研究方向。 摘要:Learning continually from non-stationary data streams is a challenging research topic of growing popularity in the last few years. Being able to learn, adapt, and generalize continually in an efficient, effective, and scalable way is fundamental for a sustainable development of Artificial Intelligent systems. However, an agent-centric view of continual learning requires learning directly from raw data, which limits the interaction between independent agents, the efficiency, and the privacy of current approaches. Instead, we argue that continual learning systems should exploit the availability of compressed information in the form of trained models. In this paper, we introduce and formalize a new paradigm named "Ex-Model Continual Learning" (ExML), where an agent learns from a sequence of previously trained models instead of raw data. We further contribute with three ex-model continual learning algorithms and an empirical setting comprising three datasets (MNIST, CIFAR-10 and CORe50), and eight scenarios, where the proposed algorithms are extensively tested. Finally, we highlight the peculiarities of the ex-model paradigm and we point out interesting future research directions.

【4】 Semantically Contrastive Learning for Low-light Image Enhancement 标题:用于微光图像增强的语义对比学习 链接:https://arxiv.org/abs/2112.06451

作者:Dong Liang,Ling Li,Mingqiang Wei,Shuo Yang,Liyan Zhang,Wenhan Yang,Yun Du,Huiyu Zhou 机构:Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Collaborative Innovation Center of Novel Software Technology and Industrialization, vivo Mobile Communication, Nanyang Technological University 摘要:由于单个RGB图像普遍存在的低对比度和弱可见性问题,微光图像增强(LLE)仍然具有挑战性。在本文中,我们回答了一个有趣的与学习相关的问题——如果利用可访问的未配对过曝光/欠曝光图像和高级语义指导,可以提高尖端LLE模型的性能吗?在此,我们提出了一种有效的语义对比学习范式(即SCL-LLE)。在现有LLE智能的基础上,将图像增强任务转化为多任务联合学习,将LLE转化为对比学习、语义亮度一致性和特征保持三个约束条件,同时保证曝光、纹理和颜色的一致性。SCL-LLE允许LLE模型从未配对的正片(正常光)/负片(过度/欠曝光)中学习,并使其能够与场景语义交互以规范图像增强网络,然而在以前的方法中很少研究高级语义知识和低级信号先验的交互。在现成的开放数据上进行训练,大量实验表明,我们的方法在六个独立的跨场景数据集上超过了最先进的LLE模型。此外,还讨论了SCL-LLE在极端黑暗条件下有利于下游语义分割的潜力。源代码:https://github.com/LingLIx/SCL-LLE. 摘要:Low-light image enhancement (LLE) remains challenging due to the unfavorable prevailing low-contrast and weak-visibility problems of single RGB images. In this paper, we respond to the intriguing learning-related question -- if leveraging both accessible unpaired over/underexposed images and high-level semantic guidance, can improve the performance of cutting-edge LLE models? Here, we propose an effective semantically contrastive learning paradigm for LLE (namely SCL-LLE). Beyond the existing LLE wisdom, it casts the image enhancement task as multi-task joint learning, where LLE is converted into three constraints of contrastive learning, semantic brightness consistency, and feature preservation for simultaneously ensuring the exposure, texture, and color consistency. SCL-LLE allows the LLE model to learn from unpaired positives (normal-light)/negatives (over/underexposed), and enables it to interact with the scene semantics to regularize the image enhancement network, yet the interaction of high-level semantic knowledge and the low-level signal prior is seldom investigated in previous methods. Training on readily available open data, extensive experiments demonstrate that our method surpasses the state-of-the-arts LLE models over six independent cross-scenes datasets. Moreover, SCL-LLE's potential to benefit the downstream semantic segmentation under extremely dark conditions is discussed. Source Code: https://github.com/LingLIx/SCL-LLE.
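The contrastive constraint can be sketched as an InfoNCE-style loss in which unpaired normal-light images act as positives and over/under-exposed ones as negatives; the feature extractor, temperature, and exact form here are assumptions rather than the paper's loss:

```python
import torch
import torch.nn.functional as F

def exposure_contrastive_loss(f_enhanced, f_normal, f_negatives, tau=0.5):
    """Pull the enhanced image's features toward unpaired normal-light images
    and push them away from over/under-exposed ones.
    f_enhanced, f_normal: (B, C); f_negatives: (B, N, C)."""
    q = F.normalize(f_enhanced, dim=-1)
    pos = (q * F.normalize(f_normal, dim=-1)).sum(-1, keepdim=True)        # (B, 1)
    neg = torch.einsum('bc,bnc->bn', q, F.normalize(f_negatives, dim=-1))  # (B, N)
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```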

【5】 Image Reconstruction from Events. Why learn it? 标题:从事件中重建图像。为什么要学呢? 链接:https://arxiv.org/abs/2112.06242

作者:Zelin Zhang,Anthony Yezzi,Guillermo Gallego 机构:Georgia Tech, TU Berlin, ECDF, SCIoI 备注:18 pages, 13 figures, 5 tables 摘要:传统相机测量图像强度。相反,事件相机异步测量每像素的时间强度变化。从事件中恢复强度是一个热门的研究课题,因为重建图像继承了事件的高动态范围(HDR)和高速特性;因此,它们可以用于许多机器人视觉应用,并生成慢动作HDR视频。然而,最先进的方法通过训练一个事件到图像的递归神经网络(RNN)来解决这个问题,RNN缺乏可解释性,难以调整。在这项工作中,我们首次展示了如何解决运动和强度估计的联合问题,从而将基于事件的图像重建建模为一个线性逆问题,该问题可以在不训练图像重建RNN的情况下解决。取而代之的是,可以使用经典的和基于学习的图像先验来解决问题,并从重建图像中去除伪影。实验表明,尽管只使用了来自短时间间隔的数据(即没有循环连接),所提出的方法生成的图像在视觉质量上可与最先进的方法相媲美。我们的方法也可以用于改善那些首先估计图像拉普拉斯的方法所重建图像的质量;在这里,我们的方法可以解释为由图像先验引导的泊松重建。 摘要:Traditional cameras measure image intensity. Event cameras, by contrast, measure per-pixel temporal intensity changes asynchronously. Recovering intensity from events is a popular research topic since the reconstructed images inherit the high dynamic range (HDR) and high-speed properties of events; hence they can be used in many robotic vision applications and to generate slow-motion HDR videos. However, state-of-the-art methods tackle this problem by training an event-to-image recurrent neural network (RNN), which lacks explainability and is difficult to tune. In this work we show, for the first time, how tackling the joint problem of motion and intensity estimation leads us to model event-based image reconstruction as a linear inverse problem that can be solved without training an image reconstruction RNN. Instead, classical and learning-based image priors can be used to solve the problem and remove artifacts from the reconstructed images. The experiments show that the proposed approach generates images with visual quality on par with state-of-the-art methods despite only using data from a short time interval (i.e., without recurrent connections). Our method can also be used to improve the quality of images reconstructed by approaches that first estimate the image Laplacian; here our method can be interpreted as Poisson reconstruction guided by image priors.

【6】 HerosNet: Hyperspectral Explicable Reconstruction and Optimal Sampling Deep Network for Snapshot Compressive Imaging 标题:HerosNet:快照压缩成像的高光谱可解释重建和最优采样深度网络 链接:https://arxiv.org/abs/2112.06238

作者:Xuanyu Zhang,Yongbing Zhang,Ruiqin Xiong,Qilin Sun,Jian Zhang 机构:Peking University Shenzhen Graduate School, Shenzhen, China, Harbin Institute of Technology, Shenzhen, China, Peking University, Beijing, China, The Chinese University of Hong Kong, Shenzhen, China 备注:10 pages, 8 figures 摘要:高光谱成像是一种重要的成像方式,有着广泛的应用,特别是在遥感、农业和医学领域。由于现有的高光谱相机要么速度慢、要么价格昂贵、要么体积庞大,从低成本快照测量重建高光谱图像(HSI)引起了广泛关注。通过将截断的数值优化算法映射到具有固定阶段数的网络中,最近用于光谱快照压缩感知(SCI)的深度展开网络(DUN)取得了显著的成功。然而,由于缺乏跨阶段特征交互和自适应参数调整,DUN距离工业应用仍有差距。在本文中,我们提出了一种新的用于SCI的高光谱可解释重建和最优采样深度网络,称为HerosNet,该网络在ISTA展开框架下包含多个阶段。每个阶段在梯度下降步骤中可以灵活地模拟传感矩阵并根据上下文调整步长,在近端映射步骤中对前几阶段的隐藏状态进行分层融合和交互,从而有效地恢复当前HSI帧。同时,对硬件友好的最优二值掩模进行端到端学习,进一步提高重构性能。最后,我们的HerosNet经过验证,在模拟和真实数据集上都大大优于最先进的方法。 摘要:Hyperspectral imaging is an essential imaging modality for a wide range of applications, especially in remote sensing, agriculture, and medicine. Inspired by existing hyperspectral cameras that are either slow, expensive, or bulky, reconstructing hyperspectral images (HSIs) from a low-budget snapshot measurement has drawn wide attention. By mapping a truncated numerical optimization algorithm into a network with a fixed number of phases, recent deep unfolding networks (DUNs) for spectral snapshot compressive sensing (SCI) have achieved remarkable success. However, DUNs are far from reaching the scope of industrial applications limited by the lack of cross-phase feature interaction and adaptive parameter adjustment. In this paper, we propose a novel Hyperspectral Explicable Reconstruction and Optimal Sampling deep Network for SCI, dubbed HerosNet, which includes several phases under the ISTA-unfolding framework. Each phase can flexibly simulate the sensing matrix and contextually adjust the step size in the gradient descent step, and hierarchically fuse and interact the hidden states of previous phases to effectively recover current HSI frames in the proximal mapping step. Simultaneously, a hardware-friendly optimal binary mask is learned end-to-end to further improve the reconstruction performance. Finally, our HerosNet is validated to outperform the state-of-the-art methods on both simulation and real datasets by large margins.
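The ISTA-unfolding backbone behind such networks alternates a gradient step on the data-fidelity term with a learned proximal mapping. A generic single phase is sketched below (HerosNet's actual phases add cross-phase hidden-state fusion and learnable binary masks, which are omitted here):

```python
import torch
import torch.nn as nn

class ISTAPhase(nn.Module):
    """One unfolded ISTA phase: gradient descent on 0.5*||Ax - y||^2 with a
    learnable step size, followed by a learned proximal mapping.
    A: (M, N) sensing matrix, x: (B, N) signal, y: (B, M) measurement."""
    def __init__(self, n):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.5))       # learnable step size
        self.prox = nn.Sequential(                        # learned proximal map
            nn.Linear(n, n), nn.ReLU(), nn.Linear(n, n))

    def forward(self, x, y, A):
        grad = (x @ A.t() - y) @ A                        # gradient of the data term
        z = x - self.step * grad                          # gradient descent step
        return self.prox(z)                               # proximal mapping step

# Unrolling K phases gives a trainable reconstruction network:
#   for phase in phases: x = phase(x, y, A)
```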

【7】 Deep network for rolling shutter rectification 标题:卷帘式纠偏深网 链接:https://arxiv.org/abs/2112.06170

作者:Praveen K,Lokesh Kumar T,A. N. Rajagopalan 摘要:CMOS传感器在对场景成像时采用逐行采集机制,这可能导致捕获图像中出现不希望的运动伪影,称为滚动快门(RS)失真。现有的单图像RS校正方法试图通过以下方式解释这些失真:要么使用为特定类别场景定制的算法(这需要已知相机内参信息),要么使用基于学习的框架(这需要已知的真值运动参数)。在本文中,我们提出了一种端到端的深度神经网络,用于单图像RS校正这一具有挑战性的任务。我们的网络由运动块、轨迹模块、行块、RS校正模块和RS再生模块(仅在训练期间使用)组成。运动块预测输入RS畸变图像每一行的相机姿态,而轨迹模块将估计的运动参数拟合为三阶多项式。行块预测必须与目标(即RS校正图像)中每个像素相关联的相机运动。最后,RS校正模块利用运动轨迹和行块的输出对输入RS图像进行扭曲,得到无失真的图像。为了在训练过程中更快地收敛,我们另外使用了一个RS再生模块,该模块将输入RS图像与由估计运动参数扭曲真值图像所得的结果进行比较。我们模型中的端到端公式没有将估计的运动约束到真值运动参数,从而成功地校正了具有复杂真实相机运动的RS图像。在合成数据集和真实数据集上的实验表明,我们的网络在定性和定量上都优于现有技术。 摘要:CMOS sensors employ row-wise acquisition mechanism while imaging a scene, which can result in undesired motion artifacts known as rolling shutter (RS) distortions in the captured image. Existing single image RS rectification methods attempt to account for these distortions by either using algorithms tailored for specific class of scenes which warrants information of intrinsic camera parameters or a learning-based framework with known ground truth motion parameters. In this paper, we propose an end-to-end deep neural network for the challenging task of single image RS rectification. Our network consists of a motion block, a trajectory module, a row block, an RS rectification module and an RS regeneration module (which is used only during training). The motion block predicts camera pose for every row of the input RS distorted image while the trajectory module fits estimated motion parameters to a third-order polynomial. The row block predicts the camera motion that must be associated with every pixel in the target i.e, RS rectified image. Finally, the RS rectification module uses motion trajectory and the output of row block to warp the input RS image to arrive at a distortion-free image. For faster convergence during training, we additionally use an RS regeneration module which compares the input RS image with the ground truth image distorted by estimated motion parameters. The end-to-end formulation in our model does not constrain the estimated motion to ground-truth motion parameters, thereby successfully rectifying the RS images with complex real-life camera motion. Experiments on synthetic and real datasets reveal that our network outperforms prior art both qualitatively and quantitatively.
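The trajectory module's role, fitting noisy per-row pose predictions to a third-order polynomial in the row index, can be reproduced offline with plain least squares. A standalone sketch (the paper learns this inside the network; shapes and the row normalization are assumptions):

```python
import numpy as np

def fit_row_trajectory(row_poses, degree=3):
    """Fit per-row camera-pose parameters to a degree-3 polynomial in the
    (normalized) row index and return the smoothed trajectory.
    row_poses: (H, P) array, one P-dim pose vector per image row."""
    H, P = row_poses.shape
    rows = np.linspace(0.0, 1.0, H)
    coeffs = np.polyfit(rows, row_poses, degree)   # (degree+1, P), per-dim fit
    V = np.vander(rows, degree + 1)                # columns [r^3, r^2, r, 1]
    return V @ coeffs                              # (H, P) smoothed poses
```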

【8】 PRNet: A Periodic Residual Learning Network for Crowd Flow Forecasting 标题:PRNet:一种用于人群流量预测的周期性残差学习网络 链接:https://arxiv.org/abs/2112.06132

作者:Chengxin Wang,Yuxuan Liang,Gary Tan 机构:National University of Singapore 摘要:人群流量预测,例如预测进入或离开特定区域的人群,对于现实世界的城市应用具有重要意义。人流数据的关键特性之一是周期性:以固定时间间隔出现的模式,如每周模式。为了捕捉这种周期性,现有的研究要么基于周期隐藏状态显式地对其建模,要么通过将所有周期段反馈到神经网络来隐式地学习它。在本文中,我们设计了一种新的周期剩余学习网络(PRNet),以更好地模拟人流数据的周期性。与现有方法不同,PRNet通过对输入(前一时间段)和输出(未来时间段)之间的偏差进行建模,将人群流量预测作为一个周期性剩余学习问题。与直接预测高度动态的人群流相比,学习这种平稳偏差要容易得多,从而便于模型训练。此外,学习到的偏差使网络能够在每个时间间隔产生未来条件与其相应每周观测值之间的残差,因此有助于更好的预测。我们进一步提出了一种轻量级的空间信道增强编码器,通过联合捕获全局空间相关性和时间相关性来构建更强大的区域表示。在两个真实数据集上的实验结果表明,PRNet在准确性和鲁棒性方面都优于最先进的方法。 摘要:Crowd flow forecasting, e.g., predicting the crowds entering or leaving certain regions, is of great importance to real-world urban applications. One of the key properties of crowd flow data is periodicity: a pattern that occurs at regular time intervals, such as a weekly pattern. To capture such periodicity, existing studies either explicitly model it based on the periodic hidden states or implicitly learn it by feeding all periodic segments into neural networks. In this paper, we devise a novel periodic residual learning network (PRNet) for better modeling the periodicity in crowd flow data. Differing from existing methods, PRNet frames the crowd flow forecasting as a periodic residual learning problem by modeling the deviation between the input (the previous time period) and the output (the future time period). As compared to predicting highly dynamic crowd flows directly, learning such stationary deviation is much easier, which thus facilitates the model training. Besides, the learned deviation enables the network to produce the residual between future conditions and its corresponding weekly observations at each time interval, and therefore contributes to substantially better predictions. We further propose a lightweight Spatial-Channel Enhanced Encoder to build more powerful region representations, by jointly capturing global spatial correlations and temporal dependencies. Experimental results on two real-world datasets demonstrate that PRNet outperforms the state-of-the-art methods in terms of both accuracy and robustness.
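The periodic residual idea reduces to: condition on the observations from the same future time slots one period earlier, and let the network predict only the (more stationary) deviation from them. A minimal sketch where `backbone` stands in for the paper's spatial-channel enhanced encoder and is assumed to map (B, 2T, H, W) to (B, T, H, W):

```python
import torch
import torch.nn as nn

class PeriodicResidualForecaster(nn.Module):
    """Forecast = periodic reference + learned residual, instead of
    regressing the highly dynamic crowd flows directly."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, recent, weekly_ref):
        # recent: flows just observed; weekly_ref: flows at the target future
        # slots one period (e.g., one week) earlier.
        residual = self.backbone(torch.cat([recent, weekly_ref], dim=1))
        return weekly_ref + residual
```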

【9】 Magnifying Networks for Images with Billions of Pixels 标题:数十亿像素图像的放大网络 链接:https://arxiv.org/abs/2112.06121

作者:Neofytos Dimitriou,Ognjen Arandjelovic 机构:University of St Andrews, KY,SX, United Kingdom 摘要:向端到端深度学习的转变在计算机视觉的许多领域带来了前所未有的进步。然而,在某些情况下,输入图像过大,使得端到端方法不可行。在本文中,我们介绍了一种新的网络,即放大网络(MagNet),它可以独立于输入图像的大小进行端到端的训练。MagNet以一种新的方式将卷积神经网络与可微空间变换器相结合,以导航并成功地从数十亿像素的图像中学习。从普通明场(brightfield)显微镜的放大特性中汲取灵感,MagNet处理图像的下采样版本,并在没有监督的情况下学习如何识别可能对手头任务有价值的区域,对其进行上采样,并在每个提取的图像块上递归重复此过程。我们在公开的Camelyon16和Camelyon17数据集上的结果首先证实了MagNet和所提出优化框架的有效性,其次,证明了MagNet内置透明度的优势,这是对医疗诊断等关键过程至关重要的属性。 摘要:The shift towards end-to-end deep learning has brought unprecedented advances in many areas of computer vision. However, there are cases where the input images are excessively large, deeming end-to-end approaches impossible. In this paper, we introduce a new network, the Magnifying Network (MagNet), which can be trained end-to-end independently of the input image size. MagNets combine convolutional neural networks with differentiable spatial transformers, in a new way, to navigate and successfully learn from images with billions of pixels. Drawing inspiration from the magnifying nature of an ordinary brightfield microscope, a MagNet processes a downsampled version of an image, and without supervision learns how to identify areas that may carry value to the task at hand, upsamples them, and recursively repeats this process on each of the extracted patches. Our results on the publicly available Camelyon16 and Camelyon17 datasets first corroborate to the effectiveness of MagNets and the proposed optimization framework and second, demonstrate the advantage of Magnets' built-in transparency, an attribute of utmost importance for critical processes such as medical diagnosis.

【10】 Learning from the Tangram to Solve Mini Visual Tasks 标题:学习七巧板解决微型视觉任务 链接:https://arxiv.org/abs/2112.06113

作者:Yizhou Zhao,Liang Qiu,Pan Lu,Feng Shi,Tian Han,Song-Chun Zhu 机构:UCLA Center for Vision, Cognition, Learning, and Autonomy, Stevens Institute of Technology 摘要:当前计算机视觉的预训练方法主要关注日常生活环境中的自然图像。然而,像图标和符号这样的抽象图在现实世界中很常见,也很重要。本工作的灵感来自七巧板,一款需要用七块裁切形状复原抽象图案的游戏。通过记录人类解决七巧板难题的经验,我们展示了七巧板数据集,并表明在七巧板上预先训练的神经模型有助于解决一些基于低分辨率视觉的小型视觉任务。大量的实验表明,我们提出的方法可以为折叠衣服和评估房间布局等美学任务生成智能解决方案。经过预训练的特征提取器可以促进手写体上小样本学习任务的收敛,并提高根据轮廓识别图标的准确性。Tangram数据集可在 https://github.com/yizhouzhao/Tangram 获取。 摘要:Current pre-training methods in computer vision focus on natural images in the daily-life context. However, abstract diagrams such as icons and symbols are common and important in the real world. This work is inspired by Tangram, a game that requires replicating an abstract pattern from seven dissected shapes. By recording human experience in solving tangram puzzles, we present the Tangram dataset and show that a pre-trained neural model on the Tangram helps solve some mini visual tasks based on low-resolution vision. Extensive experiments demonstrate that our proposed method generates intelligent solutions for aesthetic tasks such as folding clothes and evaluating room layouts. The pre-trained feature extractor can facilitate the convergence of few-shot learning tasks on human handwriting and improve the accuracy in identifying icons by their contours. The Tangram dataset is available at https://github.com/yizhouzhao/Tangram.

【11】 Controlled-rearing studies of newborn chicks and deep neural networks 标题:初生雏鸡的控制饲养研究与深度神经网络 链接:https://arxiv.org/abs/2112.06106

作者:Donsuk Lee,Pranav Gujarathi,Justin N. Wood 机构:Department of Informatics, Indiana University, Bloomington, IN , Department of Computer Science, Departments of Informatics, Psychology, Neuroscience 备注:NeurIPS 2021 Workshop on Shared Visual Representations in Human & Machine Intelligence 摘要:卷积神经网络(CNN)现在可以在具有挑战性的目标识别任务上实现人类水平的性能。CNN也是预测视觉识别任务中神经和行为反应的主要定量模型。然而,CNN模型有一个广为接受的批评:与新生动物学习迅速有效不同,CNN被认为是“数据饥饿”,需要大量的训练数据来开发准确的对象识别模型。这种批评挑战了CNN作为视觉发展模型的前景。在这里,我们通过对新生小鸡和CNN进行平行对照饲养实验,直接检验CNN是否比新生动物更需要数据。我们在严格控制的视觉环境中饲养新生小鸡,然后通过在视频游戏引擎中构建虚拟动物室来模拟该环境中可用的训练数据。我们记录了在虚拟室中移动的代理获取的视觉图像,并使用这些图像来训练CNN。当CNN接收到与小鸡相似的视觉训练数据时,CNN成功地解决了与小鸡相同的具有挑战性的视图不变对象识别任务。因此,CNN并不比动物更需要数据:CNN和小鸡都成功地从单个对象的训练数据开发出健壮的对象模型。 摘要:Convolutional neural networks (CNNs) can now achieve human-level performance on challenging object recognition tasks. CNNs are also the leading quantitative models in terms of predicting neural and behavioral responses in visual recognition tasks. However, there is a widely accepted critique of CNN models: unlike newborn animals, which learn rapidly and efficiently, CNNs are thought to be "data hungry," requiring massive amounts of training data to develop accurate models for object recognition. This critique challenges the promise of using CNNs as models of visual development. Here, we directly examined whether CNNs are more data hungry than newborn animals by performing parallel controlled-rearing experiments on newborn chicks and CNNs. We raised newborn chicks in strictly controlled visual environments, then simulated the training data available in that environment by constructing a virtual animal chamber in a video game engine. We recorded the visual images acquired by an agent moving through the virtual chamber and used those images to train CNNs. When CNNs received similar visual training data as chicks, the CNNs successfully solved the same challenging view-invariant object recognition tasks as the chicks. Thus, the CNNs were not more data hungry than animals: both CNNs and chicks successfully developed robust object models from training data of a single object.

【12】 Curvature-guided dynamic scale networks for Multi-view Stereo 标题:曲率引导的多视点立体动态比例尺网络 链接:https://arxiv.org/abs/2112.05999

作者:Khang Truong Giang,Soohwan Song,Sungho Jo 摘要:多视点立体(MVS)是精确三维重建的关键任务。最新的研究试图通过设计聚合的3D成本量及其正则化来提高MVS中匹配成本量的性能。本文主要研究学习一个鲁棒的特征提取网络,以提高匹配性能,而无需在其他步骤中进行大量计算。特别地,我们提出了一种动态尺度特征提取网络,即CDSFNet。它由多个新的卷积层组成,每个卷积层可以根据图像表面的法曲率为每个像素选择合适的面片尺度。因此,CDSFNet可以估计最佳的面片尺度来学习鉴别特征,以便在参考图像和源图像之间进行精确的匹配计算。通过将稳健的提取特征与适当的成本公式策略相结合,我们得到的MVS体系结构可以更精确地估计深度图。大量实验表明,在复杂的室外场景中,该方法的性能优于其他先进的方法。它显著提高了重建模型的完整性。因此,与其他MVS方法相比,该方法可以在更快的运行时间和更低的内存内处理更高分辨率的输入。我们的源代码见 https://github.com/TruongKhang/cds-mvsnet 。 摘要:Multi-view stereo (MVS) is a crucial task for precise 3D reconstruction. Most recent studies tried to improve the performance of matching cost volume in MVS by designing aggregated 3D cost volumes and their regularization. This paper focuses on learning a robust feature extraction network to enhance the performance of matching costs without heavy computation in the other steps. In particular, we present a dynamic scale feature extraction network, namely, CDSFNet. It is composed of multiple novel convolution layers, each of which can select a proper patch scale for each pixel guided by the normal curvature of the image surface. As a result, CDSFNet can estimate the optimal patch scales to learn discriminative features for accurate matching computation between reference and source images. By combining the robust extracted features with an appropriate cost formulation strategy, our resulting MVS architecture can estimate depth maps more precisely. Extensive experiments showed that the proposed method outperforms other state-of-the-art methods on complex outdoor scenes. It significantly improves the completeness of reconstructed models. As a result, the method can process higher resolution inputs within faster run-time and lower memory than other MVS methods. Our source code is available at https://github.com/TruongKhang/cds-mvsnet.

【13】 LC-FDNet: Learned Lossless Image Compression with Frequency Decomposition Network 标题:LC-FDNet:基于频率分解网络的学习型无损图像压缩 链接:https://arxiv.org/abs/2112.06417

作者:Hochang Rhee,Yeong Il Jang,Seyun Kim,Nam Ik Cho 机构:Seoul National University, Seoul, Korea, Dept. of Electrical and Computer Engineering, INMC, Gauss Labs 摘要:最新的基于学习的无损图像压缩方法以子图像为单位对图像进行编码,并实现与传统非学习算法相当的性能。然而,这些方法对低频和高频区域一视同仁,没有考虑高频区域的性能下降。在本文中,我们提出了一种新的无损图像压缩方法,该方法以从粗到精的方式进行编码,将低频和高频区域分离并区别处理。我们首先压缩低频分量,然后使用它们作为额外的输入来编码剩余的高频区域。在这种情况下,低频分量充当强先验,从而改进高频区域的估计。此外,我们还设计了自适应于颜色通道、空间位置和图像特征的频率分解过程。因此,我们的方法可为每幅图像得出最优的低频/高频分量比例。实验表明,该方法在基准高分辨率数据集上达到了最先进的性能。 摘要:Recent learning-based lossless image compression methods encode an image in the unit of subimages and achieve comparable performances to conventional non-learning algorithms. However, these methods do not consider the performance drop in the high-frequency region, giving equal consideration to the low and high-frequency areas. In this paper, we propose a new lossless image compression method that proceeds the encoding in a coarse-to-fine manner to separate and process low and high-frequency regions differently. We initially compress the low-frequency components and then use them as additional input for encoding the remaining high-frequency region. The low-frequency components act as a strong prior in this case, which leads to improved estimation in the high-frequency area. In addition, we design the frequency decomposition process to be adaptive to color channel, spatial location, and image characteristics. As a result, our method derives an image-specific optimal ratio of low/high-frequency components. Experiments show that the proposed method achieves state-of-the-art performance for benchmark high-resolution datasets.

【14】 Specificity-Preserving Federated Learning for MR Image Reconstruction 标题:保持特异性的联合学习在磁共振图像重建中的应用 链接:https://arxiv.org/abs/2112.05752

作者:Chun-Mei Feng,Yunlu Yan,Huazhu Fu,Yong Xu,Ling Shao 机构:Harbin Institute of Technology, Shenzhen; IHPC, A*STAR; IIAI, UAE 备注:12 pages, 8 figures 摘要:联邦学习(FL)可用于改善磁共振(MR)图像重建中的数据隐私和效率,使多个机构能够协作,而无需聚合本地数据。然而,不同的磁共振成像协议引起的域偏移会严重降低FL模型的性能。最近的FL技术倾向于通过增强全局模型的泛化来解决这一问题,但它们忽略了特定于领域的特征,这些特征可能包含关于器件特性的重要信息,并且有助于局部重建。在本文中,我们提出了一种用于MR图像重建(FedMRI)的特异性保持FL算法。其核心思想是将MR重建模型分为两部分:一部分是全局共享编码器,以获得全局级别的广义表示;另一部分是特定于客户端的解码器,以保留每个客户端的特定于域的属性,当客户端具有唯一分布时,这对于协作重建非常重要。此外,为了在存在域偏移时进一步提高全局共享编码器的收敛性,引入了加权对比正则化来直接纠正优化过程中客户端和服务器之间的任何偏差。大量实验表明,我们的FedMRI重建结果最接近多机构数据的基本事实,并且优于最先进的FL方法。 摘要:Federated learning (FL) can be used to improve data privacy and efficiency in magnetic resonance (MR) image reconstruction by enabling multiple institutions to collaborate without needing to aggregate local data. However, the domain shift caused by different MR imaging protocols can substantially degrade the performance of FL models. Recent FL techniques tend to solve this by enhancing the generalization of the global model, but they ignore the domain-specific features, which may contain important information about the device properties and be useful for local reconstruction. In this paper, we propose a specificity-preserving FL algorithm for MR image reconstruction (FedMRI). The core idea is to divide the MR reconstruction model into two parts: a globally shared encoder to obtain a generalized representation at the global level, and a client-specific decoder to preserve the domain-specific properties of each client, which is important for collaborative reconstruction when the clients have unique distribution. Moreover, to further boost the convergence of the globally shared encoder when a domain shift is present, a weighted contrastive regularization is introduced to directly correct any deviation between the client and server during optimization. Extensive experiments demonstrate that our FedMRI's reconstructed results are the closest to the ground-truth for multi-institutional data, and that it outperforms state-of-the-art FL methods.
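The specificity-preserving split can be sketched as "FedAvg on the encoder only": the shared encoder is averaged across sites each round, while decoders stay local. The client interface below (`.encoder`, `.train_locally()`) is an illustrative assumption, not the paper's code:

```python
import copy
import torch

def fedmri_round(server_encoder, clients):
    """One communication round: broadcast the shared encoder, train locally,
    then average only the encoder weights; client-specific decoders never
    leave their sites, preserving domain-specific properties."""
    for c in clients:
        c.encoder.load_state_dict(server_encoder.state_dict())  # broadcast
        c.train_locally()                        # local reconstruction updates
    states = [c.encoder.state_dict() for c in clients]
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)
    server_encoder.load_state_dict(avg)          # FedAvg on the encoder only
```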

其他(20篇)

【1】 Hallucinating Pose-Compatible Scenes 标题:与姿势相容的幻觉场景 链接:https://arxiv.org/abs/2112.06909

作者:Tim Brooks,Alexei A. Efros 机构:UC Berkeley 摘要:关于一个场景,人类姿势告诉我们什么?我们提出了一个任务来回答这个问题:给定人类姿势作为输入,幻觉出一个兼容的场景。人类姿势捕捉到的微妙线索——动作语义、环境启示、对象交互——为哪些场景是兼容的提供了令人惊讶的洞察。我们提出了一个用于姿势条件化场景生成的大规模生成对抗网络。我们显著扩展了训练数据的规模和复杂性,构建了一个包含1900多万帧日常环境中人类活动的海量元数据集。相对于StyleGAN2,我们将模型的容量提高了一倍以处理如此复杂的数据,并设计了一种姿势条件化机制,驱动我们的模型学习姿势和场景之间的细微关系。我们将经过训练的模型用于各种应用:幻觉出(有人或无人的)与姿势兼容的场景,可视化不兼容的场景和姿势,将一幅生成图像中的人放置到另一个场景中,以及姿势动画。我们的模型产生多样的样本,并且在准确的人体放置(正确关键点的百分比)和图像质量(Frechet Inception Distance)方面优于姿势条件化的StyleGAN2和Pix2Pix基线。 摘要:What does human pose tell us about a scene? We propose a task to answer this question: given human pose as input, hallucinate a compatible scene. Subtle cues captured by human pose -- action semantics, environment affordances, object interactions -- provide surprising insight into which scenes are compatible. We present a large-scale generative adversarial network for pose-conditioned scene generation. We significantly scale the size and complexity of training data, curating a massive meta-dataset containing over 19 million frames of humans in everyday environments. We double the capacity of our model with respect to StyleGAN2 to handle such complex data, and design a pose conditioning mechanism that drives our model to learn the nuanced relationship between pose and scene. We leverage our trained model for various applications: hallucinating pose-compatible scene(s) with or without humans, visualizing incompatible scenes and poses, placing a person from one generated image into another scene, and animating pose. Our model produces diverse samples and outperforms pose-conditioned StyleGAN2 and Pix2Pix baselines in terms of accurate human placement (percent of correct keypoints) and image quality (Frechet inception distance).

【2】 The whole and the parts: the MDL principle and the a-contrario framework 标题:整体与部分:MDL原则与a-对立面框架 链接:https://arxiv.org/abs/2112.06853

作者:Rafael Grompone von Gioi,Ignacio Ramírez Paulino,Gregory Randall 机构:Université Paris-Saclay, Universidad de la República 备注:Submitted to SIAM Journal on Imaging Sciences (SIIMS) 摘要:这项工作探讨了Rissanen提出的最小描述长度(MDL)原则与Desolneux、Moisan和Morel提出的结构检测a-contrario框架之间的联系。MDL原则侧重于对整个数据进行最佳解释,而a-contrario方法则侧重于检测具有异常统计的部分数据。尽管在不同的理论形式下,我们表明,这两种方法在其机制中共享许多共同的概念和工具,并在许多有趣的场景中产生非常相似的公式,从简单的玩具示例到实际应用,如曲线的多边形近似和图像中的线段检测。我们还制定了两种方法在形式上等价的条件。 摘要:This work explores the connections between the Minimum Description Length (MDL) principle as developed by Rissanen, and the a-contrario framework for structure detection proposed by Desolneux, Moisan and Morel. The MDL principle focuses on the best interpretation for the whole data while the a-contrario approach concentrates on detecting parts of the data with anomalous statistics. Although framed in different theoretical formalisms, we show that both methodologies share many common concepts and tools in their machinery and yield very similar formulations in a number of interesting scenarios ranging from simple toy examples to practical applications such as polygonal approximation of curves and line segment detection in images. We also formulate the conditions under which both approaches are formally equivalent.
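The a-contrario side has a compact numerical core: a candidate structure is detected when its Number of False Alarms, the number of tests times the tail probability of agreement under the background model, falls below a threshold. A sketch for the binomial case used in alignment/line-segment detection (the parameter values in the example are made up for illustration):

```python
from math import comb

def nfa(num_tests, k, n, p):
    """Number of False Alarms: num_tests * P[B(n, p) >= k], i.e. the expected
    number of candidates in which at least k of n observations agree with the
    structure by pure chance. A detection is 'eps-meaningful' if NFA <= eps."""
    tail = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
    return num_tests * tail

# Example: 10^5 candidate segments, 60 of 80 pixels aligned, alignment
# probability 1/16 under the noise model -> vanishingly small NFA: detect.
print(nfa(num_tests=10**5, k=60, n=80, p=1/16))
```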

【3】 N-SfC: Robust and Fast Shape Estimation from Caustic Images 标题:N-SFC:基于焦散图像的鲁棒快速形状估计 链接:https://arxiv.org/abs/2112.06705

作者:Marc Kassubeck,Moritz Kappel,Susana Castillo,Marcus Magnor 机构: Computer Graphics Lab, TU Braunschweig, Germany 备注:Project Page: this https URL 摘要:本文讨论了一个极具挑战性的问题,即从折射物体产生焦散的单个图像重建折射物体的形状。由于透明折射物体在日常生活中无处不在,其形状的重建具有众多实际应用。最近的焦散形状(SfC)方法将该问题转化为合成焦散图像的光传播模拟的逆问题,该问题可由可微渲染器解决。然而,通过折射表面的光传输的固有复杂性目前限制了重建速度和鲁棒性方面的实用性。为了解决这些问题,我们引入了焦散神经形状(N-SfC),这是一种基于学习的扩展,将两个组件合并到重建管道中:一个去噪模块,可降低光传输模拟的计算成本,以及一个基于学习梯度下降的优化过程,这样可以使用更少的迭代实现更好的收敛。大量的实验证明了我们的神经扩展在3D玻璃印刷质量控制场景中的有效性,在计算速度和最终表面误差方面,我们的表现明显优于当前最先进的技术。 摘要:This paper deals with the highly challenging problem of reconstructing the shape of a refracting object from a single image of its resulting caustic. Due to the ubiquity of transparent refracting objects in everyday life, reconstruction of their shape entails a multitude of practical applications. The recent Shape from Caustics (SfC) method casts the problem as the inverse of a light propagation simulation for synthesis of the caustic image, that can be solved by a differentiable renderer. However, the inherent complexity of light transport through refracting surfaces currently limits the practicability with respect to reconstruction speed and robustness. To address these issues, we introduce Neural-Shape from Caustics (N-SfC), a learning-based extension that incorporates two components into the reconstruction pipeline: a denoising module, which alleviates the computational cost of the light transport simulation, and an optimization process based on learned gradient descent, which enables better convergence using fewer iterations. Extensive experiments demonstrate the effectiveness of our neural extensions in the scenario of quality control in 3D glass printing, where we significantly outperform the current state-of-the-art in terms of computational speed and final surface error.

【4】 SphereSR 标题:球体SR 链接:https://arxiv.org/abs/2112.06536

作者:Youngho Yoon,Inchul Chung,Lin Wang,Kuk-Jin Yoon 机构:Visual Intelligence Lab., KAIST, Korea 摘要:360成像最近获得了极大的关注;然而,它的角度分辨率相对低于窄视场(FOV)透视图像,因为它是使用具有相同传感器尺寸的鱼眼透镜捕获的。因此,对360度图像进行超分辨率处理是有益的。已经进行了一些尝试,但大多数将等矩形投影(ERP)视为360度图像的表示方式之一,尽管其存在与纬度相关的失真。在这种情况下,由于输出的高分辨率(HR)图像始终与低分辨率(LR)输入的ERP格式相同,因此在将HR图像转换为其他投影类型时可能会发生又一次信息丢失。在本文中,我们提出了SphereSR,一种从LR 360图像生成连续球面图像表示的新框架,旨在预测给定球面坐标下的RGB值,以实现任意360图像投影的超分辨率。具体地说,我们首先提出了一个基于二十面体的球面数据特征提取模块,该模块能够有效地提取球面上的特征。然后,我们提出了一个球面局部隐式图像函数(SLIIF)来预测球面坐标下的RGB值。因此,SphereSR可以在任意投影类型下灵活地重建HR图像。在各种基准数据集上的实验表明,我们的方法明显优于现有的方法。 摘要:360 imaging has recently gained great attention; however, its angular resolution is relatively lower than that of a narrow field-of-view (FOV) perspective image as it is captured by using fisheye lenses with the same sensor size. Therefore, it is beneficial to super-resolve a 360 image. Some attempts have been made but mostly considered the equirectangular projection (ERP) as one of the ways for 360 image representation despite latitude-dependent distortions. In that case, as the output high-resolution (HR) image is always in the same ERP format as the low-resolution (LR) input, another information loss may occur when transforming the HR image to other projection types. In this paper, we propose SphereSR, a novel framework to generate a continuous spherical image representation from an LR 360 image, aiming at predicting the RGB values at given spherical coordinates for super-resolution with an arbitrary 360 image projection. Specifically, we first propose a feature extraction module that represents the spherical data based on icosahedron and efficiently extracts features on the spherical surface. We then propose a spherical local implicit image function (SLIIF) to predict RGB values at the spherical coordinates. As such, SphereSR flexibly reconstructs an HR image under an arbitrary projection type. Experiments on various benchmark datasets show that our method significantly surpasses existing methods.

【5】 Self-Paced Deep Regression Forests with Consideration on Ranking Fairness 标题:考虑排序公平性的自定步长深度回归林 链接:https://arxiv.org/abs/2112.06455

作者:Lili Pan,Mingming Meng,Yazhou Ren,Yali Zheng,Zenglin Xu 机构:Yangtze Delta Region Institute, University of Electronic Science and Technology of China; Yazhou Ren is with the School of Computer Science and Engineering, and Yali Zheng is with the School of Automation Engineering 备注:14 pages, 9 figures. The paper has been submitted to TIP and is currently under review 摘要:深度判别模型(deep discriminative models,DDMs),如深度回归森林、深度神经决策森林,近年来得到了广泛的研究,以解决面部年龄估计、头部姿势估计、凝视估计等问题。这类问题之所以具有挑战性,部分原因在于通常无法获得大量无噪声和偏差的有效训练数据。虽然通过学习更具判别性的特征或对样本重新加权已经取得了一些进展,但我们认为更可取的是像人类一样循序渐进地学习辨别。于是,我们求助于自步学习(self-paced learning,SPL)。但一个自然的问题出现了:自步机制能否引导DDM获得更稳健、更少偏差的解?本文首先讨论了SPL的一个严重问题,即它会加剧解的偏差,特别是对于明显不平衡的数据。为此,本文提出了一种面向深度判别模型的新自步范式,该范式根据与每个示例相关的输出似然和熵来区分噪声样本和代表性不足的样本,并从一个新的角度——公平性——解决SPL中的基本排序问题。这个范式是基础性的,可以很容易地与各种DDM结合。在面部年龄估计、头部姿势估计和凝视估计三个计算机视觉任务上的大量实验证明了我们范式的有效性。据我们所知,我们的工作是SPL文献中第一篇在构建自步机制时考虑排序公平性的论文。 摘要:Deep discriminative models (DDMs), such as deep regression forests, deep neural decision forests, have been extensively studied recently to solve problems like facial age estimation, head pose estimation, gaze estimation and so forth. Such problems are challenging in part because a large amount of effective training data without noise and bias is often not available. While some progress has been achieved through learning more discriminative features, or reweighting samples, we argue what is more desirable is to learn gradually to discriminate like human beings. Then, we resort to self-paced learning (SPL). But a natural question arises: can self-paced regime lead DDMs to achieve more robust and less biased solutions? A serious problem with SPL, which is firstly discussed by this work, is it tends to aggravate the bias of solutions, especially for obvious imbalanced data. To this end, this paper proposes a new self-paced paradigm for deep discriminative model, which distinguishes noisy and underrepresented examples according to the output likelihood and entropy associated with each example, and tackle the fundamental ranking problem in SPL from a new perspective: fairness. This paradigm is fundamental, and could be easily combined with a variety of DDMs. Extensive experiments on three computer vision tasks, such as facial age estimation, head pose estimation and gaze estimation, demonstrate the efficacy of our paradigm. To the best of our knowledge, our work is the first paper in the literature of SPL that considers ranking fairness for self-paced regime construction.
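For context, the classic hard self-paced regime the paper builds on is tiny: examples whose loss falls below an "age" parameter lambda are admitted, and lambda grows so harder examples enter gradually. The sketch below shows that baseline only; the paper replaces pure loss ranking with likelihood/entropy criteria and a fairness-aware ranking, which are not reproduced here (the schedule is an assumption):

```python
import torch

def self_paced_weights(losses, lam):
    """Hard SPL weighting: 'easy' examples (loss < lambda) get weight 1,
    the rest are deferred until lambda grows."""
    return (losses < lam).float()

# Typical loop usage:
#   v = self_paced_weights(per_sample_losses.detach(), lam)
#   loss = (v * per_sample_losses).sum() / v.sum().clamp(min=1.0)
#   lam *= 1.1   # admit more examples each epoch (illustrative growth rate)
```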

【6】 Holistic Interpretation of Public Scenes Using Computer Vision and Temporal Graphs to Identify Social Distancing Violations 标题:利用计算机视觉和时间图对公共场景进行整体解释以识别社会疏远违规行为 链接:https://arxiv.org/abs/2112.06428

作者:Gihan Jayatilaka,Jameel Hassan,Suren Sritharan,Janith Bandara Senananayaka,Harshana Weligampola,Roshan Godaliyadda,Parakrama Ekanayake,Vijitha Herath,Janaka Ekanayake,Samath Dharmaratne 机构: Department of Electrical and Electronics Engineering, University of Peradeniya, Department of Computer Science, University of Maryland, School of Computing and IT, Sri Lanka Technological Campus, Department of Informatics, Technical University of Munich 备注:35 pages, 22 figures 摘要:COVID-19大流行引发了前所未有的全球公共卫生危机。鉴于其固有的性质,社交距离措施被建议作为遏制这一流行病蔓延的主要策略。因此,识别违反这些协议的情况对遏制疾病传播和促进可持续生活方式具有重要意义。本文提出了一种新的基于计算机视觉的系统,通过分析CCTV视频对COVID-19的传播进行威胁等级评估。该系统致力于全面捕获和解释跨越多个帧的CCTV视频的信息内容,以识别跨越时间和空间的各种违反社交距离协议的情况,以及识别群体行为。这一功能主要是通过利用基于时间图的结构来表示CCTV视频的信息,并利用一种策略来全面解释该图并量化给定场景的威胁级别来实现的。各个组件在一系列场景中经过测试和验证,整个系统则对照人类专家意见进行了测试。结果反映了威胁程度对人、人与人之间的距离、相互作用、防护服和群体动态的依赖性。系统性能的准确率为76%,因此能够在城市中部署威胁监测系统,以维持社会的正常运转和可持续性。 摘要:The COVID-19 pandemic has caused an unprecedented global public health crisis. Given its inherent nature, social distancing measures are proposed as the primary strategies to curb the spread of this pandemic. Therefore, identifying situations where these protocols are violated, has implications for curtailing the spread of the disease and promoting a sustainable lifestyle. This paper proposes a novel computer vision-based system to analyze CCTV footage to provide a threat level assessment of COVID-19 spread. The system strives to holistically capture and interpret the information content of CCTV footage spanning multiple frames to recognize instances of various violations of social distancing protocols, across time and space, as well as identification of group behaviors. This functionality is achieved primarily by utilizing a temporal graph-based structure to represent the information of the CCTV footage and a strategy to holistically interpret the graph and quantify the threat level of the given scene. The individual components are tested and validated on a range of scenarios and the complete system is tested against human expert opinion. The results reflect the dependence of the threat level on people, their physical proximity, interactions, protective clothing, and group dynamics. The system performance has an accuracy of 76%, thus enabling a deployable threat monitoring system in cities, to permit normalcy and sustainability in the society.

【7】 Hybrid Atlas Building with Deep Registration Priors 标题:具有深度配准先验的混合地图集构建 链接:https://arxiv.org/abs/2112.06406

作者:Nian Wu,Jian Wang,Miaomiao Zhang,Guixu Zhang,Yaxin Peng,Chaomin Shen 机构:Department of Computer Science and Technology, East China Normal University, China; Department of Computer Science, University of Virginia, VA, USA 摘要:在高维图像空间中,基于配准的图谱构建通常会带来计算上的挑战。在本文中,我们介绍了一种新的混合图谱构建算法,该算法可以从大规模图像数据集中快速估计图谱,并大大降低计算成本。与以前在估计的图谱和单个图像之间迭代执行配准任务的方法不同,我们建议使用从预先训练的神经网络获得的配准先验知识。这种新开发的混合框架具有以下几个优点:(i)提供了一种高效的图谱构建方法,而不会损失结果的质量;(ii)在使用各种基于深度学习的配准方法方面提供了灵活性。我们在3D脑磁共振成像(MRI)扫描上证明了该模型的有效性。 摘要:Registration-based atlas building often poses computational challenges in high-dimensional image spaces. In this paper, we introduce a novel hybrid atlas building algorithm that fast estimates atlas from large-scale image datasets with much reduced computational cost. In contrast to previous approaches that iteratively perform registration tasks between an estimated atlas and individual images, we propose to use learned priors of registration from pre-trained neural networks. This newly developed hybrid framework features several advantages of (i) providing an efficient way of atlas building without losing the quality of results, and (ii) offering flexibility in utilizing a wide variety of deep learning based registration methods. We demonstrate the effectiveness of this proposed model on 3D brain magnetic resonance imaging (MRI) scans.

【8】 Deep Attentional Guided Image Filtering 标题:深度注意引导图像滤波 链接:https://arxiv.org/abs/2112.06401

作者:Zhiwei Zhong,Xianming Liu,Junjun Jiang,Debin Zhao,Xiangyang Ji 摘要:引导滤波器是计算机视觉和计算机图形学中的一种基本工具,其目的是将结构信息从引导图像传输到目标图像。现有的方法大多从制导本身构造滤波核,而没有考虑制导与目标之间的相互依赖性。然而,由于在两幅图像中通常存在显著不同的边缘,简单地将制导的所有结构信息传输到目标将导致各种伪影。针对这一问题,我们提出了一种有效的深度注意引导图像过滤框架,该框架的过滤过程可以充分整合两幅图像中包含的互补信息。具体来说,我们提出了一个注意核学习模块,分别从引导和目标生成两组过滤核,然后通过建模两幅图像之间的像素相关性自适应地组合它们。同时,我们提出了一个多尺度引导图像滤波模块,以从粗到精的方式逐步生成具有所构造核的滤波结果。相应地,引入了多尺度融合策略,以重用粗到精过程中的中间结果。大量实验表明,该框架在引导超分辨率、跨模态恢复、纹理去除和语义分割等多种引导图像滤波应用中均优于现有的方法。 摘要:Guided filter is a fundamental tool in computer vision and computer graphics which aims to transfer structure information from guidance image to target image. Most existing methods construct filter kernels from the guidance itself without considering the mutual dependency between the guidance and the target. However, since there typically exist significantly different edges in the two images, simply transferring all structural information of the guidance to the target would result in various artifacts. To cope with this problem, we propose an effective framework named deep attentional guided image filtering, the filtering process of which can fully integrate the complementary information contained in both images. Specifically, we propose an attentional kernel learning module to generate dual sets of filter kernels from the guidance and the target, respectively, and then adaptively combine them by modeling the pixel-wise dependency between the two images. Meanwhile, we propose a multi-scale guided image filtering module to progressively generate the filtering result with the constructed kernels in a coarse-to-fine manner. Correspondingly, a multi-scale fusion strategy is introduced to reuse the intermediate results in the coarse-to-fine process. Extensive experiments show that the proposed framework compares favorably with the state-of-the-art methods in a wide range of guided image filtering applications, such as guided super-resolution, cross-modality restoration, texture removal, and semantic segmentation.

【9】 5th Place Solution for VSPW 2021 Challenge 标题:VSPW 2021挑战赛第5名解决方案 链接:https://arxiv.org/abs/2112.06379

作者:Jiafan Zhuang,Yixin Zhang,Xinyu Hu,Junjie Li,Zilei Wang 机构:University of Science and Technology of China 备注:Presented in ICCV'21 Workshop 摘要:在这篇文章中,我们介绍了我们在VSPW 2021挑战中使用的解决方案。我们的实验基于两个基线模型,即Swin Transformer和MaskFormer。为了进一步提高性能,我们采用了随机权重平均(SWA)技术并设计了分层集成策略。在不使用任何外部语义分割数据集的情况下,我们的解决方案在私人排行榜上排名第五。此外,我们还进行了一些有趣的尝试来解决长尾识别和过拟合问题,并在val子集上取得了改进。可能是由于分布差异,这些尝试在测试子集上不起作用。我们还将介绍这些尝试,并希望能启发其他研究人员。 摘要:In this article, we introduce the solution we used in the VSPW 2021 Challenge. Our experiments are based on two baseline models, Swin Transformer and MaskFormer. To further boost performance, we adopt stochastic weight averaging technique and design hierarchical ensemble strategy. Without using any external semantic segmentation dataset, our solution ranked the 5th place in the private leaderboard. Besides, we have some interesting attempts to tackle long-tail recognition and overfitting issues, which achieves improvement on val subset. Maybe due to distribution difference, these attempts don't work on test subset. We will also introduce these attempts and hope to inspire other researchers.
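Stochastic weight averaging is simple to reproduce with PyTorch's built-in utilities: keep a running average of the weights over the tail of training, then refresh BatchNorm statistics. A sketch assuming an ordinary training setup, with `train_one_epoch` standing in for a standard supervised loop (not the team's exact recipe):

```python
import torch
from torch.optim.swa_utils import AveragedModel, update_bn

def train_with_swa(model, loader, optimizer, train_one_epoch, swa_start, epochs):
    """Average the weights visited over the last (epochs - swa_start) epochs,
    then recompute BatchNorm running statistics for the averaged weights."""
    swa_model = AveragedModel(model)             # running average of weights
    for epoch in range(epochs):
        train_one_epoch(model, loader, optimizer)
        if epoch >= swa_start:
            swa_model.update_parameters(model)   # fold current weights in
    update_bn(loader, swa_model)                 # BN stats for averaged model
    return swa_model
```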

【10】 360-DFPE: Leveraging Monocular 360-Layouts for Direct Floor Plan Estimation 标题:360-DFPE:利用单目360布局进行直接平面图估计 链接:https://arxiv.org/abs/2112.06180

作者:Bolivar Solarte,Yueh-Cheng Liu,Chin-Hsuan Wu,Yi-Hsuan Tsai,Min Sun 机构: 1 National Tsing Hua University 2 Phiar Technologies 摘要:我们提出了360-DFPE,一种顺序楼层平面估计方法,该方法直接将360幅图像作为输入,而不依赖于活动传感器或3D信息。我们的方法利用了单目视觉SLAM解决方案和单目360房间布局方法之间的松散耦合集成,这两种方法分别估计摄像机姿势和布局几何。由于我们的任务是使用单目图像顺序捕获楼层平面,因此整个场景结构、房间实例和房间形状都是未知的。为了应对这些挑战,我们首先通过制定熵最小化过程来处理视觉里程计和布局几何之间的比例差异,这使我们能够直接对齐360个布局,而无需事先了解整个场景。其次,为了顺序识别各个房间,我们提出了一种新的房间识别算法,该算法使用几何信息跟踪相机探测过程中的每个房间。最后,为了估计房间的最终形状,我们提出了一种具有从粗到精迭代策略的最短路径算法,该算法改进了先前的公式,具有更高的精度和更快的运行时间。此外,我们收集了具有挑战性的大规模场景的新楼层平面数据集,提供点云和连续360图像信息。实验结果表明,我们的单目解决方案相对于当前最先进的算法具有良好的性能,这些算法依赖于主动传感器,并且需要提前获得整个场景的重建数据。我们的代码和数据集将很快发布。 摘要:We present 360-DFPE, a sequential floor plan estimation method that directly takes 360-images as input without relying on active sensors or 3D information. Our approach leverages a loosely coupled integration between a monocular visual SLAM solution and a monocular 360-room layout approach, which estimate camera poses and layout geometries, respectively. Since our task is to sequentially capture the floor plan using monocular images, the entire scene structure, room instances, and room shapes are unknown. To tackle these challenges, we first handle the scale difference between visual odometry and layout geometry via formulating an entropy minimization process, which enables us to directly align 360-layouts without knowing the entire scene in advance. Second, to sequentially identify individual rooms, we propose a novel room identification algorithm that tracks every room along the camera exploration using geometry information. Lastly, to estimate the final shape of the room, we propose a shortest path algorithm with an iterative coarse-to-fine strategy, which improves prior formulations with higher accuracy and faster run-time. Moreover, we collect a new floor plan dataset with challenging large-scale scenes, providing both point clouds and sequential 360-image information. Experimental results show that our monocular solution achieves favorable performance against the current state-of-the-art algorithms that rely on active sensors and require the entire scene reconstruction data in advance. Our code and dataset will be released soon.

【11】 Pixel-wise Deep Image Stitching 标题:基于像素的深度图像拼接 链接:https://arxiv.org/abs/2112.06171

作者:Hyeokjun Kweon,Hyeonseong Kim,Yoonsu Kang,Youngho Yoon,Wooseong Jeong,Kuk-Jin Yoon 机构:Visual Intelligence Lab., KAIST, Korea 摘要:图像拼接的目的是将从不同视角拍摄的图像拼接成视野更宽的图像。现有的方法使用估计的扭曲函数将目标图像扭曲到参考图像,单应性是最常用的扭曲函数之一。然而,当图像由于非平面场景和相机的平移运动而具有大视差时,单应性不能完全描述两个图像之间的映射。现有的基于全局或局部单应估计的方法不能摆脱这个问题,并且会由于视差而产生不希望出现的伪影。在本文中,我们提出了一种新的深度图像拼接框架,利用像素级的扭曲场来处理大视差问题,而不是依赖于基于单应性的扭曲。提出的深度图像拼接框架由两个模块组成:像素级扭曲模块(PWM)和缝合图像生成模块(SIGMo)。PWM采用光流估计模型获得整个图像的像素方向扭曲,并利用获得的扭曲场重新定位目标图像的像素。SIGMo混合扭曲的目标图像和参考图像,同时消除不必要的瑕疵,如不对齐、接缝和孔洞,这些瑕疵会损害缝合结果的合理性。为了训练和评估所提出的框架,我们构建了一个大规模的数据集,其中包括具有相应像素级地面真值扭曲的图像对和样本缝合结果图像。我们表明,该框架的结果在质量上优于传统方法,特别是当图像具有大视差时。代码和提议的数据集将很快公开。 摘要:Image stitching aims at stitching the images taken from different viewpoints into an image with a wider field of view. Existing methods warp the target image to the reference image using the estimated warp function, and a homography is one of the most commonly used warping functions. However, when images have large parallax due to non-planar scenes and translational motion of a camera, the homography cannot fully describe the mapping between two images. Existing approaches based on global or local homography estimation are not free from this problem and suffer from undesired artifacts due to parallax. In this paper, instead of relying on the homography-based warp, we propose a novel deep image stitching framework exploiting the pixel-wise warp field to handle the large-parallax problem. The proposed deep image stitching framework consists of two modules: Pixel-wise Warping Module (PWM) and Stitched Image Generating Module (SIGMo). PWM employs an optical flow estimation model to obtain pixel-wise warp of the whole image, and relocates the pixels of the target image with the obtained warp field. SIGMo blends the warped target image and the reference image while eliminating unwanted artifacts such as misalignments, seams, and holes that harm the plausibility of the stitched result. For training and evaluating the proposed framework, we build a large-scale dataset that includes image pairs with corresponding pixel-wise ground truth warp and sample stitched result images. We show that the results of the proposed framework are qualitatively superior to those of the conventional methods, especially when the images have large parallax. The code and the proposed dataset will be publicly available soon.
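Backward warping with a pixel-wise field, the operation at the heart of PWM-style stitching, can be written directly with `grid_sample`. A generic sketch rather than the authors' module; `flow` gives per-pixel displacements in pixels:

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Backward-warp `image` by a dense displacement field.
    image: (B, C, H, W); flow: (B, 2, H, W) as (dx, dy) in pixels."""
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = grid + flow                                       # sample locations
    # grid_sample expects (B, H, W, 2) coordinates normalized to [-1, 1].
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)      # (B, H, W, 2)
    return F.grid_sample(image, norm_grid, align_corners=True)
```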

【12】 Deep Translation Prior: Test-time Training for Photorealistic Style Transfer 标题:深度翻译前:真实感风格迁移的测试时间训练 链接:https://arxiv.org/abs/2112.06150

作者:Sunwoo Kim,Soohyun Kim,Seungryong Kim 机构:Korea University, Seoul, Korea 备注:Accepted to AAAI 2022, Camera-ready version. The code will be made available at this https URL 摘要:在深度卷积神经网络(CNN)中解决照片级真实感风格转换的最新技术通常需要大规模数据集的密集训练,因此对看不见的图像或风格的适用性和泛化能力有限。为了克服这一问题,我们提出了一种称为深度翻译先验(DTP)的新框架,通过对给定输入图像对进行测试时训练,利用未经训练的网络实现照片级真实感风格的转换,该框架学习图像对特定的翻译先验知识,从而获得更好的性能和泛化能力。针对这种风格转换的测试时训练,我们提出了新的网络架构,包括对应模块和生成模块两个子模块,以及由对比式内容损失、风格损失和循环一致性损失组成的损失函数。我们的框架不需要风格转换的离线训练阶段(这一直是现有方法的主要挑战之一),网络只需在测试期间学习。实验结果表明,该框架对不可见图像对具有更好的泛化能力,甚至优于现有的方法。 摘要:Recent techniques to solve photorealistic style transfer within deep convolutional neural networks (CNNs) generally require intensive training from large-scale datasets, thus having limited applicability and poor generalization ability to unseen images or styles. To overcome this, we propose a novel framework, dubbed Deep Translation Prior (DTP), to accomplish photorealistic style transfer through test-time training on given input image pair with untrained networks, which learns an image pair-specific translation prior and thus yields better performance and generalization. Tailored for such test-time training for style transfer, we present novel network architectures, with two sub-modules of correspondence and generation modules, and loss functions consisting of contrastive content, style, and cycle consistency losses. Our framework does not require offline training phase for style transfer, which has been one of the main challenges in existing methods, but the networks are to be solely learned during test-time. Experimental results prove that our framework has a better generalization ability to unseen image pairs and even outperforms the state-of-the-art methods.

【13】 Sidewalk Measurements from Satellite Images: Preliminary Findings Link: https://arxiv.org/abs/2112.06120

Authors: Maryam Hosseini, Iago B. Araujo, Hamed Yazdanpanah, Eric K. Tokuda, Fabio Miranda, Claudio T. Silva, Roberto M. Cesar Jr
Affiliations: New York University (NYU), New York, NY, United States; University of São Paulo (USP), São Paulo, SP, Brazil; University of Illinois at Chicago (UIC), Chicago, IL, United States; Rutgers University, Newark, NJ, United States
Abstract: Large-scale analysis of pedestrian infrastructure, particularly sidewalks, is critical to human-centric urban planning and design. Benefiting from the rich set of planimetric features and high-resolution orthoimages provided through the New York City Open Data portal, we train a computer vision model to detect sidewalks, roads, and buildings from remote-sensing imagery, achieving 83% mIoU on a held-out test set. We apply shape-analysis techniques to study different attributes of the extracted sidewalks. More specifically, we conduct a tile-wise analysis of the width, angle, and curvature of sidewalks, which, aside from their general impact on the walkability and accessibility of urban areas, are known to play a significant role in the mobility of wheelchair users. The preliminary results are promising, hinting at the potential of the proposed approach to be adopted in different cities and enabling researchers and practitioners to form a more vivid picture of the pedestrian realm.
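The 83% figure is a mean intersection-over-union, i.e., the per-class IoU averaged over the segmentation classes. A generic sketch of the metric is below; the four-class indexing (e.g., sidewalk/road/building/background) is illustrative, not the paper's exact evaluation code.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes for label maps `pred` and `gt` of equal shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Usage on random label maps, just to show the call:
pred = np.random.randint(0, 4, size=(512, 512))
gt = np.random.randint(0, 4, size=(512, 512))
print(mean_iou(pred, gt, num_classes=4))
```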

【14】 Stereoscopic Universal Perturbations across Different Architectures and Datasets Link: https://arxiv.org/abs/2112.06116

Authors: Zachary Berger, Parth Agrawal, Tian Yu Liu, Stefano Soatto, Alex Wong
Affiliations: UCLA Vision Lab
Abstract: We study the effect of adversarial image perturbations on deep stereo matching networks for the disparity estimation task. We present a method to craft a single set of perturbations that, when added to any stereo image pair in a dataset, can fool a stereo network into significantly altering the perceived scene geometry. Our perturbation images are "universal" in that they not only corrupt the network's estimates on the dataset they are optimized for, but also generalize to stereo networks with different architectures across different datasets. We evaluate our approach on multiple public benchmark datasets and show that our perturbations can increase the D1-error (akin to a fooling rate) of state-of-the-art stereo networks from 1% to as much as 87%. We investigate the effect of the perturbations on the estimated scene geometry and identify the object classes that are most vulnerable. Our analysis of the activations of registered points between left and right images led us to find that certain architectural components, namely deformable convolution and explicit matching, can increase robustness against adversaries. We demonstrate that by simply designing networks with such components, one can reduce the effect of adversaries by up to 60.5%, rivaling the robustness of networks fine-tuned with costly adversarial data augmentation.
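Schematically, a universal perturbation of this kind can be obtained by ascending the disparity error with respect to a single noise pair shared by every image pair in the dataset. The sketch below shows that generic recipe; the loss, L-infinity projection, resolution, network signature, and optimizer schedule are all assumptions for illustration and are not the authors' exact procedure.

```python
import torch

def craft_universal_perturbation(stereo_net, loader, eps=0.02, steps=1000, lr=1e-3):
    """Optimize one (left, right) perturbation pair shared across a dataset.

    `loader` is assumed to yield (left, right, disp_gt) batches at a fixed
    resolution; `stereo_net(left, right)` is assumed to return a disparity map.
    """
    h, w = 256, 512  # assumed working resolution
    delta_l = torch.zeros(1, 3, h, w, requires_grad=True)
    delta_r = torch.zeros(1, 3, h, w, requires_grad=True)
    opt = torch.optim.Adam([delta_l, delta_r], lr=lr)
    for step, (left, right, disp_gt) in zip(range(steps), loader):
        disp = stereo_net(left + delta_l, right + delta_r)
        loss = -torch.nn.functional.l1_loss(disp, disp_gt)  # maximize error
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():  # project back into the L-inf ball of radius eps
            delta_l.clamp_(-eps, eps)
            delta_r.clamp_(-eps, eps)
    return delta_l.detach(), delta_r.detach()
```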

【15】 Early Stopping for Deep Image Prior Link: https://arxiv.org/abs/2112.06074

Authors: Hengkang Wang, Taihui Li, Zhong Zhuang, Tiancong Chen, Hengyue Liang, Ju Sun
Affiliations: Department of Computer Science and Engineering, and Department of Electrical and Computer Engineering, University of Minnesota, Twin Cities
Abstract: Deep image prior (DIP) and its variants have shown remarkable potential for solving inverse problems in computer vision without any extra training data. Practical DIP models are often substantially overparameterized. During the fitting process, these models learn mostly the desired visual content first, and then pick up the potential modeling and observational noise, i.e., they overfit. Thus, the practicality of DIP often depends critically on good early stopping (ES) that captures the transition period. In this regard, the majority of DIP works for vision tasks only demonstrate the potential of the models -- reporting peak performance against the ground truth -- but provide no clue as to how to operationally obtain near-peak performance without access to the ground truth. In this paper, we set out to break this practicality barrier of DIP and propose an efficient ES strategy that consistently detects near-peak performance across several vision tasks and DIP variants. Based on a simple measure of the dispersion of consecutive DIP reconstructions, our ES method not only outpaces existing ones -- which only work in very narrow domains -- but also remains effective when combined with a number of methods that try to mitigate overfitting. The code is available at https://github.com/sun-umn/Early_Stopping_for_DIP.
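A minimal sketch of dispersion-based early stopping follows: track a running window of reconstructions, use their variance as the dispersion measure, and stop once that measure has not hit a new minimum for a while. The specific window size, patience, and variance-as-dispersion choice are our assumptions, not necessarily the paper's exact measure.

```python
import numpy as np
from collections import deque

class DipEarlyStopper:
    """Stop DIP fitting near the minimum dispersion of recent reconstructions."""

    def __init__(self, window=20, patience=100):
        self.buf = deque(maxlen=window)  # last `window` reconstructions
        self.best = np.inf
        self.since_best = 0
        self.patience = patience

    def step(self, recon: np.ndarray) -> bool:
        """Feed the current reconstruction; return True when fitting should stop."""
        self.buf.append(recon)
        if len(self.buf) < self.buf.maxlen:
            return False  # not enough history yet
        # Per-pixel variance across the window, averaged into one scalar.
        dispersion = np.stack(self.buf).var(axis=0).mean()
        if dispersion < self.best:
            self.best, self.since_best = dispersion, 0
        else:
            self.since_best += 1
        return self.since_best >= self.patience

# Usage inside a DIP loop: `if stopper.step(current_recon): break`
```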

【16】 Object Counting: You Only Need to Look at One Link: https://arxiv.org/abs/2112.05993

Authors: Hui Lin, Xiaopeng Hong, Yabin Wang
Affiliations: School of Cyber Science and Engineering, Xi'an Jiaotong University, China
Note: Keywords: crowd counting, one-shot object counting, attention
Abstract: This paper aims to tackle the challenging task of one-shot object counting. Given an image containing objects of a novel, previously unseen category, the goal of the task is to count all instances of the desired category using only one supporting bounding-box example. To this end, we propose a counting model with which you only need to Look At One instance (LaoNet). First, a feature-correlation module combines self-attention and correlative-attention modules to learn both inner-relations and inter-relations. It makes the network robust to inconsistencies in rotation and size among different instances. Second, a scale-aggregation mechanism is designed to help extract features with different scale information. Compared with existing few-shot counting methods, LaoNet achieves state-of-the-art results while learning with a high convergence speed. The code will be available soon.
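The core of correlative attention here is cross-attending image features to features of the single support crop, then reading a count off a density map. The sketch below shows that idea in generic PyTorch; the dimensions, the `MultiheadAttention` stand-in, and the linear density head are assumptions for illustration, not LaoNet's actual architecture.

```python
import torch
import torch.nn as nn

class CorrelativeAttention(nn.Module):
    """Cross-attend image tokens to exemplar tokens, then sum a density map."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.density_head = nn.Sequential(nn.Linear(dim, 1), nn.ReLU())

    def forward(self, img_tokens, exemplar_tokens):
        # img_tokens: (B, N, C) flattened image features;
        # exemplar_tokens: (B, M, C) features from the support bounding box.
        fused, _ = self.attn(img_tokens, exemplar_tokens, exemplar_tokens)
        density = self.density_head(fused)  # (B, N, 1) per-location density
        return density.sum(dim=(1, 2))      # predicted count per image

model = CorrelativeAttention()
count = model(torch.rand(2, 64 * 64, 256), torch.rand(2, 49, 256))
print(count.shape)  # torch.Size([2])
```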

【17】 Overview of The MediaEval 2021 Predicting Media Memorability Task Link: https://arxiv.org/abs/2112.05982

Authors: Rukiye Savran Kiziltepe, Mihai Gabriel Constantin, Claire-Helene Demarty, Graham Healy, Camilo Fosco, Alba Garcia Seco de Herrera, Sebastian Halder, Bogdan Ionescu, Ana Matran-Fernandez, Alan F. Smeaton, Lorin Sweeney
Affiliations: University of Essex, UK; University Politehnica of Bucharest, Romania; InterDigital, France; Dublin City University, Ireland; Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Note: 3 pages, to appear in Proceedings of MediaEval 2021, December 13-15, 2021, Online
Abstract: This paper describes the MediaEval 2021 Predicting Media Memorability task, which is in its 4th edition this year, as the prediction of short-term and long-term video memorability remains a challenging task. In 2021, two video datasets are used: first, a subset of the TRECVid 2019 Video-to-Text dataset; second, the Memento10K dataset, in order to provide opportunities to explore cross-dataset generalisation. In addition, an Electroencephalography (EEG)-based prediction pilot subtask is introduced. In this paper, we outline the main aspects of the task and describe the datasets, evaluation metrics, and requirements for participants' submissions.

【18】 SLOSH: Set LOcality Sensitive Hashing via Sliced-Wasserstein Embeddings Link: https://arxiv.org/abs/2112.05872

Authors: Yuzhe Lu, Xinran Liu, Andrea Soltoggio, Soheil Kolouri
Affiliations: Computer Science Department, Vanderbilt University, Nashville, TN; School of Computer Science, Loughborough University, Leicestershire, UK
Abstract: Learning from set-structured data is an essential problem with many applications in machine learning and computer vision. This paper focuses on non-parametric and data-independent learning from set-structured data using approximate nearest neighbor (ANN) solutions, particularly locality-sensitive hashing. We consider the problem of set retrieval from an input set query. Such a retrieval problem requires: 1) an efficient mechanism to calculate the distances/dissimilarities between sets, and 2) an appropriate data structure for fast nearest-neighbor search. To that end, we propose Sliced-Wasserstein set embedding as a computationally efficient "set-2-vector" mechanism that enables downstream ANN with theoretical guarantees. The set elements are treated as samples from an unknown underlying distribution, and the Sliced-Wasserstein distance is used to compare sets. We demonstrate the effectiveness of our algorithm, denoted Set-LOcality Sensitive Hashing (SLOSH), on various set-retrieval datasets, compare our proposed embedding with standard set-embedding approaches, including Generalized Mean (GeM) embedding/pooling, Featurewise Sort Pooling (FSPool), and Covariance Pooling, and show consistent improvement in retrieval results. The code for replicating our results is available at https://github.com/mint-vu/SLOSH.
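The "set-2-vector" mechanism can be sketched in a few lines: project all set elements onto shared random directions, sort each 1-D projection (the empirical quantile function), and resample it at a fixed number of quantiles so that sets of any size map to vectors of equal length. Euclidean distance between these vectors then approximates the sliced-Wasserstein distance (uniform-weight case), after which any standard Euclidean LSH scheme applies. The interpolation detail below is our assumption for illustration.

```python
import numpy as np

def sw_embed(points: np.ndarray, directions: np.ndarray, m: int) -> np.ndarray:
    """Sliced-Wasserstein-style set embedding: (n, dim) set -> (L*m,) vector."""
    embeds = []
    for d in directions:               # directions: (L, dim), unit norm
        proj = np.sort(points @ d)     # sorted 1-D projection = quantile function
        qs = np.linspace(0, 1, m)
        embeds.append(np.interp(qs, np.linspace(0, 1, len(proj)), proj))
    return np.concatenate(embeds)      # fixed length regardless of set size

# Usage: two sets of different sizes become comparable fixed-length vectors.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(8, 128))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
set_a, set_b = rng.normal(size=(300, 128)), rng.normal(size=(450, 128))
dist = np.linalg.norm(sw_embed(set_a, dirs, 32) - sw_embed(set_b, dirs, 32))
print(dist)  # this Euclidean distance can now feed any standard LSH index
```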

【19】 Deep ViT Features as Dense Visual Descriptors Link: https://arxiv.org/abs/2112.05814

Authors: Shir Amir, Yossi Gandelsman, Shai Bagon, Tali Dekel
Affiliations: ⋆Weizmann AI Center (WAIC), Dept. of Computer Science and Applied Math, The Weizmann Inst. of Science; †Berkeley Artificial Intelligence Research (BAIR)
Abstract: We leverage deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors. We demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties: (i) the features encode powerful high-level information at high spatial resolution -- i.e., they capture semantic object parts at fine spatial granularity -- and (ii) the encoded semantic information is shared across related, yet different, object categories (i.e., super-categories). These properties allow us to design powerful dense ViT descriptors that facilitate a variety of applications, including co-segmentation, part co-segmentation, and correspondences -- all achieved by applying lightweight methodologies (e.g., binning/clustering) to deep ViT features. We take these applications further into the realm of inter-class tasks, demonstrating how objects from related categories can be commonly segmented into semantic parts under significant pose and appearance changes. Our methods, evaluated extensively both qualitatively and quantitatively, achieve state-of-the-art part co-segmentation results, and results competitive with recent supervised methods trained specifically for co-segmentation and correspondence.
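The "lightweight methodology on deep features" recipe is easy to sketch: extract DINO-ViT patch tokens as dense descriptors and cluster them. The `dino_vits8` entry point exists in the facebookresearch/dino torch.hub repository; the k-means step below is a simplified stand-in for the paper's co-segmentation pipeline, and the cluster count and random input are illustrative.

```python
import torch
from sklearn.cluster import KMeans

# Load a self-supervised DINO ViT-S/8 backbone via torch.hub.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

img = torch.rand(1, 3, 224, 224)  # normally an ImageNet-normalized image
with torch.no_grad():
    # Normalized output of the last block: (1, 1+N, C), CLS token first.
    tokens = model.get_intermediate_layers(img, n=1)[0]
patches = tokens[0, 1:]  # drop CLS -> (N, C) dense descriptors, N = (224/8)^2
labels = KMeans(n_clusters=4, n_init=10).fit_predict(patches.numpy())
seg = labels.reshape(28, 28)  # coarse segment/part map at patch resolution
```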

【20】 DPICT: Deep Progressive Image Compression Using Trit-Planes Link: https://arxiv.org/abs/2112.06334

Authors: Jae-Han Lee, Seungmin Jeon, Kwang Pyo Choi, Youngo Park, Chang-Su Kim
Affiliations: Korea University; Samsung Electronics
Note: 10 pages, 15 figures
Abstract: We propose the deep progressive image compression using trit-planes (DPICT) algorithm, which is the first learning-based codec supporting fine granular scalability (FGS). First, we transform an image into a latent tensor using an analysis network. Then, we represent the latent tensor in ternary digits (trits) and encode it into a compressed bitstream trit-plane by trit-plane, in decreasing order of significance. Moreover, within each trit-plane, we sort the trits according to their rate-distortion priorities and transmit the more important information first. Since the compression network is less optimized for the cases using fewer trit-planes, we develop a postprocessing network for refining reconstructed images at low rates. Experimental results show that DPICT significantly outperforms conventional progressive codecs while enabling FGS transmission.
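The trit-plane decomposition itself is simple: plane k holds the k-th most significant ternary digit of every latent value, so transmitting planes in order refines the reconstruction progressively. The sketch below shows that decomposition on a toy integer tensor; quantization of the real latent tensor to non-negative integers, rate-distortion sorting within each plane, and entropy coding are all omitted.

```python
import numpy as np

def to_trit_planes(latent: np.ndarray, num_planes: int) -> list:
    """Decompose a non-negative integer tensor into ternary digit planes,
    most significant plane first."""
    return [(latent // 3 ** k) % 3 for k in reversed(range(num_planes))]

def from_trit_planes(planes: list, total_planes: int) -> np.ndarray:
    """Reconstruct from the leading planes received so far (fine scalability)."""
    value = np.zeros_like(planes[0])
    for p in planes:
        value = value * 3 + p
    # Scale up for the low-order planes that were not (yet) received.
    return value * 3 ** (total_planes - len(planes))

x = np.array([[14, 5], [26, 0]])
planes = to_trit_planes(x, num_planes=3)           # 3 planes cover values < 27
assert (from_trit_planes(planes, 3) == x).all()    # all planes -> exact
print(from_trit_planes(planes[:2], 3))             # 2 planes -> coarse estimate
```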
