cs.CV 方向,今日共计61篇
Transformer(2篇)
【1】 Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval 标题:一举多得--用于视频检索的多模态融合Transformer 链接:https://arxiv.org/abs/2112.04446
作者:Nina Shvetsova,Brian Chen,Andrew Rouditchenko,Samuel Thomas,Brian Kingsbury,Rogerio Feris,David Harwath,James Glass,Hilde Kuehne 摘要:最近,视频数据的多模式学习受到了越来越多的关注,因为它可以在不需要人工注释的情况下训练语义上有意义的嵌入,从而实现Zero-Shot检索和分类等任务。在这项工作中,我们提出了一种多模态、模态不可知的融合变换方法,该方法学习在多模态(如视频、音频和文本)之间交换信息,并将它们集成到联合的多模态表示中,以获得聚合多模态时间信息的嵌入。我们建议在训练系统时,同时对所有东西进行组合损失,包括单个模态和成对模态,明确排除任何附加项,如位置或模态编码。在测试时,生成的模型可以处理和融合任意数量的输入模式。此外,Transformer的隐式特性允许处理不同长度的输入。为了评估所提出的方法,我们在大规模HowTo100M数据集上训练模型,并在四个具有挑战性的基准数据集上评估生成的嵌入空间,从而获得Zero-Shot视频检索和Zero-Shot视频动作定位的最新结果。 摘要:Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joined multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
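下面给出"对单模态与成对模态的所有组合同时计算对比损失"这一思路的最小示意代码(非官方实现;融合方式、温度参数 temperature 以及各组合的取法均为假设),仅用于演示如何把不同模态组合放进同一个 InfoNCE 式目标。

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """对称 InfoNCE:a、b 形状均为 (batch, dim),同一行互为正样本。"""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def combinatorial_loss(v, a, t, fuse):
    """v/a/t: 视频、音频、文本嵌入 (batch, dim);fuse: 融合模块(假设的接口)。
    同时对单模态及成对模态的组合计算对比损失。"""
    pairs = [
        (v, fuse(a, t)),            # 视频 vs 音频+文本融合
        (a, fuse(v, t)),
        (t, fuse(v, a)),
        (v, a), (v, t), (a, t),     # 单模态两两对比
    ]
    return sum(info_nce(x, y) for x, y in pairs) / len(pairs)

# 用法示意:fuse 可以是任何把两个嵌入融合为一个嵌入的模块,这里用简单相加代替融合 Transformer
fuse = lambda x, y: F.normalize(x + y, dim=-1)
v, a, t = (torch.randn(8, 256) for _ in range(3))
print(combinatorial_loss(v, a, t, fuse).item())
```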
【2】 Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents 标题:基于Transformer的历史文献手写文本与命名实体联合识别方法 链接:https://arxiv.org/abs/2112.04189
作者:Ahmed Cheikh Rouhoua,Marwa Dhiaf,Yousri Kessentini,Sinda Ben Salem 备注:None 摘要:手写文档中命名实体的相关信息提取仍然是一项具有挑战性的任务。与传统的信息提取方法不同,传统的信息提取方法通常将文本转录和命名实体识别作为单独的后续任务,本文提出了一种基于端到端转换器的方法来联合执行这两项任务。拟议办法在段落一级运作,这带来两个主要好处。首先,它允许模型避免由于线分割而导致的不可恢复的早期错误。其次,它允许模型利用更大的二维上下文信息来识别语义类别,从而达到更高的最终预测精度。我们还探索了不同的训练场景来展示它们对性能的影响,并且我们证明了两阶段学习策略可以使模型达到更高的最终预测精度。据我们所知,这项工作提出了第一种方法,采用Transformer网络的命名实体识别手写文件。我们使用ESPOSALES数据库在2017年ICDAR信息提取竞赛中为完成任务取得了最新的技术水平,尽管所提出的技术不使用任何词典、语言建模或后处理。 摘要:The extraction of relevant information carried out by named entities in handwriting documents is still a challenging task. Unlike traditional information extraction approaches that usually face text transcription and named entity recognition as separate subsequent tasks, we propose in this paper an end-to-end transformer-based approach to jointly perform these two tasks. The proposed approach operates at the paragraph level, which brings two main benefits. First, it allows the model to avoid unrecoverable early errors due to line segmentation. Second, it allows the model to exploit larger bi-dimensional context information to identify the semantic categories, reaching a higher final prediction accuracy. We also explore different training scenarios to show their effect on the performance and we demonstrate that a two-stage learning strategy can make the model reach a higher final prediction accuracy. As far as we know, this work presents the first approach that adopts the transformer networks for named entity recognition in handwritten documents. We achieve the new state-of-the-art performance in the ICDAR 2017 Information Extraction competition using the Esposalles database, for the complete task, even though the proposed technique does not use any dictionaries, language modeling, or post-processing.
检测相关(8篇)
【1】 Contrastive Learning with Large Memory Bank and Negative Embedding Subtraction for Accurate Copy Detection 标题:基于大记忆库和负嵌入减法的精确拷贝检测对比学习 链接:https://arxiv.org/abs/2112.04323
作者:Shuhei Yokoo 摘要:复制检测是一个尚未解决的问题,它是一项确定图像是否是数据库中任何图像的修改副本的任务。因此,我们通过对比学习训练卷积神经网络(CNN)来解决拷贝检测问题。使用大型内存库和硬数据扩充进行训练使CNN能够获得更具辨别力的表示。我们提出的负嵌入减法进一步提高了拷贝检测的准确性。使用我们的方法,我们在Facebook AI图像相似性挑战:描述符跟踪中获得了第一名。我们的代码在此处公开:\url{https://github.com/lyakaap/ISC21-Descriptor-Track-1st} 摘要:Copy detection, which is a task to determine whether an image is a modified copy of any image in a database, is an unsolved problem. Thus, we addressed copy detection by training convolutional neural networks (CNNs) with contrastive learning. Training with a large memory-bank and hard data augmentation enables the CNNs to obtain more discriminative representation. Our proposed negative embedding subtraction further boosts the copy detection accuracy. Using our methods, we achieved 1st place in the Facebook AI Image Similarity Challenge: Descriptor Track. Our code is publicly available here: \url{https://github.com/lyakaap/ISC21-Descriptor-Track-1st}
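下面是"负嵌入减法"这一描述子后处理思路的最小示意(非官方实现;近邻数 k、系数 beta 以及加权方式均为假设):从查询描述子中减去其在负样本集合中最近邻嵌入的加权和,再重新归一化,以抑制与拷贝无关的共性成分。

```python
import numpy as np

def negative_embedding_subtraction(desc, negatives, k=10, beta=0.35):
    """desc: (n, d) 查询描述子;negatives: (m, d) 负样本(训练集)描述子,均已 L2 归一化。
    对每个描述子减去其 top-k 最近负样本的加权和,再重新归一化。"""
    sims = desc @ negatives.T                       # 余弦相似度 (n, m)
    idx = np.argsort(-sims, axis=1)[:, :k]          # 每行取 top-k 负样本
    for i in range(desc.shape[0]):
        neg = negatives[idx[i]]                     # (k, d)
        w = sims[i, idx[i]][:, None]                # 以相似度作为权重
        desc[i] -= beta * (w * neg).sum(axis=0)
    return desc / np.linalg.norm(desc, axis=1, keepdims=True)

# 用法示意
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
neg = rng.normal(size=(1000, 128)); neg /= np.linalg.norm(neg, axis=1, keepdims=True)
print(negative_embedding_subtraction(q, neg).shape)   # (4, 128)
```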
【2】 GCA-Net : Utilizing Gated Context Attention for Improving Image Forgery Localization and Detection 标题:GCA-Net:利用门控上下文注意力改进图像伪造定位与检测 链接:https://arxiv.org/abs/2112.04298
作者:Sowmen Das,Md. Saiful Islam,Md. Ruhul Amin 摘要:法医分析依赖于从操纵图像中识别隐藏的痕迹。传统的神经网络在这项任务中失败,因为它们无法处理特征衰减和对主要空间特征的依赖。在这项工作中,我们提出了一种新的门控上下文注意网络(GCA-Net),它利用非局部注意块进行全局上下文学习。此外,我们利用门控注意机制和密集的解码器网络,在解码阶段引导相关特征流,从而实现精确定位。所提出的注意框架允许网络通过过滤粗略特征来关注相关区域。此外,通过利用多尺度特征融合和有效的学习策略,GCA网络可以更好地处理操纵区域的尺度变化。我们表明,我们的方法在多个基准数据集上的平均AUC为4.2%-5.4%,优于最先进的网络。最后,我们还进行了大量的烧蚀实验,以证明该方法对图像取证的鲁棒性。 摘要:Forensic analysis depends on the identification of hidden traces from manipulated images. Traditional neural networks fail in this task because of their inability in handling feature attenuation and reliance on the dominant spatial features. In this work we propose a novel Gated Context Attention Network (GCA-Net) that utilizes the non-local attention block for global context learning. Additionally, we utilize a gated attention mechanism in conjunction with a dense decoder network to direct the flow of relevant features during the decoding phase, allowing for precise localization. The proposed attention framework allows the network to focus on relevant regions by filtering the coarse features. Furthermore, by utilizing multi-scale feature fusion and efficient learning strategies, GCA-Net can better handle the scale variation of manipulated regions. We show that our method outperforms state-of-the-art networks by an average of 4.2%-5.4% AUC on multiple benchmark datasets. Lastly, we also conduct extensive ablation experiments to demonstrate the method's robustness for image forensics.
【3】 A Hierarchical Spatio-Temporal Graph Convolutional Neural Network for Anomaly Detection in Videos 标题:一种用于视频异常检测的分层时空图卷积神经网络 链接:https://arxiv.org/abs/2112.04294
作者:Xianlin Zeng,Yalong Jiang,Wenrui Ding,Hongguang Li,Yafeng Hao,Zifeng Qiu 摘要:深度学习模型已广泛应用于监控视频中的异常检测。典型模型具有重建正常视频和评估异常视频重建错误的能力,以指示异常的程度。然而,现有的方法有两个缺点。首先,它们只能独立地编码每个身份的运动,而不考虑身份之间的相互作用,这也可能表示异常。其次,它们利用了结构在不同场景下固定的不灵活模型,这种配置禁用了对场景的理解。在本文中,我们提出了一种层次时空图卷积神经网络(HSTGCNN)来解决这些问题,HSTGCNN由多个分支组成,这些分支对应于不同级别的图表示。高级图形表示编码人的轨迹和多个身份之间的交互,而低级图形表示编码每个人的局部身体姿势。此外,我们建议加权组合多个在不同场景下更好的分支。通过这种方式实现了对单级图表示的改进。对场景的理解有助于异常检测。高级图形表示被赋予更高的权重,用于在低分辨率视频中编码人的移动速度和方向,而低级图形表示被赋予更高的权重,用于在高分辨率视频中编码人的骨骼。实验结果表明,所提出的HSTGCNN在四个基准数据集(UCSD行人、上海理工大学、香港中文大学大道和IITB走廊)上,通过使用较少的可学习参数,显著优于当前最先进的模型。 摘要:Deep learning models have been widely used for anomaly detection in surveillance videos. Typical models are equipped with the capability to reconstruct normal videos and evaluate the reconstruction errors on anomalous videos to indicate the extent of abnormalities. However, existing approaches suffer from two disadvantages. Firstly, they can only encode the movements of each identity independently, without considering the interactions among identities which may also indicate anomalies. Secondly, they leverage inflexible models whose structures are fixed under different scenes, this configuration disables the understanding of scenes. In this paper, we propose a Hierarchical Spatio-Temporal Graph Convolutional Neural Network (HSTGCNN) to address these problems, the HSTGCNN is composed of multiple branches that correspond to different levels of graph representations. High-level graph representations encode the trajectories of people and the interactions among multiple identities while low-level graph representations encode the local body postures of each person. Furthermore, we propose to weightedly combine multiple branches that are better at different scenes. An improvement over single-level graph representations is achieved in this way. An understanding of scenes is achieved and serves anomaly detection. High-level graph representations are assigned higher weights to encode moving speed and directions of people in low-resolution videos while low-level graph representations are assigned higher weights to encode human skeletons in high-resolution videos. Experimental results show that the proposed HSTGCNN significantly outperforms current state-of-the-art models on four benchmark datasets (UCSD Pedestrian, ShanghaiTech, CUHK Avenue and IITB-Corridor) by using much less learnable parameters.
【4】 Do Pedestrians Pay Attention? Eye Contact Detection in the Wild 标题:行人注意到了吗?野外的眼神接触检测 链接:https://arxiv.org/abs/2112.04212
作者:Younes Belkada,Lorenzo Bertoni,Romain Caristan,Taylor Mordan,Alexandre Alahi 备注:Project website: this https URL 摘要:在城市或拥挤的环境中,人类依靠眼神交流与附近的人进行快速有效的沟通。自治代理还需要检测目光接触,以便与行人互动并在行人周围安全导航。在本文中,我们重点关注野外的目光接触检测,即对环境或行人距离没有控制的自动驾驶车辆的真实场景。我们介绍了一个利用语义关键点检测眼神接触的模型,并表明这种高级表示(i)在公开可用的数据集JAAD上实现了最先进的结果,(ii)比在端到端网络中利用原始图像具有更好的泛化特性。为了研究领域适应,我们创建了LOOK:一个用于野外眼睛接触检测的大规模数据集,它关注于各种各样的、不受约束的场景,以实现真实世界的泛化。源代码和LOOK数据集公开共享,以实现开放科学任务。 摘要:In urban or crowded environments, humans rely on eye contact for fast and efficient communication with nearby people. Autonomous agents also need to detect eye contact to interact with pedestrians and safely navigate around them. In this paper, we focus on eye contact detection in the wild, i.e., real-world scenarios for autonomous vehicles with no control over the environment or the distance of pedestrians. We introduce a model that leverages semantic keypoints to detect eye contact and show that this high-level representation (i) achieves state-of-the-art results on the publicly-available dataset JAAD, and (ii) conveys better generalization properties than leveraging raw images in an end-to-end network. To study domain adaptation, we create LOOK: a large-scale dataset for eye contact detection in the wild, which focuses on diverse and unconstrained scenarios for real-world generalization. The source code and the LOOK dataset are publicly shared towards an open science mission.
【5】 Presentation Attack Detection Methods based on Gaze Tracking and Pupil Dynamic: A Comprehensive Survey 标题:基于凝视跟踪和瞳孔动态的呈现攻击检测方法综述 链接:https://arxiv.org/abs/2112.04038
作者:Jalil Nourmohammadi Khiarak 摘要:研究目的:在生物特征识别领域,可见的人类特征在移动设备上的验证和识别中非常流行且可行。然而,冒名顶替者可以通过伪造和人工生成的生物特征来欺骗系统、冒充这些特征。可见生物特征识别系统面临呈现攻击(presentation attack)带来的高安全风险。方法:与此同时,基于挑战的方法,特别是凝视跟踪和瞳孔动态,似乎比其他非接触式生物识别方法更安全。我们回顾了探索凝视跟踪和瞳孔动态活体检测的现有工作。主要结果:本研究分析了凝视跟踪和瞳孔动态呈现攻击的各个方面,如最先进的活体检测算法、各类伪造物、公共数据库的可获取性,以及该领域的标准化概况。此外,我们还讨论了未来工作以及构建基于挑战系统的安全活体检测所面临的开放挑战。 摘要:Purpose of the research: In the biometric community, visible human characteristics are popular and viable for verification and identification on mobile devices. However, imposters are able to spoof such characteristics by creating fake and artificial biometrics to fool the system. Visible biometric systems have suffered a high-security risk of presentation attack. Methods: In the meantime, challenge-based methods, in particular, gaze tracking and pupil dynamic appear to be more secure methods than others for contactless biometric systems. We review the existing work that explores gaze tracking and pupil dynamic liveness detection. The principal results: This research analyzes various aspects of gaze tracking and pupil dynamic presentation attacks, such as state-of-the-art liveness detection algorithms, various kinds of artifacts, the accessibility of public databases, and a summary of standardization in this area. In addition, we discuss future work and the open challenges to creating a secure liveness detection based on challenge-based systems.
【6】 A Robust Completed Local Binary Pattern (RCLBP) for Surface Defect Detection 标题:一种用于表面缺陷检测的鲁棒完全局部二值模式(RCLBP) 链接:https://arxiv.org/abs/2112.04021
作者:Nana Kankam Gyimah,Abenezer Girma,Mahmoud Nabil Mahmoud,Shamila Nateghi,Abdollah Homaifar,Daniel Opoku 备注:Accepted to IEEE SMC 2021 as a special invited session paper 摘要:在本文中,我们提出了一个鲁棒的完成局部二值模式(RCLBP)框架,用于表面缺陷检测任务。该方法将非局部(NL)均值滤波与小波阈值和完全局部二值模式(CLBP)相结合,提取鲁棒特征,并将其输入分类器进行表面缺陷检测。本文将三个部分结合起来:建立了一种基于非局部(NL)均值滤波的小波阈值去噪技术,在保留纹理和边缘的同时对噪声图像进行去噪。其次,利用CLBP技术提取鉴别特征。最后,将识别特征输入分类器,建立检测模型,并对该框架的性能进行评估。使用东北大学(NEU)的真实钢表面缺陷数据库评估了缺陷检测模型的性能。实验结果表明,所提出的RCLBP方法具有较强的噪声鲁棒性,适用于不同类内、类间变化条件下以及光照变化条件下的表面缺陷检测。 摘要:In this paper, we present a Robust Completed Local Binary Pattern (RCLBP) framework for a surface defect detection task. Our approach uses a combination of Non-Local (NL) means filter with wavelet thresholding and Completed Local Binary Pattern (CLBP) to extract robust features which are fed into classifiers for surface defects detection. This paper combines three components: A denoising technique based on Non-Local (NL) means filter with wavelet thresholding is established to denoise the noisy image while preserving the textures and edges. Second, discriminative features are extracted using the CLBP technique. Finally, the discriminative features are fed into the classifiers to build the detection model and evaluate the performance of the proposed framework. The performance of the defect detection models are evaluated using a real-world steel surface defect database from Northeastern University (NEU). Experimental results demonstrate that the proposed approach RCLBP is noise robust and can be applied for surface defect detection under varying conditions of intra-class and inter-class changes and with illumination changes.
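下面用 scikit-image 给出"NL 均值与小波阈值去噪后提取 LBP 特征"这一流水线的最小示意(非官方实现;此处用普通 uniform LBP 代替论文中的 CLBP,去噪参数与组合方式均为假设)。

```python
import numpy as np
from skimage import data, util
from skimage.restoration import denoise_nl_means, denoise_wavelet
from skimage.feature import local_binary_pattern

img = util.img_as_float(data.camera())                 # 示例灰度图
noisy = util.random_noise(img, var=0.01)               # 加噪模拟带噪缺陷图像

# 1) 去噪:NL 均值滤波与小波阈值去噪(此处简单串联,仅作示意)
den = denoise_nl_means(noisy, h=0.08, patch_size=5, patch_distance=6)
den = denoise_wavelet(den)

# 2) 特征:LBP 直方图(论文使用 CLBP,这里以 uniform LBP 代替作示意)
P, R = 8, 1
lbp = local_binary_pattern(den, P, R, method="uniform")
hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)

# 3) hist 即可送入 SVM 等分类器,用于缺陷/正常判别
print(hist)
```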
【7】 Scalable 3D Semantic Segmentation for Gun Detection in CT Scans 标题:CT扫描中用于枪支检测的可伸缩三维语义分割 链接:https://arxiv.org/abs/2112.03917
作者:Marius Memmel,Christoph Reich,Nicolas Wagner,Faraz Saeedan 备注:This work was part of the Project Lab Deep Learning in Computer Vision Winter Semester 2019/2020 at TU Darmstadt 摘要:随着3D数据可用性的提高,对处理这些数据的解决方案的需求也迅速增加。然而,将维度添加到已经可靠精确的2D方法中会导致巨大的内存消耗和更高的计算复杂性。这些问题导致当前硬件达到其限制,大多数方法被迫大幅降低输入分辨率。我们的主要贡献是一种新的用于行李CT扫描中枪支检测的深度3D语义分割方法,该方法能够实现高分辨率体素化体积的快速训练和低视频内存消耗。我们介绍了一种移动金字塔方法,该方法在推理时利用多个前向过程分割实例。 摘要:With the increased availability of 3D data, the need for solutions processing those also increased rapidly. However, adding dimension to already reliably accurate 2D approaches leads to immense memory consumption and higher computational complexity. These issues cause current hardware to reach its limitations, with most methods forced to reduce the input resolution drastically. Our main contribution is a novel deep 3D semantic segmentation method for gun detection in baggage CT scans that enables fast training and low video memory consumption for high-resolution voxelized volumes. We introduce a moving pyramid approach that utilizes multiple forward passes at inference time for segmenting an instance.
【8】 Which images to label for few-shot medical landmark detection? 标题:Few-Shot医学标志点检测应标注哪些图像? 链接:https://arxiv.org/abs/2112.04386
作者:Quan Quan,Qingsong Yao,Jun Li,S. Kevin Zhou 摘要:深度学习方法的成功依赖于标记良好的大规模数据集的可用性。然而,对于医学图像来说,注释如此丰富的训练数据通常需要有经验的放射科医生,并且消耗他们有限的时间。Few-Shot学习是为了减轻这一负担而开发的,它只需要几个标记数据就可以获得有竞争力的性能。然而,在少数镜头学习中,一个关键但先前被忽略的问题是在学习之前选择模板图像进行注释,这会影响最终的性能。在此,我们提出了一种新的样本选择策略(SCP)来选择“最有价值”的图像进行注释,在Few-Shot医学地标检测的背景下。SCP由三部分组成:1)构建预训练深度模型以从放射图像中提取特征的自我监督训练,2)定位信息斑块的关键点建议,以及3)搜索最具代表性样本或模板的代表性得分估计。在三个广泛使用的公共数据集上的各种实验证明了SCP的优势。对于一次性医疗标志物检测,其使用将头影测量和手X光数据集的平均径向误差分别减少14.2%(从3.595mm减少到3.083mm)和35.5%(4.114mm减少到2.653mm)。 摘要:The success of deep learning methods relies on the availability of well-labeled large-scale datasets. However, for medical images, annotating such abundant training data often requires experienced radiologists and consumes their limited time. Few-shot learning is developed to alleviate this burden, which achieves competitive performances with only several labeled data. However, a crucial yet previously overlooked problem in few-shot learning is about the selection of template images for annotation before learning, which affects the final performance. We herein propose a novel Sample Choosing Policy (SCP) to select "the most worthy" images for annotation, in the context of few-shot medical landmark detection. SCP consists of three parts: 1) Self-supervised training for building a pre-trained deep model to extract features from radiological images, 2) Key Point Proposal for localizing informative patches, and 3) Representative Score Estimation for searching the most representative samples or templates. The advantage of SCP is demonstrated by various experiments on three widely-used public datasets. For one-shot medical landmark detection, its use reduces the mean radial errors on Cephalometric and HandXray datasets by 14.2% (from 3.595mm to 3.083mm) and 35.5% (4.114mm to 2.653mm), respectively.
分类|识别相关(8篇)
【1】 Progressive Multi-stage Interactive Training in Mobile Network for Fine-grained Recognition 标题:用于细粒度识别的移动网络渐进式多阶段交互训练 链接:https://arxiv.org/abs/2112.04223
作者:Zhenxin Wu,Qingliang Chen,Yifeng Liu,Yinqi Zhang,Chengkai Zhu,Yang Yu 摘要:细粒度视觉分类(FGVC)旨在从子类别中识别对象。这是一项非常具有挑战性的任务,因为班级之间存在微妙的差异。现有的研究采用大规模卷积神经网络或视觉变换器作为特征抽取器,这在计算上非常昂贵。事实上,细粒度识别的真实场景通常需要一个更轻量级的移动网络,可以离线使用。然而,与大规模模型相比,移动网络的基本特征提取能力较弱。本文基于轻量级MobilenetV2,提出了一种基于递归马赛克生成器(RMG-PMSI)的渐进式多阶段交互式训练方法。首先,我们提出了一种递归马赛克生成器(RMG),它在不同的阶段生成不同粒度的图像。然后,不同阶段的特征通过一个多阶段交互(MSI)模块,该模块加强和补充了不同阶段的相应特征。最后,使用渐进训练(P),模型在不同阶段提取的特征可以充分利用并相互融合。在三个著名的细粒度基准测试上的实验表明,RMG-PMSI能够显著提高性能,具有良好的鲁棒性和可移植性。 摘要:Fine-grained Visual Classification (FGVC) aims to identify objects from subcategories. It is a very challenging task because of the subtle inter-class differences. Existing research applies large-scale convolutional neural networks or visual transformers as the feature extractor, which is extremely computationally expensive. In fact, real-world scenarios of fine-grained recognition often require a more lightweight mobile network that can be utilized offline. However, the fundamental mobile network feature extraction capability is weaker than large-scale models. In this paper, based on the lightweight MobilenetV2, we propose a Progressive Multi-Stage Interactive training method with a Recursive Mosaic Generator (RMG-PMSI). First, we propose a Recursive Mosaic Generator (RMG) that generates images with different granularities in different phases. Then, the features of different stages pass through a Multi-Stage Interaction (MSI) module, which strengthens and complements the corresponding features of different stages. Finally, using the progressive training (P), the features extracted by the model in different stages can be fully utilized and fused with each other. Experiments on three prestigious fine-grained benchmarks show that RMG-PMSI can significantly improve the performance with good robustness and transferability.
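下面给出"递归马赛克生成器(RMG)按不同粒度打乱图像块"这一思想的最小示意(非官方实现;粒度序列与打乱方式为假设),用于生成渐进式多阶段训练中不同阶段的输入。

```python
import torch

def mosaic(images, n):
    """把 (B, C, H, W) 图像切成 n×n 块并随机打乱块的位置,返回同尺寸图像。
    n=1 时即原图;n 越大粒度越细。"""
    b, c, h, w = images.shape
    assert h % n == 0 and w % n == 0
    ph, pw = h // n, w // n
    # 切块:(B, C, n, ph, n, pw) -> (B, n*n, C, ph, pw)
    patches = (images.reshape(b, c, n, ph, n, pw)
                     .permute(0, 2, 4, 1, 3, 5)
                     .reshape(b, n * n, c, ph, pw))
    patches = patches[:, torch.randperm(n * n)]
    # 拼回原尺寸
    return (patches.reshape(b, n, n, c, ph, pw)
                   .permute(0, 3, 1, 4, 2, 5)
                   .reshape(b, c, h, w))

# 用法示意:不同阶段可使用不同粒度 n(例如 8 -> 4 -> 2 -> 1,具体取值为假设)
x = torch.randn(2, 3, 224, 224)
for n in (8, 4, 2, 1):
    stage_input = mosaic(x, n)
    print(n, stage_input.shape)
```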
【2】 Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs 标题:先分类后定位:将视频场景图重新表述为时间二部图 链接:https://arxiv.org/abs/2112.04222
作者:Kaifeng Gao,Long Chen,Yulei Niu,Jian Shao,Jun Xiao 备注:12 pages, 8 figures 摘要:今天的VidSGG模型都是基于提议的方法,即,它们首先生成大量成对的主客体片段作为提议,然后对每个提议进行谓词分类。在本文中,我们认为这种流行的基于提议的框架有三个固有的缺点:1)提议的基本真理谓词标签部分正确。2) 它们打破了同一主语-宾语对的不同谓词实例之间的高阶关系。3) VidSGG的绩效上限取决于提案的质量。为此,我们为VidSGG提出了一个新的分类接地框架,它可以避免所有三个被忽视的缺点。同时,在该框架下,我们将视频场景图转化为时间二部图,其中实体和谓词是两类有时隙的节点,边缘表示这些节点之间不同的语义角色。这一提法充分利用了我们的新框架。因此,我们进一步提出了一种新的基于二部图的SGG模型:BIG。具体来说,BIG包括两个部分:分类阶段和基础阶段,前者旨在对所有节点和边缘的类别进行分类,后者试图定位每个关系实例的时间位置。对两个VidSGG数据集的广泛烧蚀证明了我们的框架和BIG的有效性。 摘要:Today's VidSGG models are all proposal-based methods, i.e., they first generate numerous paired subject-object snippets as proposals, and then conduct predicate classification for each proposal. In this paper, we argue that this prevalent proposal-based framework has three inherent drawbacks: 1) The ground-truth predicate labels for proposals are partially correct. 2) They break the high-order relations among different predicate instances of a same subject-object pair. 3) VidSGG performance is upper-bounded by the quality of the proposals. To this end, we propose a new classification-then-grounding framework for VidSGG, which can avoid all the three overlooked drawbacks. Meanwhile, under this framework, we reformulate the video scene graphs as temporal bipartite graphs, where the entities and predicates are two types of nodes with time slots, and the edges denote different semantic roles between these nodes. This formulation takes full advantage of our new framework. Accordingly, we further propose a novel BIpartite Graph based SGG model: BIG. Specifically, BIG consists of two parts: a classification stage and a grounding stage, where the former aims to classify the categories of all the nodes and the edges, and the latter tries to localize the temporal location of each relation instance. Extensive ablations on two VidSGG datasets have attested to the effectiveness of our framework and BIG.
【3】 Unimodal Face Classification with Multimodal Training 标题:基于多模态训练的单模态人脸分类 链接:https://arxiv.org/abs/2112.04182
作者:Wenbin Teng,Chongyang Bai 备注:Accepted by IEEE International Conference On Automatic Face and Gesture Recognition 2021 摘要:人脸识别是各种多媒体应用中的一项重要任务,如安全检查、凭证访问和运动感知游戏。但是,当输入面有噪声(例如,条件较差的RGB图像)或缺少某些信息(例如,没有颜色的3D面)时,该任务具有挑战性。在这项工作中,我们提出了一个用于鲁棒人脸分类的多模态训练单峰测试(MTUT)框架,该框架利用了训练过程中的跨模态关系,并将其作为测试过程中不完美的单模态输入的补充。从技术上讲,在训练期间,该框架(1)借助面部属性构建模态内和跨模态自动编码器,以学习作为多模态描述符的潜在嵌入,(2)提出一种新的多模态嵌入发散损失,以对齐不同模态的异质特征,这也自适应地避免了无用模态(如果有的话)混淆模型。通过这种方式,学习的自动编码器可以在测试阶段生成单模态人脸分类的鲁棒嵌入。我们在两个人脸分类数据集和两种测试输入中评估了我们的框架:(1)不良图像和(2)点云或3D人脸网格,当2D和3D模式都可用于训练时。我们的实验表明,我们的MTUT框架在两个数据集的2D和3D设置上始终优于十条基线。 摘要:Face recognition is a crucial task in various multimedia applications such as security check, credential access and motion sensing games. However, the task is challenging when an input face is noisy (e.g. poor-condition RGB image) or lacks certain information (e.g. 3D face without color). In this work, we propose a Multimodal Training Unimodal Test (MTUT) framework for robust face classification, which exploits the cross-modality relationship during training and applies it as a complementary of the imperfect single modality input during testing. Technically, during training, the framework (1) builds both intra-modality and cross-modality autoencoders with the aid of facial attributes to learn latent embeddings as multimodal descriptors, (2) proposes a novel multimodal embedding divergence loss to align the heterogeneous features from different modalities, which also adaptively avoids the useless modality (if any) from confusing the model. This way, the learned autoencoders can generate robust embeddings in single-modality face classification on test stage. We evaluate our framework in two face classification datasets and two kinds of testing input: (1) poor-condition image and (2) point cloud or 3D face mesh, when both 2D and 3D modalities are available for training. We experimentally show that our MTUT framework consistently outperforms ten baselines on 2D and 3D settings of both datasets.
【4】 Topology-aware Convolutional Neural Network for Efficient Skeleton-based Action Recognition 标题:基于拓扑感知卷积神经网络的高效骨架动作识别 链接:https://arxiv.org/abs/2112.04178
作者:Kailin Xu,Fanfan Ye,Qiaoyong Zhong,Di Xie 备注:Accepted by AAAI 2022 摘要:在基于骨架的动作识别领域,图卷积网络(GCNs)得到了迅速的发展,而卷积神经网络(CNNs)受到的关注较少。一个原因是CNN在不规则骨架拓扑建模方面被认为很差。为了缓解这一限制,本文提出了一种纯CNN结构,称为拓扑感知CNN(Ta CNN)。特别是,我们开发了一个新的跨通道特征增强模块,它是一个map和组map操作的组合。通过将该模型应用于坐标层和关节层,有效地增强了拓扑特征。值得注意的是,我们从理论上证明了当连接维数作为通道处理时,图卷积是正规卷积的一个特例。这证实了GCNs的拓扑建模能力也可以通过使用CNN来实现。此外,我们还创造性地设计了SkeletonMix策略,以独特的方式将两个人混合在一起,进一步提高了性能。在四个广泛使用的数据集上进行了大量实验,即N-UCLA、SBU、NTU RGB+D和NTU RGB+D 120,以验证Ta CNN的有效性。我们大大超过了现有的基于CNN的方法。与领先的基于GCN的方法相比,我们在所需的GFLOP和参数方面实现了可比的性能,并且复杂性大大降低。 摘要:In the context of skeleton-based action recognition, graph convolutional networks (GCNs) have been rapidly developed, whereas convolutional neural networks (CNNs) have received less attention. One reason is that CNNs are considered poor in modeling the irregular skeleton topology. To alleviate this limitation, we propose a pure CNN architecture named Topology-aware CNN (Ta-CNN) in this paper. In particular, we develop a novel cross-channel feature augmentation module, which is a combo of map-attend-group-map operations. By applying the module to the coordinate level and the joint level subsequently, the topology feature is effectively enhanced. Notably, we theoretically prove that graph convolution is a special case of normal convolution when the joint dimension is treated as channels. This confirms that the topology modeling power of GCNs can also be implemented by using a CNN. Moreover, we creatively design a SkeletonMix strategy which mixes two persons in a unique manner and further boosts the performance. Extensive experiments are conducted on four widely used datasets, i.e. N-UCLA, SBU, NTU RGB+D and NTU RGB+D 120 to verify the effectiveness of Ta-CNN. We surpass existing CNN-based methods significantly. Compared with leading GCN-based methods, we achieve comparable performance with much less complexity in terms of the required GFLOPs and parameters.
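文中提到"当把关节维度当作通道时,图卷积是普通卷积的特例"。下面用 numpy 给出一个可运行的数值验证示意(并非论文代码;邻接矩阵未做归一化,仅作说明):一层 GCN 等价于权重为邻接矩阵 A 与权重 W 的 Kronecker 积的 1×1 卷积(即一次线性映射)。

```python
import numpy as np

rng = np.random.default_rng(0)
J, C_in, C_out = 25, 3, 16            # 关节数、输入/输出通道数
A = rng.random((J, J))                # 邻接矩阵(示意,未做归一化)
X = rng.random((J, C_in))             # 关节特征
W = rng.random((C_in, C_out))         # GCN 权重

# 图卷积:Y = A X W
Y_gcn = A @ X @ W

# 把 (关节, 通道) 展平成一个 J*C_in 维的"通道"向量后,
# 一次 1x1 卷积就是一个 (J*C_out, J*C_in) 的线性映射,
# 其权重恰好是 Kronecker 积 kron(A, W^T)。
x_flat = X.reshape(-1)                              # (J*C_in,)
conv_weight = np.kron(A, W.T)                       # (J*C_out, J*C_in)
Y_conv = (conv_weight @ x_flat).reshape(J, C_out)

print(np.allclose(Y_gcn, Y_conv))                   # True:两者一致
```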
【5】 Image classifiers can not be made robust to small perturbations 标题:不能使图像分类器对小扰动具有鲁棒性 链接:https://arxiv.org/abs/2112.04033
作者:Zheng Dai,David K. Gifford 摘要:图像分类器对输入小扰动的敏感性通常被视为其结构的缺陷。我们证明了这种敏感性是分类器的一个基本属性。对于$n$-by-$n$图像集上的任意分类器,我们表明,对于除一个类别外的所有类别,与以任何$p$-标准(包括汉明距离)测量的图像空间直径相比,可以通过微小的修改来改变该类别中除一小部分以外的所有图像的分类。然后,我们研究这一现象在人类视觉感知中的表现,并讨论其对计算机视觉系统设计考虑的影响。 摘要:The sensitivity of image classifiers to small perturbations in the input is often viewed as a defect of their construction. We demonstrate that this sensitivity is a fundamental property of classifiers. For any arbitrary classifier over the set of $n$-by-$n$ images, we show that for all but one class it is possible to change the classification of all but a tiny fraction of the images in that class with a tiny modification compared to the diameter of the image space when measured in any $p$-norm, including the hamming distance. We then examine how this phenomenon manifests in human visual perception and discuss its implications for the design considerations of computer vision systems.
【6】 DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover's Distance Improves Out-Of-Distribution Face Identification 标题:DeepFace-EMD:基于图像块级推土机距离(EMD)的重排序改进分布外人脸识别 链接:https://arxiv.org/abs/2112.04016
作者:Hai Phan,Anh Nguyen 摘要:人脸识别(FI)无处不在,并驱动着执法部门的许多高风险决策。最先进的FI方法通过计算图像嵌入之间的余弦相似度来比较两幅图像。然而,这种方法对训练集或底库中未包含的新类型图像(例如,查询人脸被口罩遮挡、被裁剪或被旋转时)的分布外(out-of-distribution,OOD)泛化能力较差。在这里,我们提出了一种重排序方法,使用图像块深层空间特征上的推土机距离(Earth Mover's Distance)来比较两张人脸。我们额外的比较阶段显式地在细粒度层面(例如,眼睛对眼睛)检查图像相似性,并且比传统FI对OOD扰动和遮挡更鲁棒。有趣的是,在不微调特征提取器的情况下,我们的方法在所有测试的OOD查询(戴口罩、裁剪、旋转和对抗扰动)上都持续提高了准确率,同时在分布内图像上获得相近的结果。 摘要:Face identification (FI) is ubiquitous and drives many high-stake decisions made by law enforcement. State-of-the-art FI approaches compare two images by taking the cosine similarity between their image embeddings. Yet, such an approach suffers from poor out-of-distribution (OOD) generalization to new types of images (e.g., when a query face is masked, cropped, or rotated) not included in the training set or the gallery. Here, we propose a re-ranking approach that compares two faces using the Earth Mover's Distance on the deep, spatial features of image patches. Our extra comparison stage explicitly examines image similarity at a fine-grained level (e.g., eyes to eyes) and is more robust to OOD perturbations and occlusions than traditional FI. Interestingly, without finetuning feature extractors, our method consistently improves the accuracy on all tested OOD queries: masked, cropped, rotated, and adversarial while obtaining similar results on in-distribution images.
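下面给出"用图像块级特征之间的 EMD 做第二阶段重排序"这一思路的最小示意(非官方实现;这里用 Sinkhorn 迭代近似 EMD,块权重取均匀分布,正则系数与迭代次数均为假设)。

```python
import numpy as np

def sinkhorn_emd(fa, fb, reg=0.1, n_iter=100):
    """fa, fb: (n_patch, d) 两张人脸的块级特征(已 L2 归一化)。
    返回基于余弦距离的近似 EMD,值越小越相似。"""
    cost = 1.0 - fa @ fb.T                     # 块间余弦距离 (n, n)
    K = np.exp(-cost / reg)
    a = np.ones(fa.shape[0]) / fa.shape[0]     # 均匀块权重(论文中也可用注意力权重)
    b = np.ones(fb.shape[0]) / fb.shape[0]
    u = np.ones_like(a)
    for _ in range(n_iter):                    # Sinkhorn 迭代求近似传输计划
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = np.diag(u) @ K @ np.diag(v)
    return float((P * cost).sum())

def rerank(query_patches, gallery_patches_list, topk_idx):
    """对第一阶段(全局余弦相似度)选出的 topk 候选,用块级 EMD 重新排序。"""
    d = [sinkhorn_emd(query_patches, gallery_patches_list[i]) for i in topk_idx]
    return [topk_idx[i] for i in np.argsort(d)]

# 用法示意
rng = np.random.default_rng(0)
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
q = norm(rng.normal(size=(49, 128)))                        # 例如 7x7 个块的特征
gallery = [norm(rng.normal(size=(49, 128))) for _ in range(20)]
print(rerank(q, gallery, topk_idx=list(range(10))))
```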
【7】 Few-Shot Image Classification Along Sparse Graphs 标题:稀疏图上的Few-Shot图像分类 链接:https://arxiv.org/abs/2112.03951
作者:Joseph F Comer,Philip L Jacobson,Heiko Hoffmann 摘要:少样本(few-shot)学习仍然是一个具有挑战性的问题,在大多数真实世界数据上,单样本(1-shot)精度并不令人满意。在这里,我们从深度网络特征空间中的数据分布出发提出一个不同的视角,并展示如何将其用于少样本学习。首先,我们观察到特征空间中的最近邻大概率属于同一类,而一般而言,同一类中两个随机点之间的距离并不比来自不同类的点近多少。这一观察表明,特征空间中的类形成的是稀疏、松散连接的图,而不是密集的簇。为利用这一性质,我们建议向未标记空间进行少量标签传播,然后使用核PCA重建误差作为每个类在特征空间中数据分布的决策边界。使用这种我们称之为"K-Prop"的方法,我们证明了对于主干网络能够以较高类内最近邻概率进行训练的数据集,少样本学习性能大幅提升(例如,在RESISC45卫星图像数据集上,1-shot 5-way分类准确率达到83%)。我们在六个不同的数据集上验证了这一关系。 摘要:Few-shot learning remains a challenging problem, with unsatisfactory 1-shot accuracies for most real-world data. Here, we present a different perspective for data distributions in the feature space of a deep network and show how to exploit it for few-shot learning. First, we observe that nearest neighbors in the feature space are with high probability members of the same class while generally two random points from one class are not much closer to each other than points from different classes. This observation suggests that classes in feature space form sparse, loosely connected graphs instead of dense clusters. To exploit this property, we propose using a small amount of label propagation into the unlabeled space and then using a kernel PCA reconstruction error as decision boundary for the feature-space data distribution of each class. Using this method, which we call "K-Prop," we demonstrate largely improved few-shot learning performances (e.g., 83% accuracy for 1-shot 5-way classification on the RESISC45 satellite-images dataset) for datasets for which a backbone network can be trained with high within-class nearest-neighbor probabilities. We demonstrate this relationship using six different datasets.
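下面用 scikit-learn 给出"以核 PCA 重建误差作为每类特征空间分布的决策边界"这一步骤的最小示意(非官方实现;核类型、gamma、分量数以及标签传播步骤均为假设或省略)。

```python
import numpy as np
from sklearn.decomposition import KernelPCA

class KPCAClassBoundary:
    """对每个类在(可经少量标签传播扩充的)特征上拟合一个核 PCA,
    以输入空间的重建误差衡量样本属于该类的程度,误差越小越可能属于该类。"""

    def __init__(self, n_components=10, gamma=0.01):
        self.kw = dict(n_components=n_components, kernel="rbf",
                       gamma=gamma, fit_inverse_transform=True)
        self.models = {}

    def fit(self, feats, labels):
        for c in np.unique(labels):
            self.models[c] = KernelPCA(**self.kw).fit(feats[labels == c])
        return self

    def predict(self, feats):
        classes = sorted(self.models)
        errs = []
        for c in classes:
            kpca = self.models[c]
            recon = kpca.inverse_transform(kpca.transform(feats))
            errs.append(np.linalg.norm(feats - recon, axis=1))
        return np.array(classes)[np.argmin(np.stack(errs, 1), axis=1)]

# 用法示意:feats 为主干网络提取的特征(这里用两簇随机特征代替)
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, 1, (30, 64)), rng.normal(3, 1, (30, 64))])
labels = np.array([0] * 30 + [1] * 30)
clf = KPCAClassBoundary().fit(feats, labels)
print((clf.predict(feats) == labels).mean())
```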
【8】 Dyadic Sex Composition and Task Classification Using fNIRS Hyperscanning Data 标题:基于fNIRS超扫描数据的双人性别构成与任务分类 链接:https://arxiv.org/abs/2112.03911
作者:Liam A. Kruse,Allan L. Reiss,Mykel J. Kochenderfer,Stephanie Balters 备注:20th IEEE International Conference on Machine Learning and Applications 摘要:功能性近红外光谱(fNIRS)超扫描技术是一种新兴的神经成像应用,它可以测量潜在社会互动的细微神经特征。研究人员评估了性别和任务类型(如合作与竞争)对人与人互动过程中大脑间连贯性的影响。然而,目前还没有研究使用基于深度学习的方法来深入了解fNIRS超扫描环境中的性别和任务差异。这项工作提出了一种基于卷积神经网络的方法,用于对$N=222$参与者的广泛超扫描数据集进行二元性别组合和任务分类。使用动态时间扭曲计算的脑间信号相似性作为输入数据。该方法的分类准确率最高可达80%以上,为探索和理解复杂的大脑行为提供了新的途径。 摘要:Hyperscanning with functional near-infrared spectroscopy (fNIRS) is an emerging neuroimaging application that measures the nuanced neural signatures underlying social interactions. Researchers have assessed the effect of sex and task type (e.g., cooperation versus competition) on inter-brain coherence during human-to-human interactions. However, no work has yet used deep learning-based approaches to extract insights into sex and task-based differences in an fNIRS hyperscanning context. This work proposes a convolutional neural network-based approach to dyadic sex composition and task classification for an extensive hyperscanning dataset with $N = 222$ participants. Inter-brain signal similarity computed using dynamic time warping is used as the input data. The proposed approach achieves a maximum classification accuracy of greater than $80$ percent, thereby providing a new avenue for exploring and understanding complex brain behavior.
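下面给出"用动态时间规整(DTW)计算两名被试脑信号之间的相似度,再作为 CNN 输入"这一步骤的最小示意(非官方实现;通道数、归一化方式与相似度定义均为假设)。

```python
import numpy as np

def dtw_distance(x, y):
    """经典 DTW 动态规划:x, y 为一维时间序列,返回最优对齐代价。"""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def interbrain_similarity(sub1, sub2):
    """sub1, sub2: (n_channel, T) 两名被试的 fNIRS 信号。
    返回 (n_channel, n_channel) 的 DTW 相似度矩阵,可作为 CNN 的输入。"""
    n = sub1.shape[0]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = -dtw_distance(sub1[i], sub2[j])   # 距离取负作为相似度(假设)
    return sim

# 用法示意
rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=(8, 100)), rng.normal(size=(8, 100))
print(interbrain_similarity(s1, s2).shape)   # (8, 8)
```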
分割|语义相关(5篇)
【1】 VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation 标题:VISOLO:基于网格的时空聚合高效在线视频实例分割 链接:https://arxiv.org/abs/2112.04177
作者:Su Ho Han,Sukjun Hwang,Seoung Wug Oh,Yeonchool Park,Hyunwoo Kim,Min-Jung Kim,Seon Joo Kim 摘要:对于在线视频实例分割(VIS),有效地利用前一帧的信息对于实时应用至关重要。以前的大多数方法都采用两阶段方法,需要额外的计算,如RPN和ROIALLIGN,并且没有充分利用视频中的可用信息来处理VIS中的所有子任务。在本文中,我们提出了一种基于网格结构特征表示的在线视觉系统单阶段框架。基于网格的特性允许我们使用完全卷积的网络进行实时处理,还可以轻松地在不同组件中重用和共享特性。我们还引入了协作操作模块,从可用帧中聚合信息,以丰富VIS中所有子任务的功能。我们的设计充分利用以前的信息在网格形式的所有任务在VIS以一种有效的方式,我们实现了新的最先进的精度(38.6 AP和36.9 AP)和速度(40 FPS)在YouTube上的2019和2021数据集之间的在线VIS方法。 摘要:For online video instance segmentation (VIS), fully utilizing the information from previous frames in an efficient manner is essential for real-time applications. Most previous methods follow a two-stage approach requiring additional computations such as RPN and RoIAlign, and do not fully exploit the available information in the video for all subtasks in VIS. In this paper, we propose a novel single-stage framework for online VIS built based on the grid structured feature representation. The grid-based features allow us to employ fully convolutional networks for real-time processing, and also to easily reuse and share features within different components. We also introduce cooperatively operating modules that aggregate information from available frames, in order to enrich the features for all subtasks in VIS. Our design fully takes advantage of previous information in a grid form for all tasks in VIS in an efficient way, and we achieved the new state-of-the-art accuracy (38.6 AP and 36.9 AP) and speed (40.0 FPS) on YouTube-VIS 2019 and 2021 datasets among online VIS methods.
【2】 Fully Attentional Network for Semantic Segmentation 标题:一种用于语义分割的全注意力网络 链接:https://arxiv.org/abs/2112.04108
作者:Qi Song,Jie Li,Chenghong Li,Hao Guo,Rui Huang 备注:Accepted by AAAI 2022 摘要:最近的非局部自注意力方法已被证明能够有效地捕捉语义分割中的长程依赖关系。这些方法通常构建 $\mathbb{R}^{C\times C}$(通过压缩空间维度)或 $\mathbb{R}^{HW\times HW}$(通过压缩通道)的相似度图,来描述沿通道或空间维度的特征关系,其中$C$是通道数,$H$和$W$是输入特征图的空间尺寸。然而,这种做法会把特征依赖沿另一维度压缩掉,从而造成注意力缺失,可能导致小/细长类别的结果较差,或大物体内部的分割不一致。为了解决这个问题,我们提出了一种新方法,即全注意力网络(FLANet),在保持高计算效率的同时,将空间和通道注意力编码到同一张相似度图中。具体来说,通过一个新颖的全注意力模块,FLANet中的每张通道图都可以从所有其他通道图以及相应的空间位置获取特征响应。我们的新方法在三个具有挑战性的语义分割数据集上取得了最先进的性能,即在Cityscapes测试集、ADE20K验证集和PASCAL VOC测试集上分别达到83.6%、46.99%和88.5%。 摘要:Recent non-local self-attention methods have proven to be effective in capturing long-range dependencies for semantic segmentation. These methods usually form a similarity map of $\mathbb{R}^{C\times C}$ (by compressing spatial dimensions) or $\mathbb{R}^{HW\times HW}$ (by compressing channels) to describe the feature relations along either channel or spatial dimensions, where C is the number of channels, H and W are the spatial dimensions of the input feature map. However, such practices tend to condense feature dependencies along the other dimensions, hence causing attention missing, which might lead to inferior results for small/thin categories or inconsistent segmentation inside large objects. To address this problem, we propose a new approach, namely Fully Attentional Network (FLANet), to encode both spatial and channel attentions in a single similarity map while maintaining high computational efficiency. Specifically, for each channel map, our FLANet can harvest feature responses from all other channel maps, and the associated spatial positions as well, through a novel fully attentional module. Our new method has achieved state-of-the-art performance on three challenging semantic segmentation datasets, i.e., 83.6%, 46.99%, and 88.5% on the Cityscapes test set, the ADE20K validation set, and the PASCAL VOC test set, respectively.
【3】 Fully Context-Aware Image Inpainting with a Learned Semantic Pyramid 标题:基于学习语义金字塔的全上下文感知图像修复 链接:https://arxiv.org/abs/2112.04107
作者:Wendong Zhang,Yunbo Wang,Junwei Zhu,Ying Tai,Bingbing Ni,Xiaokang Yang 备注:This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 摘要:为图像中任意缺失的区域恢复合理和真实的内容是一项重要而富有挑战性的任务。尽管最近的图像修复模型在生成生动的视觉细节方面取得了重大进展,但在处理更复杂的场景时,它们仍然可能由于上下文模糊而导致纹理模糊或结构扭曲。为了解决这个问题,我们提出了语义金字塔网络(SPN),其动机是从特定的借口任务中学习多尺度语义先验可以极大地帮助恢复图像中的局部缺失内容。SPN由两部分组成。首先,它将一个借口模型中的语义先验信息提取到一个多尺度特征金字塔中,从而实现对全局上下文和局部结构的一致理解。在先验学习者中,我们提出了一个可选的变分推理模块,以实现由各种先验知识驱动的概率图像修复。SPN的第二个组件是一个完全上下文感知的图像生成器,它使用(随机)先验金字塔在多个尺度上自适应地逐步细化低级视觉表示。我们将先前的学习者和图像生成器训练为一个统一的模型,无需任何后处理。在确定性和概率修复设置下,我们的方法在多个数据集(包括Places2、Paris StreetView、CelebA和CelebA HQ)上达到了最先进的水平。 摘要:Restoring reasonable and realistic content for arbitrary missing regions in images is an important yet challenging task. Although recent image inpainting models have made significant progress in generating vivid visual details, they can still lead to texture blurring or structural distortions due to contextual ambiguity when dealing with more complex scenes. To address this issue, we propose the Semantic Pyramid Network (SPN) motivated by the idea that learning multi-scale semantic priors from specific pretext tasks can greatly benefit the recovery of locally missing content in images. SPN consists of two components. First, it distills semantic priors from a pretext model into a multi-scale feature pyramid, achieving a consistent understanding of the global context and local structures. Within the prior learner, we present an optional module for variational inference to realize probabilistic image inpainting driven by various learned priors. The second component of SPN is a fully context-aware image generator, which adaptively and progressively refines low-level visual representations at multiple scales with the (stochastic) prior pyramid. We train the prior learner and the image generator as a unified model without any post-processing. Our approach achieves the state of the art on multiple datasets, including Places2, Paris StreetView, CelebA, and CelebA-HQ, under both deterministic and probabilistic inpainting setups.
【4】 Nuclei Segmentation in Histopathology Images using Deep Learning with Local and Global Views 标题:基于局部和全局深度学习的组织病理学图像细胞核分割 链接:https://arxiv.org/abs/2112.03998
作者:Mahdi Arab Loodaricheh,Nader Karimi,Shadrokh Samavi 备注:5 pages, 5 figures 摘要:数字病理学是现代医学最重要的发展之一。病理检查是医疗协议的金标准,在诊断中起着基础性作用。最近,随着数字扫描仪的出现,组织组织病理切片现在可以数字化并存储为数字图像。因此,数字化组织病理学组织可用于计算机辅助图像分析程序和机器学习技术。细胞核的检测和分割是癌症诊断中的一些基本步骤。近年来,深度学习被用于细胞核分割。然而,用于细胞核分割的深度学习方法中的一个问题是缺乏来自补丁外的信息。本文提出了一种基于深度学习的细胞核分割方法,解决了斑块边界区域的预测失误问题。我们使用局部和全局补丁来预测最终的分割图。在多器官组织病理学数据集上的实验结果表明,我们的方法优于基线细胞核分割和流行的分割模型。 摘要:Digital pathology is one of the most significant developments in modern medicine. Pathological examinations are the gold standard of medical protocols and play a fundamental role in diagnosis. Recently, with the advent of digital scanners, tissue histopathology slides can now be digitized and stored as digital images. As a result, digitized histopathological tissues can be used in computer-aided image analysis programs and machine learning techniques. Detection and segmentation of nuclei are some of the essential steps in the diagnosis of cancers. Recently, deep learning has been used for nuclei segmentation. However, one of the problems in deep learning methods for nuclei segmentation is the lack of information from out of the patches. This paper proposes a deep learning-based approach for nuclei segmentation, which addresses the problem of misprediction in patch border areas. We use both local and global patches to predict the final segmentation map. Experimental results on the Multi-organ histopathology dataset demonstrate that our method outperforms the baseline nuclei segmentation and popular segmentation models.
【5】 BT-Unet: A self-supervised learning framework for biomedical image segmentation using Barlow Twins with U-Net models 标题:BT-Unet:基于Barlow Twins与U-Net模型的生物医学图像分割自监督学习框架 链接:https://arxiv.org/abs/2112.03916
作者:Narinder Singh Punn,Sonali Agarwal 摘要:深度学习为生物医学图像分割带来了最深远的贡献,使医学成像中的描绘过程自动化。为了完成这样的任务,需要使用大量带注释或标记的数据对模型进行训练,这些数据使用二进制掩码突出显示感兴趣的区域。然而,为如此庞大的数据高效地生成注释需要专业的生物医学分析员和大量的人工工作。这是一项乏味而昂贵的任务,同时也容易受到人为错误的影响。为了解决这个问题,提出了一个自监督学习框架BT-Unet,该框架使用巴洛双胞胎方法,通过无监督的方式减少冗余来预训练U-Net模型的编码器,以学习数据表示。之后,对整个网络进行微调,以执行实际分割。BT-Unet框架可以使用有限数量的带注释样本进行训练,同时具有大量的未注释样本,这在现实问题中是最常见的情况。通过使用标准评估指标生成有限数量标记样本的场景,在不同数据集的多个U-Net模型上验证了该框架。通过详尽的实验测试,可以观察到BT-Unet框架在这种情况下显著提高了U-Net模型的性能。 摘要:Deep learning has brought the most profound contribution towards biomedical image segmentation to automate the process of delineation in medical imaging. To accomplish such task, the models are required to be trained using huge amount of annotated or labelled data that highlights the region of interest with a binary mask. However, efficient generation of the annotations for such huge data requires expert biomedical analysts and extensive manual effort. It is a tedious and expensive task, while also being vulnerable to human error. To address this problem, a self-supervised learning framework, BT-Unet is proposed that uses the Barlow Twins approach to pre-train the encoder of a U-Net model via redundancy reduction in an unsupervised manner to learn data representation. Later, complete network is fine-tuned to perform actual segmentation. The BT-Unet framework can be trained with a limited number of annotated samples while having high number of unannotated samples, which is mostly the case in real-world problems. This framework is validated over multiple U-Net models over diverse datasets by generating scenarios of a limited number of labelled samples using standard evaluation metrics. With exhaustive experiment trials, it is observed that the BT-Unet framework enhances the performance of the U-Net models with significant margin under such circumstances.
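下面给出 Barlow Twins 目标函数的最小 PyTorch 示意,按公开文献中的标准形式实现,用于像 BT-Unet 那样对编码器做无监督预训练(非官方代码;lambda 等超参数与投影头结构均为假设)。

```python
import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    """z1, z2: 同一批图像两个增广视图经编码器+投影头得到的嵌入 (N, D)。
    最小化两视图嵌入的互相关矩阵与单位阵之间的差异,从而实现冗余消减。"""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)     # 按维度标准化
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                              # 互相关矩阵 (D, D)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()   # 对角线逼近 1:不变性项
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # 非对角线逼近 0:去冗余项
    return on_diag + lambd * off_diag

# 用法示意:z1/z2 可以来自 U-Net 编码器输出再接一个投影头
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(barlow_twins_loss(z1, z2).item())
```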
Zero/Few Shot|迁移|域适配|自适应(1篇)
【1】 Burn After Reading: Online Adaptation for Cross-domain Streaming Data 标题:阅后即焚:跨域流式数据的在线自适应 链接:https://arxiv.org/abs/2112.04345
作者:Luyu Yang,Mingfei Gao,Zeyuan Chen,Ran Xu,Abhinav Shrivastava,Chetan Ramaiah 摘要:在网络隐私的背景下,许多方法提出了复杂的隐私和安全保护措施来保护敏感数据。在本文中,我们认为:不存储任何敏感数据是最好的安全形式。因此,我们提出了一个“阅读后燃烧”的在线框架,即每个在线样本在处理后立即被删除。同时,我们将标记的公共数据和未标记的私有数据之间不可避免的分布转移作为无监督的域适配问题来解决。具体来说,我们提出了一种新的算法,旨在解决在线自适应设置的最基本挑战——缺乏不同的源-目标数据对。因此,我们设计了一种称为CroDoBo的跨域自举方法,以增加跨域的组合多样性。此外,为了充分利用不同组合之间的有价值差异,我们采用了多个学习者共同监督的训练策略。CroDoBo在四个领域适应基准上实现了最先进的在线性能。 摘要:In the context of online privacy, many methods propose complex privacy and security preserving measures to protect sensitive data. In this paper, we argue that: not storing any sensitive data is the best form of security. Thus we propose an online framework that "burns after reading", i.e. each online sample is immediately deleted after it is processed. Meanwhile, we tackle the inevitable distribution shift between the labeled public data and unlabeled private data as a problem of unsupervised domain adaptation. Specifically, we propose a novel algorithm that aims at the most fundamental challenge of the online adaptation setting--the lack of diverse source-target data pairs. Therefore, we design a Cross-Domain Bootstrapping approach, called CroDoBo, to increase the combined diversity across domains. Further, to fully exploit the valuable discrepancies among the diverse combinations, we employ the training strategy of multiple learners with co-supervision. CroDoBo achieves state-of-the-art online performance on four domain adaptation benchmarks.
半弱无监督|主动学习|不确定性(7篇)
【1】 Exploring Temporal Granularity in Self-Supervised Video Representation Learning 标题:时间粒度在自监督视频表征学习中的探索 链接:https://arxiv.org/abs/2112.04480
作者:Rui Qian,Yeqing Li,Liangzhe Yuan,Boqing Gong,Ting Liu,Matthew Brown,Serge Belongie,Ming-Hsuan Yang,Hartwig Adam,Yin Cui 摘要:本文提出了一个名为TeG的自监督学习框架来探索视频表示学习中的时间粒度。在TeG中,我们从视频中采样一个长片段,并在长片段中采样一个短片段。然后我们提取它们密集的时间嵌入。训练目标由两部分组成:一个细粒度的时间学习目标,用于最大化短片段和长片段中相应时间嵌入之间的相似性;一个持久的时间学习目标,用于将两个片段的全局嵌入合并在一起。我们的研究通过三个主要发现揭示了时间粒度的影响。1) 不同的视频任务可能需要不同时间粒度的特征。2) 有趣的是,一些被广泛认为需要时间意识的任务实际上可以通过时间持久性特征很好地解决。3) TeG的灵活性在8个视频基准上产生了最先进的结果,在大多数情况下优于有监督的预训练。 摘要:This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different video tasks may require features of different temporal granularities. 2) Intriguingly, some tasks that are widely considered to require temporal awareness can actually be well addressed by temporally persistent features. 3) The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
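下面给出 TeG 中"细粒度时间对齐 + 持久全局对齐"两部分训练目标的最小示意(非官方实现;对应时刻的取法、相似度形式与权重 alpha 均为假设)。

```python
import torch
import torch.nn.functional as F

def teg_loss(short_emb, long_emb, start, alpha=1.0):
    """short_emb: (T_s, D) 短片段的逐时刻稠密嵌入;long_emb: (T_l, D) 长片段的逐时刻嵌入;
    start: 短片段在长片段中的起始时刻。
    1) 细粒度项:短片段各时刻与长片段中对应时刻的嵌入相似度最大化;
    2) 持久项:两个片段的全局(平均池化)嵌入相互拉近。"""
    t_s = short_emb.size(0)
    corresp = long_emb[start:start + t_s]                      # 对应时刻的嵌入
    fine = 1 - F.cosine_similarity(short_emb, corresp, dim=-1).mean()
    g_short = F.normalize(short_emb.mean(0), dim=-1)
    g_long = F.normalize(long_emb.mean(0), dim=-1)
    persistent = 1 - (g_short * g_long).sum()
    return fine + alpha * persistent

# 用法示意
long_emb = torch.randn(32, 256)                                # 长片段 32 个时刻
short_emb = long_emb[10:18] + 0.1 * torch.randn(8, 256)        # 模拟位于其中的短片段
print(teg_loss(short_emb, long_emb, start=10).item())
```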
【2】 On visual self-supervision and its effect on model robustness 标题:论视觉自我监督及其对模型稳健性的影响 链接:https://arxiv.org/abs/2112.04367
作者:Michal Kucer,Diane Oyen,Garrett Kenyon 摘要:最近的自我监督方法在学习特征表示方面取得了成功,这些特征表示可以与完全监督中的特征表示相媲美,并且已经证明在几个方面对模型有益:例如,提高模型鲁棒性和分布外检测。在我们的论文中,我们进行了一项实证研究,以更准确地了解自我监督学习(作为训练前技术或对抗性训练的一部分)以何种方式影响模型对$l_2$和$l_{\infty}$对抗性干扰和自然图像损坏的鲁棒性。自我监督确实可以提高模型的稳健性,但事实证明,问题在于细节。如果简单地将自我监督损失与对抗性训练结合起来,那么当对抗性干扰小于或可与稳健模型训练时的$\epsilon_{train}$值相比较时,可以看到模型准确性的提高。但是,如果观察$\epsilon{test}\ge\epsilon{train}$的精度,则模型精度会下降。事实上,监督损失的权重越大,性能下降的幅度就越大,即损害模型的稳健性。我们确定了将自我监督添加到对抗性训练中的主要方法,并观察到使用自我监督损失优化网络参数和查找对抗性示例可以最大程度地提高模型鲁棒性,因为这可以被视为整体对抗性训练的一种形式。虽然与随机权重初始化相比,自我监督预训练在改进对抗性训练方面有好处,但如果将自我监督纳入对抗性训练,我们观察到在模型鲁棒性或准确性方面没有好处。 摘要:Recent self-supervision methods have found success in learning feature representations that could rival ones from full supervision, and have been shown to be beneficial to the model in several ways: for example improving models robustness and out-of-distribution detection. In our paper, we conduct an empirical study to understand more precisely in what way can self-supervised learning - as a pre-training technique or part of adversarial training - affects model robustness to $l_2$ and $l_{\infty}$ adversarial perturbations and natural image corruptions. Self-supervision can indeed improve model robustness, however it turns out the devil is in the details. If one simply adds self-supervision loss in tandem with adversarial training, then one sees improvement in accuracy of the model when evaluated with adversarial perturbations smaller or comparable to the value of $\epsilon_{train}$ that the robust model is trained with. However, if one observes the accuracy for $\epsilon_{test} \ge \epsilon_{train}$, the model accuracy drops. In fact, the larger the weight of the supervision loss, the larger the drop in performance, i.e. harming the robustness of the model. We identify primary ways in which self-supervision can be added to adversarial training, and observe that using a self-supervised loss to optimize both network parameters and find adversarial examples leads to the strongest improvement in model robustness, as this can be viewed as a form of ensemble adversarial training. Although self-supervised pre-training yields benefits in improving adversarial training as compared to random weight initialization, we observe no benefit in model robustness or accuracy if self-supervision is incorporated into adversarial training.
【3】 Adverse Weather Image Translation with Asymmetric and Uncertainty-aware GAN 标题:基于非对称且不确定性感知GAN的恶劣天气图像转换 链接:https://arxiv.org/abs/2112.04283
作者:Jeong-gi Kwak,Youngsaeng Jin,Yuanming Li,Dongsik Yoon,Donghyeon Kim,Hanseok Ko 备注:BMVC 2021 摘要:恶劣天气图像翻译属于无监督图像到图像(I2I)的翻译任务,其目的是将恶劣天气域(如雨夜)转换为标准域(如白天)。这是一项具有挑战性的任务,因为来自不利领域的图像存在一些伪影和信息不足。最近,许多采用生成性对抗网络(GAN)的研究在I2I翻译方面取得了显著的成功,但将其应用于恶劣天气增强方面仍存在局限性。采用基于双向循环一致性损失的对称结构作为无监督域传输方法的标准框架。然而,如果这两个领域的信息不平衡,可能会导致较差的翻译结果。为了解决这个问题,我们提出了一个新的GAN模型,即AU-GAN,它具有一个不对称的结构,用于反向域转换。我们仅在一个普通的域生成器(即雨夜->白天)中插入一个建议的特征传输网络(${T}$-net),以增强不利域图像的编码特征。此外,我们还引入了非对称特征匹配来分离编码特征。最后,我们提出了不确定性感知的循环一致性损失来解决循环重建图像的区域不确定性。通过与最新模型的定性和定量比较,我们证明了我们方法的有效性。代码可在https://github.com/jgkwak95/AU-GAN. 摘要:Adverse weather image translation belongs to the unsupervised image-to-image (I2I) translation task which aims to transfer adverse condition domain (eg, rainy night) to standard domain (eg, day). It is a challenging task because images from adverse domains have some artifacts and insufficient information. Recently, many studies employing Generative Adversarial Networks (GANs) have achieved notable success in I2I translation but there are still limitations in applying them to adverse weather enhancement. Symmetric architecture based on bidirectional cycle-consistency loss is adopted as a standard framework for unsupervised domain transfer methods. However, it can lead to inferior translation result if the two domains have imbalanced information. To address this issue, we propose a novel GAN model, i.e., AU-GAN, which has an asymmetric architecture for adverse domain translation. We insert a proposed feature transfer network (${T}$-net) in only a normal domain generator (i.e., rainy night-> day) to enhance encoded features of the adverse domain image. In addition, we introduce asymmetric feature matching for disentanglement of encoded features. Finally, we propose uncertainty-aware cycle-consistency loss to address the regional uncertainty of a cyclic reconstructed image. We demonstrate the effectiveness of our method by qualitative and quantitative comparisons with state-of-the-art models. Codes are available at https://github.com/jgkwak95/AU-GAN.
【4】 Self-Supervised Models are Continual Learners 标题:自监督模型是持续学习者 链接:https://arxiv.org/abs/2112.04215
作者:Enrico Fini,Victor G. Turrisi da Costa,Xavier Alameda-Pineda,Elisa Ricci,Karteek Alahari,Julien Mairal 摘要:自监督模型已被证明,在对未标记数据进行大规模离线训练时,其视觉表现比其监督模型更具可比性或更好。然而,在连续学习(CL)场景中,当数据按顺序呈现给模型时,它们的效率会灾难性地降低。在本文中,我们证明了通过添加一个预测网络,将表示的当前状态映射到其过去状态,自监督损失函数可以无缝地转换为CL的蒸馏机制。这使我们能够设计一个持续的自我监督视觉表征学习框架,该框架(i)显著提高学习表征的质量,(ii)与多个最先进的自我监督目标兼容,(iii)几乎不需要超参数调整。我们通过在不同的CL环境中训练六个流行的自我监督模型来证明我们方法的有效性。 摘要:Self-supervised models have been shown to produce comparable or better visual representations than their supervised counterparts when trained offline on unlabeled data at scale. However, their efficacy is catastrophically reduced in a Continual Learning (CL) scenario where data is presented to the model sequentially. In this paper, we show that self-supervised loss functions can be seamlessly converted into distillation mechanisms for CL by adding a predictor network that maps the current state of the representations to their past state. This enables us to devise a framework for Continual self-supervised visual representation Learning that (i) significantly improves the quality of the learned representations, (ii) is compatible with several state-of-the-art self-supervised objectives, and (iii) needs little to no hyperparameter tuning. We demonstrate the effectiveness of our approach empirically by training six popular self-supervised models in various CL settings.
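下面给出"在自监督损失之外附加一个把当前表征映射回旧表征的预测器,从而把自监督目标转化为蒸馏机制"这一思路的最小示意(非官方实现;预测器结构、损失形式与编码器均为假设)。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinualSSL(nn.Module):
    def __init__(self, encoder, frozen_past_encoder, dim=256):
        super().__init__()
        self.encoder = encoder                      # 当前正在训练的编码器
        self.past = frozen_past_encoder             # 上一任务结束时冻结的编码器
        for p in self.past.parameters():
            p.requires_grad_(False)
        # 预测器 g:把当前表征映射到过去表征所在的空间
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def distill_loss(self, x):
        z_now = self.encoder(x)
        z_past = self.past(x).detach()
        return 1 - F.cosine_similarity(self.predictor(z_now), z_past, dim=-1).mean()

# 用法示意:总损失 = 任意自监督损失(SimCLR / Barlow Twins 等) + 上述蒸馏损失
enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
past = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
model = ContinualSSL(enc, past)
x = torch.randn(16, 3, 32, 32)
print(model.distill_loss(x).item())
```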
【5】 GPCO: An Unsupervised Green Point Cloud Odometry Method 标题:GPCO:一种无监督的绿点云里程计方法 链接:https://arxiv.org/abs/2112.04054
作者:Pranav Kadam,Min Zhang,Shan Liu,C. -C. Jay Kuo 备注:10 pages, 5 figures 摘要:视觉里程计的目的是利用视觉传感器捕获的信息跟踪物体的增量运动。在这项工作中,我们研究了点云里程计问题,其中仅使用激光雷达(光探测和测距)获得的点云扫描来估计对象的运动轨迹。提出了一种轻量级的点云里程计解决方案,称为绿点云里程计(GPCO)方法。GPCO是一种无监督学习方法,通过匹配连续点云扫描的特征来预测对象运动。它包括三个步骤。首先,使用几何感知点采样方案从大型点云中选择判别点。其次,将视图划分为围绕对象的四个区域,并使用PointHop++方法提取点特征。第三,建立点对应关系来估计两次连续扫描之间的目标运动。在KITTI数据集上的实验证明了GPCO方法的有效性。据观察,GPCO在准确性方面优于基准深度学习方法,但它具有显著更小的模型大小和更少的训练时间。 摘要:Visual odometry aims to track the incremental motion of an object using the information captured by visual sensors. In this work, we study the point cloud odometry problem, where only the point cloud scans obtained by the LiDAR (Light Detection And Ranging) are used to estimate object's motion trajectory. A lightweight point cloud odometry solution is proposed and named the green point cloud odometry (GPCO) method. GPCO is an unsupervised learning method that predicts object motion by matching features of consecutive point cloud scans. It consists of three steps. First, a geometry-aware point sampling scheme is used to select discriminant points from the large point cloud. Second, the view is partitioned into four regions surrounding the object, and the PointHop++ method is used to extract point features. Third, point correspondences are established to estimate object motion between two consecutive scans. Experiments on the KITTI dataset are conducted to demonstrate the effectiveness of the GPCO method. It is observed that GPCO outperforms benchmarking deep learning methods in accuracy while it has a significantly smaller model size and less training time.
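GPCO 的第三步是由两次扫描之间的点对应关系估计物体运动。下面给出"由已知对应点求旋转与平移"的经典 SVD(Kabsch)解法的最小示意(非官方实现;假设对应关系已经建立,未包含前两步的采样与特征匹配)。

```python
import numpy as np

def rigid_motion_from_correspondences(P, Q):
    """P, Q: (N, 3) 两次扫描中一一对应的点。
    求使 ||R P + t - Q|| 最小的旋转 R 与平移 t(Kabsch/SVD 解法)。"""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)                  # 3x3 互协方差矩阵
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # 符号修正,防止出现反射
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    t = cq - R @ cp
    return R, t

# 用法示意:用已知运动生成对应点并验证恢复结果
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
Q = P @ R_true.T + np.array([0.5, -0.2, 1.0])
R, t = rigid_motion_from_correspondences(P, Q)
print(np.allclose(R, R_true), np.round(t, 3))
```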
【6】 Unsupervised Representation Learning via Neural Activation Coding 标题:基于神经激活编码的无监督表示学习 链接:https://arxiv.org/abs/2112.04014
作者:Yookoon Park,Sangho Lee,Gunhee Kim,David M. Blei 备注:Published in International Conference on Machine Learning (ICML), 2021 摘要:我们提出神经激活编码(NAC)作为一种新的方法,用于从未标记数据中学习深度表示,以用于下游应用。我们认为深度编码器应该最大化其在数据上的非线性表现力,以便下游预测器充分利用其表示能力。为此,NAC在噪声通信信道上最大化编码器的激活模式和数据之间的互信息。我们证明了对噪声鲁棒激活码的学习增加了ReLU编码器的不同线性区域的数量,从而获得了最大的非线性表达能力。更有趣的是,NAC学习数据的连续和离散表示,我们分别对两个下游任务进行评估:(i)对CIFAR-10和ImageNet-1K进行线性分类,以及(ii)对CIFAR-10和FLICKR-25K进行最近邻检索。实证结果表明,NAC在最近的基线(包括SimCLR和DICTURHASH)中,在这两项任务上都取得了更好或可比的性能。此外,NAC预训练为深层生成模型的训练提供了显著的益处。我们的代码可在https://github.com/yookoon/nac. 摘要:We present neural activation coding (NAC) as a novel approach for learning deep representations from unlabeled data for downstream applications. We argue that the deep encoder should maximize its nonlinear expressivity on the data for downstream predictors to take full advantage of its representation power. To this end, NAC maximizes the mutual information between activation patterns of the encoder and the data over a noisy communication channel. We show that learning for a noise-robust activation code increases the number of distinct linear regions of ReLU encoders, hence the maximum nonlinear expressivity. More interestingly, NAC learns both continuous and discrete representations of data, which we respectively evaluate on two downstream tasks: (i) linear classification on CIFAR-10 and ImageNet-1K and (ii) nearest neighbor retrieval on CIFAR-10 and FLICKR-25K. Empirical results show that NAC attains better or comparable performance on both tasks over recent baselines including SimCLR and DistillHash. In addition, NAC pretraining provides significant benefits to the training of deep generative models. Our code is available at https://github.com/yookoon/nac.
【7】 Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation 标题:基于相似度知识蒸馏的自监督视频表征辅助学习 链接:https://arxiv.org/abs/2112.04011
作者:Amirhossein Dadashzadeh,Alan Whone,Majid Mirmehdi 摘要:尽管自监督预训练方法在视频表征学习中取得了显著的成功,但当用于预训练的未标记数据集很小或源任务(预训练)中未标记数据与目标任务(微调)中标记数据之间的域差异很大时,它们的泛化能力较差。为了缓解这些问题,我们提出了一种新的方法,通过辅助预训练阶段补充自我监督预训练,该阶段基于知识相似性提取(auxSKD),以便更好地概括视频数据,例如Kinetics-100而不是Kinetics-400。我们的方法部署了一个教师网络,通过捕获未标记视频数据片段之间的相似信息,迭代地将其知识提取到学生模型。然后,学生模型通过利用这些先验知识来解决借口任务。我们还引入了一个新的借口任务,视频片段速度预测或VSPP,它要求我们的模型预测随机选择的输入视频片段的播放速度,以提供更可靠的自我监督表示。我们的实验结果表明,在K100上进行预训练时,UCF101和HMDB51数据集的结果都优于最新技术。此外,我们还表明,我们的辅助工具auxSKD作为一个额外的预训练阶段添加到最新的最先进的自我监督方法(如Videospace和RSPNet)中,可以改善UCF101和HMDB51的结果。我们的代码很快就会发布。 摘要:Despite the outstanding success of self-supervised pretraining methods for video representation learning, they generalise poorly when the unlabeled dataset for pretraining is small or the domain difference between unlabelled data in source task (pretraining) and labeled data in target task (finetuning) is significant. To mitigate these issues, we propose a novel approach to complement self-supervised pretraining via an auxiliary pretraining phase, based on knowledge similarity distillation, auxSKD, for better generalisation with a significantly smaller amount of video data, e.g. Kinetics-100 rather than Kinetics-400. Our method deploys a teacher network that iteratively distils its knowledge to the student model by capturing the similarity information between segments of unlabelled video data. The student model then solves a pretext task by exploiting this prior knowledge. We also introduce a novel pretext task, Video Segment Pace Prediction or VSPP, which requires our model to predict the playback speed of a randomly selected segment of the input video to provide more reliable self-supervised representations. Our experimental results show superior results to the state of the art on both UCF101 and HMDB51 datasets when pretraining on K100. Additionally, we show that our auxiliary pertaining, auxSKD, when added as an extra pretraining phase to recent state of the art self-supervised methods (e.g. VideoPace and RSPNet), improves their results on UCF101 and HMDB51. Our code will be released soon.
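下面给出 VSPP 借口任务的数据构造示意(非官方实现;可选速度集合与片段长度均为假设):从视频中随机取一段,再按随机倍速抽帧,让模型预测该片段的播放速度类别。

```python
import numpy as np

SPEEDS = [1, 2, 4, 8]          # 可选播放速度(假设取值)

def make_vspp_sample(video, clip_len=16):
    """video: (T, H, W, C) 帧序列。随机选一个片段并按随机速度抽帧,
    返回 (clip, label),label 为速度类别索引,供模型做分类预测。"""
    label = np.random.randint(len(SPEEDS))
    speed = SPEEDS[label]
    span = clip_len * speed                    # 该速度下需要覆盖的原始帧数
    start = np.random.randint(0, video.shape[0] - span + 1)
    clip = video[start:start + span:speed]     # 每隔 speed 帧取一帧
    return clip, label

# 用法示意
video = np.zeros((300, 112, 112, 3), dtype=np.uint8)
clip, label = make_vspp_sample(video)
print(clip.shape, "speed =", SPEEDS[label])    # (16, 112, 112, 3)
```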
时序|行为识别|姿态|视频|运动估计(2篇)
【1】 Prompting Visual-Language Models for Efficient Video Understanding 标题:提示视觉语言模型以实现高效的视频理解 链接:https://arxiv.org/abs/2112.04478
作者:Chen Ju,Tengda Han,Kunhao Zheng,Ya Zhang,Weidi Xie 备注:Project page: this https URL 摘要:视觉语言预训练在从大规模web数据学习联合视觉文本表示方面取得了巨大成功,展示了卓越的Zero-Shot概括能力。本文提出了一种简单的方法来有效地适应一个预先训练的视觉语言模型,以最小的训练新的任务,在这里,我们考虑视频理解任务。具体而言,我们建议优化一些随机向量,称为连续提示向量,将新任务转换为与训练前目标相同的格式。此外,为了弥合静态图像和视频之间的鸿沟,时间信息使用轻量级的变换器进行编码,这些变换器堆叠在逐帧的视觉特征之上。在实验上,我们进行了广泛的烧蚀研究,以分析关键部件和必要性。在动作识别、动作定位和文本视频检索的9个公共基准上,在封闭集、Few-Shot、开放集场景中,我们实现了与现有方法相比具有竞争力或最先进的性能,尽管训练的参数明显减少。 摘要:Visual-language pre-training has shown great success for learning joint visual-textual representations from large-scale web data, demonstrating remarkable ability for zero-shot generalisation. This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training, and here, we consider video understanding tasks. Specifically, we propose to optimise a few random vectors, termed as continuous prompt vectors, that convert the novel tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components and necessities. On 9 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, open-set scenarios, we achieve competitive or state-of-the-art performance to existing methods, despite training significantly fewer parameters.
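下面给出"用少量可学习的连续提示向量把新任务转换为与预训练目标相同形式"这一核心思路的最小示意(非官方实现;文本编码器接口、提示长度、时序 Transformer 结构等均为假设,视觉语言主干假定保持冻结,只训练提示向量与轻量时序模块)。

```python
import torch
import torch.nn as nn

class MeanPoolTextEncoder(nn.Module):
    """占位的"冻结文本编码器":对 token 嵌入取平均,仅用于示意接口。"""
    def forward(self, x):
        return x.mean(dim=1)

class PromptedClassifier(nn.Module):
    def __init__(self, text_encoder, class_token_embs, n_prompt=8, dim=512):
        """text_encoder: 冻结的文本编码器(假设输入 token 嵌入序列,输出句向量);
        class_token_embs: (n_class, n_token, dim) 每个类别名的 token 嵌入。"""
        super().__init__()
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.register_buffer("class_embs", class_token_embs)
        # 可学习的连续提示向量,拼接在类别名 token 之前
        self.prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)
        # 轻量时序 Transformer:聚合逐帧视觉特征,弥合图像与视频之间的差距
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=1)

    def forward(self, frame_feats):
        """frame_feats: (B, T, dim) 冻结视觉编码器提取的逐帧特征。"""
        video = self.temporal(frame_feats).mean(dim=1)               # (B, dim)
        n_class = self.class_embs.size(0)
        prompts = self.prompt.unsqueeze(0).expand(n_class, -1, -1)   # (n_class, n_prompt, dim)
        text = self.text_encoder(torch.cat([prompts, self.class_embs], dim=1))
        video = nn.functional.normalize(video, dim=-1)
        text = nn.functional.normalize(text, dim=-1)
        return video @ text.t()                                      # (B, n_class) 分类 logits

# 用法示意:10 个类别,每个类别名 4 个 token
model = PromptedClassifier(MeanPoolTextEncoder(), torch.randn(10, 4, 512))
print(model(torch.randn(2, 16, 512)).shape)    # (2, 10)
```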
【2】 SNEAK: Synonymous Sentences-Aware Adversarial Attack on Natural Language Video Localization 标题:SNEAK:自然语言视频定位的同义句感知对抗性攻击 链接:https://arxiv.org/abs/2112.04154
作者:Wenbo Gou,Wen Shi,Jian Lou,Lijie Huang,Pan Zhou,Ruixuan Li 摘要:自然语言视频定位(NLVL)是视觉语言理解领域的一项重要任务,它不仅要求分别深入理解计算机视觉和自然语言两方面,更重要的是深入理解两者之间的相互作用。对抗性漏洞已被公认为深度神经网络模型的一个关键安全问题,需要审慎的研究。尽管在视频和语言任务中分别已有广泛的研究,但目前对NLVL这类视觉-语言联合任务中对抗鲁棒性的理解还不成熟。因此,本文旨在通过从攻击和防御两个方面考察漏洞的三个方面,全面研究NLVL模型的对抗鲁棒性。为了实现攻击目标,我们提出了一种新的对抗性攻击范式,称为NLVL上的同义句感知对抗性攻击(SNEAK),它捕获了视觉和语言双方之间的跨模态相互作用。 摘要:Natural language video localization (NLVL) is an important task in the vision-language understanding area, which calls for an in-depth understanding of not only computer vision and natural language side alone, but more importantly the interplay between both sides. Adversarial vulnerability has been well-recognized as a critical security issue of deep neural network models, which requires prudent investigation. Despite its extensive yet separated studies in video and language tasks, current understanding of the adversarial robustness in vision-language joint tasks like NLVL is less developed. This paper therefore aims to comprehensively investigate the adversarial robustness of NLVL models by examining three facets of vulnerabilities from both attack and defense aspects. To achieve the attack goal, we propose a new adversarial attack paradigm called synonymous sentences-aware adversarial attack on NLVL (SNEAK), which captures the cross-modality interplay between the vision and language sides.
GAN|对抗|攻击|生成相关(3篇)
【1】 Adversarial Parametric Pose Prior 标题:对抗性参数化姿态先验 链接:https://arxiv.org/abs/2112.04203
作者:Andrey Davydov,Anastasia Remizova,Victor Constantin,Sina Honari,Mathieu Salzmann,Pascal Fua 摘要:蒙皮多人线性(SMPL)模型可以通过将姿势和形状参数映射到人体网格来表示人体。这有助于通过不同的学习模型从图像中推断3D人体姿势和形状。但是,并非所有的姿势和形状参数值都会生成物理上合理甚至真实的身体网格。换言之,SMPL是欠约束(under-constrained)的,因此,无论是直接优化其参数,还是学习从图像到这些参数的映射,使用SMPL从图像重建人体时都可能得到无效结果。因此,在本文中,我们通过对抗训练学习一个先验,将SMPL参数限制在能够产生真实姿态的取值上。我们表明,我们学习的先验涵盖了真实数据分布的多样性,有助于从二维关键点优化三维重建,并在用于图像回归时产生更好的姿势估计。我们发现,基于球面分布的先验得到了最好的结果。此外,在所有这些任务中,它在约束SMPL参数方面优于基于VAE的最新方法。 摘要:The Skinned Multi-Person Linear (SMPL) model can represent a human body by mapping pose and shape parameters to body meshes. This has been shown to facilitate inferring 3D human pose and shape from images via different learning models. However, not all pose and shape parameter values yield physically-plausible or even realistic body meshes. In other words, SMPL is under-constrained and may thus lead to invalid results when used to reconstruct humans from images, either by directly optimizing its parameters, or by learning a mapping from the image to these parameters. In this paper, we therefore learn a prior that restricts the SMPL parameters to values that produce realistic poses via adversarial training. We show that our learned prior covers the diversity of the real-data distribution, facilitates optimization for 3D reconstruction from 2D keypoints, and yields better pose estimates when used for regression from images. We found that the prior based on spherical distribution gets the best results. Furthermore, in all these tasks, it outperforms the state-of-the-art VAE-based approach to constraining the SMPL parameters.
【2】 Assessing a Single Image in Reference-Guided Image Synthesis 标题:参考制导图像合成中的单幅图像评估 链接:https://arxiv.org/abs/2112.04163
作者:Jiayi Guo,Chaoqun Du,Jiangshan Wang,Huijuan Huang,Pengfei Wan,Gao Huang 备注:Accepted by AAAI 2022 摘要:由于其现实意义,评估生成性对抗网络(GAN)的性能一直是一个重要的课题。虽然已经提出了几种评估指标,但它们通常评估的是整个生成图像分布的质量。对于参考引导图像合成(RIS)任务,即以另一参考图像的风格渲染源图像,评估单个生成图像的质量至关重要,而这些度量并不适用。在本文中,我们提出了一个通用的基于学习的框架,参考引导图像合成评估(RISA),用于定量评估单个生成图像的质量。值得注意的是,RISA的训练不需要人工注释。具体地说,RISA的训练数据由RIS训练过程中的中间模型获取,并根据图像质量和迭代次数之间的正相关性,通过模型的迭代次数进行弱注释。由于该注释作为监督信号过于粗糙,我们引入了两种技术:1)逐像素插值方案来细化粗糙标签;2)用多个二元分类器取代朴素(naive)回归器。此外,还引入了无监督对比损失,有效地捕获生成图像和参考图像之间的风格相似性。在各种数据集上的实证结果表明,RISA与人类偏好高度一致,并且在不同模型之间迁移良好。 摘要:Assessing the performance of Generative Adversarial Networks (GANs) has been an important topic due to its practical significance. Although several evaluation metrics have been proposed, they generally assess the quality of the whole generated image distribution. For Reference-guided Image Synthesis (RIS) tasks, i.e., rendering a source image in the style of another reference image, where assessing the quality of a single generated image is crucial, these metrics are not applicable. In this paper, we propose a general learning-based framework, Reference-guided Image Synthesis Assessment (RISA) to quantitatively evaluate the quality of a single generated image. Notably, the training of RISA does not require human annotations. In specific, the training data for RISA are acquired by the intermediate models from the training procedure in RIS, and weakly annotated by the number of models' iterations, based on the positive correlation between image quality and iterations. As this annotation is too coarse as a supervision signal, we introduce two techniques: 1) a pixel-wise interpolation scheme to refine the coarse labels, and 2) multiple binary classifiers to replace a naïve regressor. In addition, an unsupervised contrastive loss is introduced to effectively capture the style similarity between a generated image and its reference image. Empirical results on various datasets demonstrate that RISA is highly consistent with human preference and transfers well across models.
【3】 Feature Statistics Mixing Regularization for Generative Adversarial Networks 标题:产生式对抗性网络的特征统计混合正则化 链接:https://arxiv.org/abs/2112.04120
作者:Junho Kim,Yunjey Choi,Youngjung Uh 摘要:在生成性对抗网络中,改进鉴别器是提高生成性能的关键因素之一。由于图像分类器偏向于纹理,而去偏(debiasing)可以提高准确性,我们研究了:1)鉴别器是否存在偏差,以及2)对鉴别器去偏是否会提高生成性能。事实上,我们发现经验证据表明鉴别器对图像的风格(例如纹理和颜色)很敏感。作为补救措施,我们提出了特征统计混合正则化(FSMR),鼓励鉴别器的预测对输入图像的风格保持不变。具体而言,我们在鉴别器的特征空间中生成原始图像和参考图像的混合特征,并应用正则化,以便混合特征的预测与原始图像的预测一致。我们进行了大量实验,以证明我们的正则化降低了对风格的敏感性,并持续改进了九个数据集上各种GAN架构的性能。此外,将FSMR添加到最近提出的基于增强的GAN方法中,可进一步提高图像质量。代码将在线公开,供研究社区使用。 摘要:In generative adversarial networks, improving discriminators is one of the key components for generation performance. As image classifiers are biased toward texture and debiasing improves accuracy, we investigate 1) if the discriminators are biased, and 2) if debiasing the discriminators will improve generation performance. Indeed, we find empirical evidence that the discriminators are sensitive to the style (e.g., texture and color) of images. As a remedy, we propose feature statistics mixing regularization (FSMR) that encourages the discriminator's prediction to be invariant to the styles of input images. Specifically, we generate a mixed feature of an original and a reference image in the discriminator's feature space and we apply regularization so that the prediction for the mixed feature is consistent with the prediction for the original image. We conduct extensive experiments to demonstrate that our regularization leads to reduced sensitivity to style and consistently improves the performance of various GAN architectures on nine datasets. In addition, adding FSMR to recently-proposed augmentation-based GAN methods further improves image quality. Code will be publicly available online for the research community.
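按摘要的描述,FSMR在鉴别器特征空间中混合原始图像与参考图像的特征统计量,并要求混合特征的预测与原始特征一致。下面给出一个假设性的简化示意(非官方实现),采用类似AdaIN的通道均值/标准差插值;disc_head、alpha 等命名均为示意:

```python
import torch
import torch.nn.functional as F

def mix_feature_statistics(feat_org, feat_ref, alpha):
    """按 alpha 插值参考特征的通道均值/标准差,替换原特征的统计量(示意)。"""
    mu_o = feat_org.mean(dim=(2, 3), keepdim=True)
    std_o = feat_org.std(dim=(2, 3), keepdim=True) + 1e-6
    mu_r = feat_ref.mean(dim=(2, 3), keepdim=True)
    std_r = feat_ref.std(dim=(2, 3), keepdim=True) + 1e-6
    mu_mix = alpha * mu_o + (1 - alpha) * mu_r
    std_mix = alpha * std_o + (1 - alpha) * std_r
    return std_mix * (feat_org - mu_o) / std_o + mu_mix

def fsmr_loss(disc_head, feat_org, feat_ref):
    """一致性正则:混合特征的判别输出应与原始特征的输出一致(示意)。"""
    alpha = torch.rand(feat_org.size(0), 1, 1, 1, device=feat_org.device)
    feat_mix = mix_feature_statistics(feat_org, feat_ref, alpha)
    return F.mse_loss(disc_head(feat_mix), disc_head(feat_org).detach())
```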
自动驾驶|车辆|车道检测等(2篇)
【1】 SoK: Vehicle Orientation Representations for Deep Rotation Estimation 标题:SOK:用于深度旋转估计的车辆方位表示 链接:https://arxiv.org/abs/2112.04421
作者:Huahong Tu,Siyuan Peng,Vladimir Leung,Richard Gao 摘要:近年来,三维自主车辆目标检测算法大量涌现。然而,很少有人关注方向预测。现有的研究工作提出了各种预测方法,但尚未进行全面、结论性的综述。通过我们的实验,我们使用KITTI 3D目标检测数据集对现有各种方向表示的准确性进行分类和经验比较,并提出一种新的方向表示形式:Tricosine。其中,基于二维笛卡尔坐标的表示(或单箱)实现了最高的精度,而额外的通道输入(位置编码和深度贴图)并没有提高预测性能。我们的代码发布在Github上:https://github.com/umd-fire-coml/KITTI-orientation-learning 摘要:In recent years, an influx of 3D autonomous vehicle object detection algorithms. However, little attention was paid to orientation prediction. Existing research work proposed various prediction methods, but a holistic, conclusive review has not been conducted. Through our experiments, we categorize and empirically compare the accuracy performance of various existing orientation representations using the KITTI 3D object detection dataset, and propose a new form of orientation representation: Tricosine. Among these, the 2D Cartesian-based representation, or Single Bin, achieves the highest accuracy, with additional channeled inputs (positional encoding and depth map) not boosting prediction performance. Our code is published on Github: https://github.com/umd-fire-coml/KITTI-orientation-learning
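摘要中"基于二维笛卡尔坐标的表示(Single Bin)"通常指把朝向角回归为 (cos θ, sin θ) 的二维向量,从而避开角度的周期性断点;下面是该编码/解码方式的一个小示例(仅为通用示意,非论文代码):

```python
import numpy as np

def encode_orientation(theta):
    """把朝向角编码为二维单位向量 (cos, sin)。"""
    return np.array([np.cos(theta), np.sin(theta)])

def decode_orientation(vec):
    """由回归出的二维向量恢复角度;atan2 对向量长度不敏感。"""
    return np.arctan2(vec[1], vec[0])

theta = np.deg2rad(170.0)
assert np.isclose(decode_orientation(encode_orientation(theta)), theta)
```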
【2】 FPPN: Future Pseudo-LiDAR Frame Prediction for Autonomous Driving 标题:FPPN:面向自动驾驶的未来伪激光雷达帧预测 链接:https://arxiv.org/abs/2112.04401
作者:Xudong Huang,Chunyu Lin,Haojie Liu,Lang Nie,Yao Zhao 摘要:激光雷达传感器由于其可靠的三维空间信息,在自动驾驶中得到了广泛的应用。然而,激光雷达的数据是稀疏的,并且激光雷达的频率低于相机。为了在空间和时间上生成更密集的点云,我们提出了第一个未来伪激光雷达帧预测网络。给定连续的稀疏深度图和RGB图像,我们首先基于动态运动信息粗略地预测未来的密集深度图。为了消除光流估计的误差,提出了一种帧间聚合模块,以自适应权重融合扭曲(warp)后的深度图。然后,我们使用静态上下文信息细化预测的密集深度图。通过将预测的密集深度图转换为相应的三维点云,可以得到未来的伪激光雷达帧。实验结果表明,我们的方法在流行的KITTI基准上优于现有的解决方案。 摘要:LiDAR sensors are widely used in autonomous driving due to the reliable 3D spatial information. However, the data of LiDAR is sparse and the frequency of LiDAR is lower than that of cameras. To generate denser point clouds spatially and temporally, we propose the first future pseudo-LiDAR frame prediction network. Given the consecutive sparse depth maps and RGB images, we first predict a future dense depth map based on dynamic motion information coarsely. To eliminate the errors of optical flow estimation, an inter-frame aggregation module is proposed to fuse the warped depth maps with adaptive weights. Then, we refine the predicted dense depth map using static contextual information. The future pseudo-LiDAR frame can be obtained by converting the predicted dense depth map into corresponding 3D point clouds. Experimental results show that our method outperforms the existing solutions on the popular KITTI benchmark.
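摘要提到把预测的稠密深度图转换为三维点云,得到"伪激光雷达帧"。这一步通常按针孔相机模型做反投影,示意如下(相机内参数值仅为假设,非论文给定):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """把 (H, W) 深度图反投影为 N x 3 点云(相机坐标系,示意)。"""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]          # 丢弃无效深度

# 示例:KITTI 量级的内参(数值仅为示意)
pts = depth_to_pseudo_lidar(np.random.rand(375, 1242) * 80.0, 721.5, 721.5, 609.6, 172.9)
```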
Attention注意力(1篇)
【1】 BA-Net: Bridge Attention for Deep Convolutional Neural Networks 标题:BA-Net:用于深度卷积神经网络的桥接注意力 链接:https://arxiv.org/abs/2112.04150
作者:Yue Zhao,Junzhou Chen,Zirui Zhang,Ronghui Zhang 摘要:近年来,通道注意机制因其在改善深度卷积神经网络(CNN)性能方面的巨大潜力而被广泛研究。然而,在大多数现有方法中,仅将相邻卷积层的输出馈送到注意力层以计算通道权重,而忽略了来自其他卷积层的信息。基于这些观察,我们提出了一种简单的策略,称为桥接注意网(BA-Net),以改善通道注意机制。该设计的主要思想是通过跳跃连接桥接先前卷积层的输出,以生成通道权重。BA-Net不仅可以在前馈时提供更丰富的特征来计算通道权重,而且可以在反向传播时提供多条参数更新路径。综合评估表明,与现有方法相比,该方法在准确性和速度方面达到了最先进的性能。桥接注意为神经网络结构的设计提供了一个新的视角,在改善现有通道注意机制的性能方面显示出巨大的潜力。代码位于:https://github.com/zhaoy376/Attention-mechanism 摘要:In recent years, channel attention mechanism is widely investigated for its great potential in improving the performance of deep convolutional neural networks (CNNs). However, in most existing methods, only the output of the adjacent convolution layer is fed to the attention layer for calculating the channel weights. Information from other convolution layers is ignored. With these observations, a simple strategy, named Bridge Attention Net (BA-Net), is proposed for better channel attention mechanisms. The main idea of this design is to bridge the outputs of the previous convolution layers through skip connections for channel weights generation. BA-Net can not only provide richer features to calculate channel weight when feedforward, but also multiply paths of parameters updating when backforward. Comprehensive evaluation demonstrates that the proposed approach achieves state-of-the-art performance compared with the existing methods in regards to accuracy and speed. Bridge Attention provides a fresh perspective on the design of neural network architectures and shows great potential in improving the performance of the existing channel attention mechanisms. The code is available at https://github.com/zhaoy376/Attention-mechanism
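摘要的核心想法是用跳跃连接把此前多个卷积层的输出一并送入注意力模块来生成通道权重。下面给出一个SE风格的假设性示意(各层通道数、reduction 等细节均为猜测,非官方实现):

```python
import torch
import torch.nn as nn

class BridgeChannelAttention(nn.Module):
    """聚合多个早期层的全局平均池化特征,为目标特征图生成通道权重(示意)。"""
    def __init__(self, in_channels_list, out_channels, reduction=16):
        super().__init__()
        total = sum(in_channels_list)
        self.fc = nn.Sequential(
            nn.Linear(total, out_channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(out_channels // reduction, out_channels), nn.Sigmoid())

    def forward(self, bridged_feats, target):
        # bridged_feats: 来自前面各卷积层的特征图列表;target: [B, C, H, W]
        pooled = torch.cat([f.mean(dim=(2, 3)) for f in bridged_feats], dim=1)
        weight = self.fc(pooled).unsqueeze(-1).unsqueeze(-1)
        return target * weight
```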
人脸|人群计数(2篇)
【1】 What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods 标题:我不能预测的,我不能理解的:以人为中心的可解释性方法评估框架 链接:https://arxiv.org/abs/2112.04417
作者:Thomas Fel,Julien Colin,Remi Cadene,Thomas Serre 摘要:人们提出了多种解释方法和理论评估分数。然而,目前尚不清楚:(1)这些方法在现实世界场景中有多有用;(2)理论测量如何预测这些方法对人类实际使用的有用性。为了填补这一空白,我们大规模地开展了人类心理物理学实验,以评估人类参与者(n=1150)利用有代表性的归因方法学习预测不同图像分类器决策的能力。我们的研究结果表明,用于对可解释性方法进行评分的理论度量很难反映出个体归因方法在现实场景中的实用性。此外,个体归因方法在多大程度上帮助人类参与者预测分类器的决策,在不同的分类任务和数据集上差异很大。总的来说,我们的结果突出了该领域的基本挑战——表明迫切需要开发更好的可解释性方法,并部署以人为中心的评估方法。我们将提供我们框架的代码,以便于对新的可解释性方法进行系统评估。 摘要:A multitude of explainability methods and theoretical evaluation scores have been proposed. However, it is not yet known: (1) how useful these methods are in real-world scenarios and (2) how well theoretical measures predict the usefulness of these methods for practical use by a human. To fill this gap, we conducted human psychophysics experiments at scale to evaluate the ability of human participants (n=1,150) to leverage representative attribution methods to learn to predict the decision of different image classifiers. Our results demonstrate that theoretical measures used to score explainability methods poorly reflect the practical usefulness of individual attribution methods in real-world scenarios. Furthermore, the degree to which individual attribution methods helped human participants predict classifiers' decisions varied widely across categorization tasks and datasets. Overall, our results highlight fundamental challenges for the field -- suggesting a critical need to develop better explainability methods and to deploy human-centered evaluation approaches. We will make the code of our framework available to ease the systematic evaluation of novel explainability methods.
【2】 Geometry-Guided Progressive NeRF for Generalizable and Efficient Neural Human Rendering 标题:几何引导的渐进神经网络用于泛化高效的神经人体绘制 链接:https://arxiv.org/abs/2112.04312
作者:Mingfei Chen,Jianfeng Zhang,Xiangyu Xu,Lijuan Liu,Jiashi Feng,Shuicheng Yan 摘要:在这项工作中,我们开发了一个可推广和高效的神经辐射场(NeRF)管道,用于在具有稀疏摄影机视图的设置下进行高保真自由视点人体合成。尽管现有的基于NeRF的方法可以合成相当逼真的人体细节,但当输入具有自遮挡时,它们往往会产生较差的结果,尤其是对于稀疏视图下未见过的人。此外,这些方法通常需要大量的采样点进行渲染,这导致了效率低下,并限制了它们在现实世界中的适用性。为了应对这些挑战,我们提出了一种几何引导的渐进式NeRF(GP-NeRF)。特别是,为了更好地解决自遮挡问题,我们设计了一种几何引导的多视图特征集成方法,该方法利用估计的几何先验来整合来自输入视图的不完整信息,并为目标人体构建完整的几何体积。同时,为了获得更高的渲染效率,我们引入了一种几何引导的渐进式渲染管道,它利用几何特征体积和预测的密度值来逐步减少采样点的数量并加快渲染过程。在ZJU MoCap和THUman数据集上的实验表明,我们的方法在多个泛化设置中显著优于最新技术,同时通过应用我们高效的渐进式渲染管道,时间成本降低了70%以上。 摘要:In this work we develop a generalizable and efficient Neural Radiance Field (NeRF) pipeline for high-fidelity free-viewpoint human body synthesis under settings with sparse camera views. Though existing NeRF-based methods can synthesize rather realistic details for human body, they tend to produce poor results when the input has self-occlusion, especially for unseen humans under sparse views. Moreover, these methods often require a large number of sampling points for rendering, which leads to low efficiency and limits their real-world applicability. To address these challenges, we propose a Geometry-guided Progressive NeRF (GP-NeRF). In particular, to better tackle self-occlusion, we devise a geometry-guided multi-view feature integration approach that utilizes the estimated geometry prior to integrate the incomplete information from input views and construct a complete geometry volume for the target human body. Meanwhile, for achieving higher rendering efficiency, we introduce a geometry-guided progressive rendering pipeline, which leverages the geometric feature volume and the predicted density values to progressively reduce the number of sampling points and speed up the rendering process. Experiments on the ZJU-MoCap and THUman datasets show that our method outperforms the state-of-the-arts significantly across multiple generalization settings, while the time cost is reduced >70% via applying our efficient progressive rendering pipeline.
跟踪(1篇)
【1】 Tracking People by Predicting 3D Appearance, Location & Pose 标题:通过预测3D外观、位置和姿势来跟踪人 链接:https://arxiv.org/abs/2112.04477
作者:Jathushan Rajasegaran,Georgios Pavlakos,Angjoo Kanazawa,Jitendra Malik 备注:Project Page : this https URL 摘要:在本文中,我们提出了一种在单目视频中跟踪人的方法,通过预测他们未来的3D表现。为了实现这一点,我们首先以一种健壮的方式将人从单个帧提升到3D。此提升包括有关人员的三维姿势、其在三维空间中的位置以及三维外观的信息。当我们跟踪一个人时,我们会在轨迹表示中收集随时间变化的3D观察结果。鉴于我们观察到的3D性质,我们为前面的每个属性建立时间模型。我们使用这些模型来预测轨迹的未来状态,包括3D位置、3D外观和3D姿势。对于未来的帧,我们以概率方式计算轨迹的预测状态和单帧观测值之间的相似性。关联通过简单的匈牙利匹配来解决,匹配用于更新相应的tracklet。我们根据各种基准评估我们的方法,并报告最新的结果。 摘要:In this paper, we present an approach for tracking people in monocular videos, by predicting their future 3D representations. To achieve this, we first lift people to 3D from a single frame in a robust way. This lifting includes information about the 3D pose of the person, his or her location in the 3D space, and the 3D appearance. As we track a person, we collect 3D observations over time in a tracklet representation. Given the 3D nature of our observations, we build temporal models for each one of the previous attributes. We use these models to predict the future state of the tracklet, including 3D location, 3D appearance, and 3D pose. For a future frame, we compute the similarity between the predicted state of a tracklet and the single frame observations in a probabilistic manner. Association is solved with simple Hungarian matching, and the matches are used to update the respective tracklets. We evaluate our approach on various benchmarks and report state-of-the-art results.
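摘要中轨迹(tracklet)预测状态与单帧观测之间的关联用匈牙利匹配求解。下面用 scipy 的线性指派求解器给出示意;把相似度取负作为代价,以及 0.2 的最低相似度阈值,都是假设性的选择:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(similarity, min_sim=0.2):
    """similarity: [n_tracklet, n_detection] 相似度;返回 (tracklet, detection) 匹配对。"""
    rows, cols = linear_sum_assignment(-similarity)   # 最大化相似度 = 最小化其相反数
    return [(r, c) for r, c in zip(rows, cols) if similarity[r, c] >= min_sim]

sim = np.array([[0.9, 0.1], [0.2, 0.8], [0.05, 0.1]])
print(associate(sim))   # [(0, 0), (1, 1)];第三条轨迹在本帧没有匹配
```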
蒸馏|知识提取(1篇)
【1】 Boosting Contrastive Learning with Relation Knowledge Distillation 标题:用关系知识提炼促进对比学习 链接:https://arxiv.org/abs/2112.04174
作者:Kai Zheng,Yuanjiang Wang,Ye Yuan 备注:Accepted by AAAI-2022 摘要:虽然自监督表示学习(SSL)在大型模型中被证明是有效的,但在遵循相同的解决方案时,SSL和轻量级模型中的监督方法之间仍然存在巨大的差距。我们深入研究了这个问题,发现轻量级模型在仅仅执行实例对比时容易在语义空间中崩溃。为了解决这个问题,我们提出了一个关系知识提炼(ReKD)的关系型对比范式。我们引入了一个异构教师来显式挖掘语义信息,并将一个新的关系知识传递给学生(轻量级模型)。理论分析支持了我们对实例对比的主要关注,并验证了关系对比学习的有效性。大量的实验结果也表明,我们的方法在多个轻量级模型上取得了显著的改进。特别是,AlexNet的线性评估明显地将目前的技术水平从44.7%提高到了50.1%,这是第一次接近被监督的50.5%的工作。代码将可用。 摘要:While self-supervised representation learning (SSL) has proved to be effective in the large model, there is still a huge gap between the SSL and supervised method in the lightweight model when following the same solution. We delve into this problem and find that the lightweight model is prone to collapse in semantic space when simply performing instance-wise contrast. To address this issue, we propose a relation-wise contrastive paradigm with Relation Knowledge Distillation (ReKD). We introduce a heterogeneous teacher to explicitly mine the semantic information and transferring a novel relation knowledge to the student (lightweight model). The theoretical analysis supports our main concern about instance-wise contrast and verify the effectiveness of our relation-wise contrastive learning. Extensive experimental results also demonstrate that our method achieves significant improvements on multiple lightweight models. Particularly, the linear evaluation on AlexNet obviously improves the current state-of-art from 44.7% to 50.1%, which is the first work to get close to the supervised 50.5%. Code will be made available.
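"关系知识提炼"的一种常见实现方式,是让学生网络模仿教师网络在一个batch内样本两两相似度的分布。下面给出一个假设性示意(温度系数、KL散度形式等均为常见选择,并非论文ReKD的官方定义):

```python
import torch
import torch.nn.functional as F

def relation_distill_loss(feat_s, feat_t, tau_s=0.1, tau_t=0.05):
    """feat_s / feat_t: [B, D] 学生/教师特征;用 KL 散度对齐两者的成对相似度分布。"""
    zs = F.normalize(feat_s, dim=1)
    zt = F.normalize(feat_t, dim=1)
    sim_s = zs @ zs.t() / tau_s
    sim_t = zt @ zt.t() / tau_t
    mask = ~torch.eye(zs.size(0), dtype=torch.bool, device=zs.device)   # 去掉自身相似度
    p_t = F.softmax(sim_t[mask].view(zs.size(0), -1), dim=1)
    log_p_s = F.log_softmax(sim_s[mask].view(zs.size(0), -1), dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean")
```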
点云|SLAM|雷达|激光|深度RGBD相关(2篇)
【1】 Garment4D: Garment Reconstruction from Point Cloud Sequences 标题:Garment4D:基于点云序列的服装重建 链接:https://arxiv.org/abs/2112.04159
作者:Fangzhou Hong,Liang Pan,Zhongang Cai,Ziwei Liu 备注:Accepted to NeurIPS 2021. Project Page: this https URL . Codes are available: this https URL 摘要:学习重建3D服装对于在不同姿势下穿着不同形状的3D人体非常重要。以前的工作通常依赖于2D图像作为输入,但是会受到比例和姿势模糊的影响。为了避免由2D图像引起的问题,我们提出了一个原则性框架Garment4D,该框架使用穿着人体的3D点云序列进行服装重建。Garment4D有三个专用步骤:顺序服装注册、标准服装估计和姿势服装重建。主要的挑战有两个方面:1)有效的3D特征学习以获得细节;2)捕捉服装与人体之间的相互作用所产生的服装动态,特别是对于宽松的服装,如裙子。为了解决这些问题,我们引入了一种新的建议引导的分层特征网络和迭代图卷积网络,该网络集成了高级语义特征和低级几何特征进行精细细节重建。此外,我们还提出了一种用于平滑服装运动捕捉的时间变换器。与非参数方法不同,该方法重构的服装网格与人体是分离的,具有很强的可解释性,适用于后续任务。作为这项任务的第一次尝试,通过大量实验定性和定量地说明了高质量的重建结果。代码可在https://github.com/hongfz16/Garment4D. 摘要:Learning to reconstruct 3D garments is important for dressing 3D human bodies of different shapes in different poses. Previous works typically rely on 2D images as input, which however suffer from the scale and pose ambiguities. To circumvent the problems caused by 2D images, we propose a principled framework, Garment4D, that uses 3D point cloud sequences of dressed humans for garment reconstruction. Garment4D has three dedicated steps: sequential garments registration, canonical garment estimation, and posed garment reconstruction. The main challenges are two-fold: 1) effective 3D feature learning for fine details, and 2) capture of garment dynamics caused by the interaction between garments and the human body, especially for loose garments like skirts. To unravel these problems, we introduce a novel Proposal-Guided Hierarchical Feature Network and Iterative Graph Convolution Network, which integrate both high-level semantic features and low-level geometric features for fine details reconstruction. Furthermore, we propose a Temporal Transformer for smooth garment motions capture. Unlike non-parametric methods, the reconstructed garment meshes by our method are separable from the human body and have strong interpretability, which is desirable for downstream tasks. As the first attempt at this task, high-quality reconstruction results are qualitatively and quantitatively illustrated through extensive experiments. Codes are available at https://github.com/hongfz16/Garment4D.
【2】 Neural Points: Point Cloud Representation with Neural Fields 标题:神经点:使用神经场的点云表示 链接:https://arxiv.org/abs/2112.04148
作者:Wanquan Feng,Jin Li,Hongrui Cai,Xiaonan Luo,Juyong Zhang 摘要:在本文中,我们提出了一种新的点云表示方法\emph{Neural Points}。与传统点云表示不同,传统点云表示中的每个点仅表示三维空间中的一个位置或局部平面,神经点中的每个点通过神经场表示局部连续几何形状。因此,神经点可以表达更复杂的细节,因此具有更强的表示能力。神经点采用包含丰富几何细节的高分辨率曲面进行训练,使训练后的模型对各种形状具有足够的表达能力。具体来说,我们通过二维参数域和三维局部面片之间的局部同构来提取点上的深层局部特征并构造神经场。最后,将局部神经场集成在一起,形成全局曲面。实验结果表明,神经点具有很强的表示能力,具有良好的鲁棒性和泛化能力。使用神经点,我们可以对任意分辨率的点云进行重采样,并且它的性能大大优于最先进的点云上采样方法。 摘要:In this paper, we propose \emph{Neural Points}, a novel point cloud representation. Unlike traditional point cloud representation where each point only represents a position or a local plane in the 3D space, each point in Neural Points represents a local continuous geometric shape via neural fields. Therefore, Neural Points can express much more complex details and thus have a stronger representation ability. Neural Points is trained with high-resolution surface containing rich geometric details, such that the trained model has enough expression ability for various shapes. Specifically, we extract deep local features on the points and construct neural fields through the local isomorphism between the 2D parametric domain and the 3D local patch. In the final, local neural fields are integrated together to form the global surface. Experimental results show that Neural Points has powerful representation ability and demonstrate excellent robustness and generalization ability. With Neural Points, we can resample point cloud with arbitrary resolutions, and it outperforms state-of-the-art point cloud upsampling methods by a large margin.
3D|3D重建等相关(2篇)
【1】 What's Behind the Couch? Directed Ray Distance Functions (DRDF) for 3D Scene Reconstruction 标题:沙发后面是什么?用于三维场景重建的定向射线距离函数(DRDF) 链接:https://arxiv.org/abs/2112.04481
作者:Nilesh Kulkarni,Justin Johnson,David F. Fouhey 备注:Project Page see this https URL 摘要:我们提出了一种从未见过的RGB图像进行场景级三维重建(包括遮挡区域)的方法。我们的方法是在真实的3D扫描和图像上训练的。事实证明,由于多种原因,这个问题很难解决:真实扫描并非水密(watertight)的,这使许多方法无法适用;场景中的距离需要跨对象进行推理(这使得推理更加困难);正如我们所展示的,表面位置的不确定性促使网络产生缺乏基本距离函数特性的输出。我们提出了一种新的类距离函数,它可以在非结构化扫描上计算,并且在表面位置不确定的情况下具有良好的性能。在射线上计算此函数可进一步降低复杂度。我们训练了一个深度网络来预测这个函数,并证明它在Matterport3D、3D Front和ScanNet上优于其他方法。 摘要:We present an approach for scene-level 3D reconstruction, including occluded regions, from an unseen RGB image. Our approach is trained on real 3D scans and images. This problem has proved difficult for multiple reasons; Real scans are not watertight, precluding many methods; distances in scenes require reasoning across objects (making it even harder); and, as we show, uncertainty about surface locations motivates networks to produce outputs that lack basic distance function properties. We propose a new distance-like function that can be computed on unstructured scans and has good behavior under uncertainty about surface location. Computing this function over rays reduces the complexity further. We train a deep network to predict this function and show it outperforms other methods on Matterport3D, 3D Front, and ScanNet.
【2】 Shortest Paths in Graphs with Matrix-Valued Edges: Concepts, Algorithm and Application to 3D Multi-Shape Analysis 标题:矩阵值边图的最短路:概念、算法及其在三维多形状分析中的应用 链接:https://arxiv.org/abs/2112.04165
作者:Viktoria Ehm,Daniel Cremers,Florian Bernard 备注:published at 3DV 摘要:在图形中寻找最短路径与计算机视觉和图形学中的许多问题有关,包括图像分割、形状匹配或离散曲面上测地线距离的计算。传统上,对于具有标量边权重的图,最短路径的概念被考虑,这使得通过将各个边权重相加来计算路径的长度成为可能。然而,具有标量边权重的图在其表达能力上受到严重限制,因为经常使用边来编码显著更复杂的相互关系。在这项工作中,我们补偿了这种建模限制,并引入了新的图论概念,即具有矩阵值边的图中的最短路径。为此,我们定义了一种有意义的方法来量化矩阵值边的路径长度,并提出了一种简单而有效的算法来计算各自的最短路径。虽然我们的形式主义具有普遍性,因此适用于视觉、图形和其他领域的广泛设置,但我们将重点展示其在三维多形状分析中的优点。 摘要:Finding shortest paths in a graph is relevant for numerous problems in computer vision and graphics, including image segmentation, shape matching, or the computation of geodesic distances on discrete surfaces. Traditionally, the concept of a shortest path is considered for graphs with scalar edge weights, which makes it possible to compute the length of a path by adding up the individual edge weights. Yet, graphs with scalar edge weights are severely limited in their expressivity, since oftentimes edges are used to encode significantly more complex interrelations. In this work we compensate for this modelling limitation and introduce the novel graph-theoretic concept of a shortest path in a graph with matrix-valued edges. To this end, we define a meaningful way for quantifying the path length for matrix-valued edges, and we propose a simple yet effective algorithm to compute the respective shortest path. While our formalism is universal and thus applicable to a wide range of settings in vision, graphics and beyond, we focus on demonstrating its merits in the context of 3D multi-shape analysis.
其他神经网络|深度学习|模型|建模(7篇)
【1】 FLAVA: A Foundational Language And Vision Alignment Model 标题:FLAVA:一种基础性的语言和视觉对齐模型 链接:https://arxiv.org/abs/2112.04482
作者:Amanpreet Singh,Ronghang Hu,Vedanuj Goswami,Guillaume Couairon,Wojciech Galuba,Marcus Rohrbach,Douwe Kiela 备注:18 pages 摘要:最先进的视觉和视觉语言模型依赖于大规模的视觉语言预训练,以便在各种下游任务中获得良好的性能。一般来说,此类模型通常是跨模态(对比)或多模态(早期融合),但不是两者兼而有之;它们通常只针对特定的模式或任务。一个有前途的方向是使用一个整体的通用模型,作为一个“基础”,同时针对所有的模式——一个真正的视觉和语言基础模型应该擅长于视觉任务、语言任务和跨和多模态视觉和语言任务。我们将FLAVA作为这样一个模型引入,并在跨越这些目标模式的35项任务中展示了令人印象深刻的性能。 摘要:State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
【2】 Revisiting Contrastive Learning through the Lens of Neighborhood Component Analysis: an Integrated Framework 标题:从邻域成分分析的视角重新审视对比学习:一个完整的框架 链接:https://arxiv.org/abs/2112.04468
作者:Ching-Yun Ko,Jeet Mohapatra,Sijia Liu,Pin-Yu Chen,Luca Daniel,Lily Weng 备注:The full version of SSLNeurIPS'21 contributed talk (NeurIPS 2021 Workshop: Self-Supervised Learning - Theory and Practice). Work in progress 摘要:对比学习作为自我监督表征学习的重要工具,近年来受到了前所未有的重视。本质上,对比学习旨在利用成对的正样本和负样本进行表征学习,这涉及到在特征空间中利用邻域信息。通过研究对比学习和邻域成分分析(NCA)之间的联系,我们提出了一种新的随机最近邻对比学习观点,并随后提出了一系列优于现有对比学习的对比损失。在我们提出的框架下,我们展示了一种新的方法来设计综合对比损失,该方法可以同时在下游任务上实现良好的准确性和鲁棒性。通过集成框架,我们的标准精度提高了6%,对抗精度提高了17%。 摘要:As a seminal tool in self-supervised representation learning, contrastive learning has gained unprecedented attention in recent years. In essence, contrastive learning aims to leverage pairs of positive and negative samples for representation learning, which relates to exploiting neighborhood information in a feature space. By investigating the connection between contrastive learning and neighborhood component analysis (NCA), we provide a novel stochastic nearest neighbor viewpoint of contrastive learning and subsequently propose a series of contrastive losses that outperform the existing ones. Under our proposed framework, we show a new methodology to design integrated contrastive losses that could simultaneously achieve good accuracy and robustness on downstream tasks. With the integrated framework, we achieve up to 6\% improvement on the standard accuracy and 17\% improvement on the adversarial accuracy.
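作为背景,标准的实例级对比损失(InfoNCE)可写成下式;论文是在NCA视角下对这类损失加以推广,下式只是常见的基线形式,并非论文提出的新损失:

```latex
% InfoNCE:z_i 与 z_i^+ 为同一样本两个增广视图的特征,A(i) 为正样本与所有负样本的集合
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{\exp\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}
             {\sum_{a\in A(i)}\exp\big(\mathrm{sim}(z_i, z_a)/\tau\big)}
```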
【3】 MLP Architectures for Vision-and-Language Modeling: An Empirical Study 标题:用于视觉和语言建模的MLP体系结构:一项实证研究 链接:https://arxiv.org/abs/2112.04453
作者:Yixin Nie,Linjie Li,Zhe Gan,Shuohang Wang,Chenguang Zhu,Michael Zeng,Zicheng Liu,Mohit Bansal,Lijuan Wang 备注:15 pages 摘要:我们启动了第一个关于使用MLP架构进行视觉和语言(VL)融合的实证研究。通过对5个VL任务和5个稳健的VQA基准的大量实验,我们发现:(i)在没有预训练的情况下,使用MLP进行多模融合与Transformer相比有明显的性能差距;(ii)然而,VL预训练有助于缩小绩效差距;(iii)在MLP上增加微小的单头注意,而不是沉重的多头注意,足以实现与Transformer相当的性能。此外,我们还发现,当在更硬的鲁棒VQA基准上进行评估时,MLP和Transformer之间的性能差距并未扩大,这表明将MLP用于VL融合可以大致推广到与使用Transformer类似的程度。这些结果提示,MLP可以有效地学习对齐从低级编码器中提取的视觉和文本特征,而无需严重依赖自我注意。基于此,我们提出了一个更大胆的问题:我们是否可以有一个用于VL建模的全MLP架构,其中VL融合和视觉编码器都被MLP取代?我们的结果表明,当两个模型都经过预训练时,与最先进的全功能VL模型相比,全MLP VL模型是次优的。然而,与未经预训练的全功能Transformer型号相比,全MLP的预训练平均得分更高。这表明了大规模预训练MLP类体系结构用于VL建模的潜力,并启发了未来的研究方向,即以较少的归纳设计偏差简化已建立的VL建模。我们的代码可在以下网站公开获取:https://github.com/easonnie/mlp-vil 摘要:We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion. Through extensive experiments on 5 VL tasks and 5 robust VQA benchmarks, we find that: (i) Without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers; (ii) However, VL pre-training can help close the performance gap; (iii) Instead of heavy multi-head attention, adding tiny one-head attention to MLPs is sufficient to achieve comparable performance to transformers. Moreover, we also find that the performance gap between MLPs and transformers is not widened when being evaluated on the harder robust VQA benchmarks, suggesting using MLPs for VL fusion can generalize roughly to a similar degree as using transformers. These results hint that MLPs can effectively learn to align vision and text features extracted from lower-level encoders without heavy reliance on self-attention. Based on this, we ask an even bolder question: can we have an all-MLP architecture for VL modeling, where both VL fusion and the vision encoder are replaced with MLPs? Our result shows that an all-MLP VL model is sub-optimal compared to state-of-the-art full-featured VL models when both of them get pre-trained. However, pre-training an all-MLP can surprisingly achieve a better average score than full-featured transformer models without pre-training. This indicates the potential of large-scale pre-training of MLP-like architectures for VL modeling and inspires the future research direction on simplifying well-established VL modeling with less inductive design bias. Our code is publicly available at: https://github.com/easonnie/mlp-vil
【4】 DMRVisNet: Deep Multi-head Regression Network for Pixel-wise Visibility Estimation Under Foggy Weather 标题:DMRVisNet:雾天气下像素级能见度估计的深度多头回归网络 链接:https://arxiv.org/abs/2112.04278
作者:Jing You,Shaocheng Jia,Xin Pei,Danya Yao 备注:8 figures 摘要:场景感知对于驾驶决策和交通安全至关重要。然而,雾作为一种常见的天气,在现实世界中频繁出现,尤其是在山区,使得人们很难对周围环境进行准确的观测。因此,准确估计雾天气下的能见度对交通管理和安全具有重要意义。为了解决这一问题,目前大多数方法使用在道路固定位置配备的专业仪器进行能见度测量;这些方法既昂贵又不灵活。在本文中,我们提出了一种创新的端到端卷积神经网络框架,专门利用图像数据利用Koschmieder定律来估计可见性。该方法通过将物理模型集成到所提出的框架中来估计可见性,而不是通过卷积神经网络直接预测可见性值。此外,我们将可见性估计为像素级的可见性图,而以前的可见性测量方法仅预测整个图像的单个值。因此,我们的方法的估计结果信息量更大,特别是在雾不均匀的情况下,这有助于开发更精确的雾天气预警系统,从而更好地保护智能交通基础设施系统,促进其发展。为了验证所提出的框架,使用AirSim平台收集了一个虚拟数据集FACI,其中包含3000幅不同浓度的雾图像。详细的实验表明,该方法达到了与现有方法相媲美的性能。 摘要:Scene perception is essential for driving decision-making and traffic safety. However, fog, as a kind of common weather, frequently appears in the real world, especially in the mountain areas, making it difficult to accurately observe the surrounding environments. Therefore, precisely estimating the visibility under foggy weather can significantly benefit traffic management and safety. To address this, most current methods use professional instruments outfitted at fixed locations on the roads to perform the visibility measurement; these methods are expensive and less flexible. In this paper, we propose an innovative end-to-end convolutional neural network framework to estimate the visibility leveraging Koschmieder's law exclusively using the image data. The proposed method estimates the visibility by integrating the physical model into the proposed framework, instead of directly predicting the visibility value via the convolutional neural work. Moreover, we estimate the visibility as a pixel-wise visibility map against those of previous visibility measurement methods which solely predict a single value for an entire image. Thus, the estimated result of our method is more informative, particularly in uneven fog scenarios, which can benefit to developing a more precise early warning system for foggy weather, thereby better protecting the intelligent transportation infrastructure systems and promoting its development. To validate the proposed framework, a virtual dataset, FACI, containing 3,000 foggy images in different concentrations, is collected using the AirSim platform. Detailed experiments show that the proposed method achieves performance competitive to those of state-of-the-art methods.
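摘要提到把Koschmieder定律嵌入网络来估计能见度。其通用形式如下(符号为大气散射模型的常用记号;对比度阈值取2%只是气象上的惯例,未必与论文设置一致):

```latex
% 雾天成像模型与 Koschmieder 能见度公式(通用形式)
I(x) = J(x)\,e^{-\beta d(x)} + A\big(1 - e^{-\beta d(x)}\big),
\qquad
V = \frac{\ln(1/\varepsilon)}{\beta} \approx \frac{3.912}{\beta}\quad(\varepsilon = 0.02)
```

其中 I 为观测图像,J 为无雾场景辐射,A 为大气光,β 为消光系数,d 为景深,V 即能见度;逐像素估计 β(或 V)即可得到摘要所说的像素级能见度图。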
【5】 Symmetry Perception by Deep Networks: Inadequacy of Feed-Forward Architectures and Improvements with Recurrent Connections 标题:深层网络的对称性感知:前馈结构的不足与循环连接的改进 链接:https://arxiv.org/abs/2112.04162
作者:Shobhita Sundaram,Darius Sinha,Matthew Groth,Tomotake Sasaki,Xavier Boix 摘要:对称性在自然界中无处不在,许多物种的视觉系统都能感知到对称性,因为对称性有助于探测环境中具有生态重要性的物体类别。对称感知需要提取图像区域之间的非局部空间依赖关系,其潜在的神经机制仍然难以捉摸。在本文中,我们评估了深度神经网络(DNN)结构在学习对称感知任务中的作用。我们证明,擅长模拟人类在目标识别任务中表现的前馈DNN无法获得对称性的一般概念。即使DNN被设计为捕获非本地空间依赖性,例如通过“扩展”卷积和最近引入的“Transformer”设计,情况也是如此。相比之下,我们发现递归架构能够通过将非局部空间依赖分解为一系列可用于新图像的局部操作来学习感知对称性。这些结果表明,在人工系统中,重复连接可能在对称感知中起着重要作用,也可能在生物系统中起着重要作用。 摘要:Symmetry is omnipresent in nature and perceived by the visual system of many species, as it facilitates detecting ecologically important classes of objects in our environment. Symmetry perception requires abstraction of non-local spatial dependencies between image regions, and its underlying neural mechanisms remain elusive. In this paper, we evaluate Deep Neural Network (DNN) architectures on the task of learning symmetry perception from examples. We demonstrate that feed-forward DNNs that excel at modelling human performance on object recognition tasks, are unable to acquire a general notion of symmetry. This is the case even when the DNNs are architected to capture non-local spatial dependencies, such as through `dilated' convolutions and the recently introduced `transformers' design. By contrast, we find that recurrent architectures are capable of learning to perceive symmetry by decomposing the non-local spatial dependencies into a sequence of local operations, that are reusable for novel images. These results suggest that recurrent connections likely play an important role in symmetry perception in artificial systems, and possibly, biological ones too.
【6】 Contrastive Instruction-Trajectory Learning for Vision-Language Navigation 标题:视觉语言导航的对比教学-轨迹学习 链接:https://arxiv.org/abs/2112.04138
作者:Xiwen Liang,Fengda Zhu,Yi Zhu,Bingqian Lin,Bing Wang,Xiaodan Liang 备注:Accepted by AAAI 2022 摘要:视觉语言导航(VLN)任务要求agent在自然语言指令的指导下到达目标。以前的作品学习按照指示一步一步地导航。然而,这些工作可能无法区分指令轨迹对之间的相似性和差异,并且忽略了子指令的时间连续性。这些问题阻碍了代理学习独特的视觉和语言表示,损害了导航策略的健壮性和通用性。在本文中,我们提出了一个对比指令轨迹学习(CITL)框架,该框架探索了相似数据样本之间的不变性和不同数据样本之间的差异,以学习鲁棒导航的独特表示。具体而言,我们提出:(1)粗粒度对比学习目标,通过对比全轨迹观察和指令的语义来增强视觉和语言表征;(2) 利用子指令的时间信息感知指令的细粒度对比学习目标;(3) 一种用于对比学习的成对样本重加权机制,用于挖掘硬样本,从而减轻对比学习中数据采样偏差的影响。我们的CITL可以轻松地与VLN主干集成,形成新的学习范式,并在看不见的环境中实现更好的通用性。大量实验表明,使用CITL的模型在R2R、R4R和RxR上优于以前的最新方法。 摘要:The vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction. Previous works learn to navigate step-by-step following an instruction. However, these works may fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions. These problems hinder agents from learning distinctive vision-and-language representations, harming the robustness and generalizability of the navigation policy. In this paper, we propose a Contrastive Instruction-Trajectory Learning (CITL) framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation. Specifically, we propose: (1) a coarse-grained contrastive learning objective to enhance vision-and-language representations by contrasting semantics of full trajectory observations and instructions, respectively; (2) a fine-grained contrastive learning objective to perceive instructions by leveraging the temporal information of the sub-instructions; (3) a pairwise sample-reweighting mechanism for contrastive learning to mine hard samples and hence mitigate the influence of data sampling bias in contrastive learning. Our CITL can be easily integrated with VLN backbones to form a new learning paradigm and achieve better generalizability in unseen environments. Extensive experiments show that the model with CITL surpasses the previous state-of-the-art methods on R2R, R4R, and RxR.
【7】 GraDIRN: Learning Iterative Gradient Descent-based Energy Minimization for Deformable Image Registration 标题:GraDIRN:基于学习迭代梯度下降的能量最小化可变形图像配准 链接:https://arxiv.org/abs/2112.03915
作者:Huaqi Qiu,Kerstin Hammernik,Chen Qin,Daniel Rueckert 摘要:我们提出了一种基于梯度下降的图像配准网络(GraDIRN),通过在深度学习框架中嵌入基于梯度的迭代能量最小化来学习可变形图像配准。传统的图像配准算法通常使用迭代能量最小化优化来寻找一对图像之间的最佳变换,这在需要多次迭代时非常耗时。相比之下,最近的基于学习的方法通过训练深度神经网络来分摊这种昂贵的迭代优化,以便在训练后通过快速网络前向传递来实现一对图像的配准。基于深度学习与迭代变分能量优化数学结构相结合的图像重建技术的成功,我们提出了一种基于多分辨率梯度下降能量最小化的新型配准网络。网络的前向传递采用显式图像相异梯度步骤和卷积神经网络(CNN)参数化的广义正则化步骤,迭代次数固定。我们使用自动微分法推导出显式图像相异度梯度w.r.t.变换的正向计算图,因此可以使用任意图像相异度度量和变换模型,而无需复杂且容易出错的梯度推导。我们通过使用2D心脏MR图像和3D大脑MR图像对配准任务进行广泛评估,证明该方法在使用较少可学习参数的同时实现了最先进的配准性能。 摘要:We present a Gradient Descent-based Image Registration Network (GraDIRN) for learning deformable image registration by embedding gradient-based iterative energy minimization in a deep learning framework. Traditional image registration algorithms typically use iterative energy-minimization optimization to find the optimal transformation between a pair of images, which is time-consuming when many iterations are needed. In contrast, recent learning-based methods amortize this costly iterative optimization by training deep neural networks so that registration of one pair of images can be achieved by fast network forward pass after training. Motivated by successes in image reconstruction techniques that combine deep learning with the mathematical structure of iterative variational energy optimization, we formulate a novel registration network based on multi-resolution gradient descent energy minimization. The forward pass of the network takes explicit image dissimilarity gradient steps and generalized regularization steps parameterized by Convolutional Neural Networks (CNN) for a fixed number of iterations. We use auto-differentiation to derive the forward computational graph for the explicit image dissimilarity gradient w.r.t. the transformation, so arbitrary image dissimilarity metrics and transformation models can be used without complex and error-prone gradient derivations. We demonstrate that this approach achieves state-of-the-art registration performance while using fewer learnable parameters through extensive evaluations on registration tasks using 2D cardiac MR images and 3D brain MR images.
其他(7篇)
【1】 Audio-Visual Synchronisation in the wild 标题:野外的视听同步 链接:https://arxiv.org/abs/2112.04432
作者:Honglie Chen,Weidi Xie,Triantafyllos Afouras,Arsha Nagrani,Andrea Vedaldi,Andrew Zisserman 摘要:在本文中,我们考虑将视听同步应用于"野外"(in-the-wild)视频,即语音之外的一般类别。作为一项新任务,我们确定并整理了一个具有高视听相关性的测试集,即VGG-Sound Sync。我们比较了一些基于Transformer的架构变体,这些变体专门设计用于建模任意长度的音频和视频信号,同时显著降低训练期间的内存需求。我们进一步对整理的数据集进行深入分析,并为开放域视听同步定义评估指标。我们将我们的方法应用于标准唇读语音基准测试LRS2和LRS3,并在各个方面进行了消融实验。最后,我们在新的VGG-Sound Sync视频数据集上,为超过160个不同类别的通用视听同步设置了第一个基准。在所有情况下,我们提出的模型都大幅优于以前的先进模型。 摘要:In this paper, we consider the problem of audio-visual synchronisation applied to videos `in-the-wild' (ie of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis on the curated dataset and define an evaluation metric for open domain audio-visual synchronisation. We apply our method on standard lip reading speech benchmarks, LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state-of-the-art by a significant margin.
【2】 Feature matching for multi-epoch historical aerial images 标题:多时代历史航空影像的特征匹配 链接:https://arxiv.org/abs/2112.04255
作者:Lulin Zhang,Ewelina Rupnik,Marc Pierrot-Deseilligny 备注:None 摘要:历史图像具有高空间分辨率和立体采集的特点,为恢复三维土地覆盖信息提供了宝贵的资源。由于难以在不断演变的景观下找到足够数量的特征对应,通过自校准对历时历史图像进行精确地理参考仍然是一个瓶颈。在这项研究中,我们提出了一种全自动的方法来检测不同时间(即跨历元)拍摄的历史图像之间的特征对应关系,而不需要辅助数据。基于在同一历元内(即历元内)计算的相对方向,我们获得DSMs(数字表面模型),并将其合并到一个从粗到精的匹配中。该方法包括:(1)历元间DSMs匹配,以粗略地共同注册方向和DSMs(即3D Helmert变换),然后(2)使用原始RGB图像进行精确的历元间特征匹配。通过使用共同注册的数据缩小搜索空间,后者固有的模糊性在很大程度上得到了缓解。利用历元间特征,我们细化图像方向,并定量评估结果(1)与DoD(DSMs差异),(2)与地面检查点,以及(3)通过量化地震引起的地面位移。我们证明了我们的方法:(1)能够自动地理参考历时历史图像;(2) 能够有效地减轻由于摄像机参数估计不当而导致的系统误差;(3) 对剧烈的场景变化具有鲁棒性。与现有技术相比,我们的方法将图像的地理参考精度提高了2倍。所提出的方法在一个免费的开源摄影测量软件MicMac中实现。 摘要:Historical imagery is characterized by high spatial resolution and stereo-scopic acquisitions, providing a valuable resource for recovering 3D land-cover information. Accurate geo-referencing of diachronic historical images by means of self-calibration remains a bottleneck because of the difficulty to find sufficient amount of feature correspondences under evolving landscapes. In this research, we present a fully automatic approach to detecting feature correspondences between historical images taken at different times (i.e., inter-epoch), without auxiliary data required. Based on relative orientations computed within the same epoch (i.e., intra-epoch), we obtain DSMs (Digital Surface Model) and incorporate them in a rough-to-precise matching. The method consists of: (1) an inter-epoch DSMs matching to roughly co-register the orientations and DSMs (i.e, the 3D Helmert transformation), followed by (2) a precise inter-epoch feature matching using the original RGB images. The innate ambiguity of the latter is largely alleviated by narrowing down the search space using the co-registered data. With the inter-epoch features, we refine the image orientations and quantitatively evaluate the results (1) with DoD (Difference of DSMs), (2) with ground check points, and (3) by quantifying ground displacement due to an earthquake. We demonstrate that our method: (1) can automatically georeference diachronic historical images; (2) can effectively mitigate systematic errors induced by poorly estimated camera parameters; (3) is robust to drastic scene changes. Compared to the state-of-the-art, our method improves the image georeferencing accuracy by a factor of 2. The proposed methods are implemented in MicMac, a free, open-source photogrammetric software.
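摘要中的"3D Helmert变换"即七参数相似变换(3个平移、3个旋转、1个尺度),用于把不同历元的定向与DSM粗略共配准;其通用形式如下(符号为通用记号):

```latex
% 七参数 3D Helmert(相似)变换:尺度 s、旋转 R、平移 t
\mathbf{X}' = s\,\mathbf{R}\,\mathbf{X} + \mathbf{t},
\qquad \mathbf{R}\in SO(3),\; s>0,\; \mathbf{t}\in\mathbb{R}^{3}
```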
【3】 SimulSLT: End-to-End Simultaneous Sign Language Translation 标题:SimulSLT:端到端同步手语翻译 链接:https://arxiv.org/abs/2112.04228
作者:Aoxiong Yin,Zhou Zhao,Jinglin Liu,Weike Jin,Meng Zhang,Xingshan Zeng,Xiaofei He 备注:Accepted by ACM Multimedia 2021 摘要:手语翻译作为一种具有深远社会意义的技术,近年来引起了越来越多研究者的兴趣。然而,现有的手语翻译方法在开始翻译之前需要读完整段视频,这导致了很高的推理延迟,也限制了其在现实场景中的应用。为了解决这个问题,我们提出了第一个端到端的手语同步翻译模型SimulSLT,它可以将手语视频同时翻译成目标文本。SimulSLT由文本解码器、边界预测器和屏蔽编码器组成。我们:1)使用wait-k策略进行同步翻译;2)设计了一种新的基于integrate-and-fire模块的边界预测器来输出gloss边界,用于建模手语视频与gloss(手语词注释)之间的对应关系;3)提出了一种创新的重新编码方法,帮助模型获得更丰富的上下文信息,使现有的视频特征能够充分交互。在RWTH-PHOENIX-Weather 2014T数据集上进行的实验结果表明,SimulSLT在保持低延迟的同时,取得了超过最新端到端非同步手语翻译模型的BLEU分数,证明了我们方法的有效性。 摘要:Sign language translation as a kind of technology with profound social significance has attracted growing researchers' interest in recent years. However, the existing sign language translation methods need to read all the videos before starting the translation, which leads to a high inference latency and also limits their application in real-life scenarios. To solve this problem, we propose SimulSLT, the first end-to-end simultaneous sign language translation model, which can translate sign language videos into target text concurrently. SimulSLT is composed of a text decoder, a boundary predictor, and a masked encoder. We 1) use the wait-k strategy for simultaneous translation. 2) design a novel boundary predictor based on the integrate-and-fire module to output the gloss boundary, which is used to model the correspondence between the sign language video and the gloss. 3) propose an innovative re-encode method to help the model obtain more abundant contextual information, which allows the existing video features to interact fully. The experimental results conducted on the RWTH-PHOENIX-Weather 2014T dataset show that SimulSLT achieves BLEU scores that exceed the latest end-to-end non-simultaneous sign language translation model while maintaining low latency, which proves the effectiveness of our method.
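wait-k策略的含义是:先读入k个源端单元,此后每写出一个目标词就再读入一个源端单元,直到源端读完。下面是一个与具体模型无关的读/写调度示意(R表示读、W表示写,函数与参数均为示意):

```python
def wait_k_schedule(num_src, num_tgt, k):
    """返回动作序列:'R' 读入一个源端片段,'W' 写出一个目标词(示意)。"""
    actions, read, written = [], 0, 0
    while written < num_tgt:
        if read < min(k + written, num_src):
            actions.append("R"); read += 1
        else:
            actions.append("W"); written += 1
    return actions

print("".join(wait_k_schedule(num_src=6, num_tgt=5, k=3)))
# 输出 RRRWRWRWRWW:先读 3 个,随后读写交替,源端读完后连续写完剩余目标词
```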
【4】 Transformaly -- Two (Feature Spaces) Are Better Than One 标题:Transformaly--两个(特征空间)胜过一个 链接:https://arxiv.org/abs/2112.04185
作者:Matan Jacob Cohen,Shai Avidan 摘要:异常检测是一个成熟的研究领域,旨在识别预定分布以外的样本。异常检测管道由两个主要阶段组成:(1)特征提取和(2)正态性评分分配。最近的论文使用预先训练好的网络进行特征提取,获得了最先进的结果。然而,使用预先训练的网络并不能充分利用训练时可用的正常样本。本文建议通过师生训练来利用这一信息。在我们的设置中,使用预训练的教师网络在正常训练样本上训练学生网络。由于学生网络仅在正常样本上训练,因此在异常情况下,学生网络可能会偏离教师网络。这种差异可以作为预训练特征向量的补充表示。我们的方法——Transformaly——利用预先训练的视觉变换器(ViT)来提取两类特征向量:预训练的(不可知的)特征和师生(微调的)特征。我们在常见的单峰设置(一个类别视为正常,其余类别视为异常)和多峰设置(除一个类别外均视为正常,仅一个类别视为异常)下,都报告了最先进的AUROC结果。代码可在 https://github.com/MatanCohen1/Transformaly 获取。 摘要:Anomaly detection is a well-established research area that seeks to identify samples outside of a predetermined distribution. An anomaly detection pipeline is comprised of two main stages: (1) feature extraction and (2) normality score assignment. Recent papers used pre-trained networks for feature extraction achieving state-of-the-art results. However, the use of pre-trained networks does not fully-utilize the normal samples that are available at train time. This paper suggests taking advantage of this information by using teacher-student training. In our setting, a pretrained teacher network is used to train a student network on the normal training samples. Since the student network is trained only on normal samples, it is expected to deviate from the teacher network in abnormal cases. This difference can serve as a complementary representation to the pre-trained feature vector. Our method -- Transformaly -- exploits a pre-trained Vision Transformer (ViT) to extract both feature vectors: the pre-trained (agnostic) features and the teacher-student (fine-tuned) features. We report state-of-the-art AUROC results in both the common unimodal setting, where one class is considered normal and the rest are considered abnormal, and the multimodal setting, where all classes but one are considered normal, and just one class is considered abnormal. The code is available at https://github.com/MatanCohen1/Transformaly.
【5】 Vision-Cloud Data Fusion for ADAS: A Lane Change Prediction Case Study 标题:面向ADAS的视云数据融合:车道变化预测实例研究 链接:https://arxiv.org/abs/2112.04042
作者:Yongkang Liu,Ziran Wang,Kyungtae Han,Zhenyu Shou,Prashant Tiwari,John H. L. Hansen 备注:Published on IEEE Transactions on Intelligent Vehicles 摘要:随着智能车辆和高级驾驶员辅助系统(ADAS)的快速发展,一个新的趋势是交通系统将涉及混合级别的驾驶员参与。因此,在这种情况下,为驾驶员提供必要的视觉引导对于预防潜在风险至关重要。为了推动视觉引导系统的发展,我们引入了一种新的视觉-云数据融合方法,将云端的数字孪生信息与摄像机图像集成在一起,以帮助智能车辆做出更好的决策。目标车辆边界框是在目标检测器(在ego车辆上运行)和位置信息(从云端接收)的帮助下绘制和匹配的。以深度图像作为附加特征源,在交并比(IoU)阈值为0.7的情况下,获得了79.2%的最佳匹配精度。通过对车道变换预测的实例分析,验证了所提出的数据融合方法的有效性。在案例研究中,提出了一种多层感知器算法和改进的车道变换预测方法。从Unity game engine获得的人在回路仿真结果表明,所提出的模型可以在安全性、舒适性和环境可持续性方面显著改善公路驾驶性能。 摘要:With the rapid development of intelligent vehicles and Advanced Driver-Assistance Systems (ADAS), a new trend is that mixed levels of human driver engagements will be involved in the transportation system. Therefore, necessary visual guidance for drivers is vitally important under this situation to prevent potential risks. To advance the development of visual guidance systems, we introduce a novel vision-cloud data fusion methodology, integrating camera image and Digital Twin information from the cloud to help intelligent vehicles make better decisions. Target vehicle bounding box is drawn and matched with the help of the object detector (running on the ego-vehicle) and position information (received from the cloud). The best matching result, a 79.2% accuracy under 0.7 intersection over union threshold, is obtained with depth images served as an additional feature source. A case study on lane change prediction is conducted to show the effectiveness of the proposed data fusion methodology. In the case study, a multi-layer perceptron algorithm is proposed with modified lane change prediction approaches. Human-in-the-loop simulation results obtained from the Unity game engine reveal that the proposed model can improve highway driving performance significantly in terms of safety, comfort, and environmental sustainability.
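摘要中"交并比(IoU)阈值0.7"指检测框与由云端位置信息投影得到的框之间的重叠度量,其计算方式示意如下(通用实现,非论文代码):

```python
def iou(box_a, box_b):
    """box = (x1, y1, x2, y2);返回两个轴对齐矩形框的交并比。"""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 约 0.333;若低于 0.7 则判为不匹配
```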
【6】 Implicit Neural Representations for Image Compression 标题:用于图像压缩的隐式神经表示法 链接:https://arxiv.org/abs/2112.04267
作者:Yannick Strümpler,Janis Postels,Ren Yang,Luc van Gool,Federico Tombari 摘要:近年来,隐式神经表征(INRs)作为一种新的、有效的数据类型表征方法受到了广泛的关注。到目前为止,以前的工作主要集中在优化其重建性能。这项工作从一个新的角度研究INRs,即作为图像压缩工具。为此,我们提出了第一个基于INRs的综合压缩管道,包括量化、量化感知再训练和熵编码。使用INRs编码,即过度拟合数据样本,通常要慢几个数量级。为了缓解这一缺点,我们利用基于MAML的元学习初始化以较少的梯度更新达到编码,这通常也提高了INRs的率失真性能。我们发现,我们使用INRs进行源压缩的方法大大优于以前类似的工作,与专门为图像设计的普通压缩算法具有竞争力,并且与基于率失真自动编码器的最新学习方法相比,缩小了差距。此外,我们还对我们的方法的各个组成部分的重要性进行了广泛的研究,希望这有助于进一步研究这种新的图像压缩方法。 摘要:Recently Implicit Neural Representations (INRs) gained attention as a novel and effective representation for various data types. Thus far, prior work mostly focused on optimizing their reconstruction performance. This work investigates INRs from a novel perspective, i.e., as a tool for image compression. To this end, we propose the first comprehensive compression pipeline based on INRs including quantization, quantization-aware retraining and entropy coding. Encoding with INRs, i.e. overfitting to a data sample, is typically orders of magnitude slower. To mitigate this drawback, we leverage meta-learned initializations based on MAML to reach the encoding in fewer gradient updates which also generally improves rate-distortion performance of INRs. We find that our approach to source compression with INRs vastly outperforms similar prior work, is competitive with common compression algorithms designed specifically for images and closes the gap to state-of-the-art learned approaches based on Rate-Distortion Autoencoders. Moreover, we provide an extensive ablation study on the importance of individual components of our method which we hope facilitates future research on this novel approach to image compression.
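用隐式神经表示(INR)压缩图像的基本流程是:以像素坐标为输入、RGB为输出的小型MLP过拟合单张图像,随后对网络权重做量化与熵编码。下面是"过拟合"这一步的极简示意(网络结构与超参均为假设,且省略了量化、熵编码和元学习初始化):

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """把归一化像素坐标 (x, y) 映射到 RGB;权重本身就是图像的(待压缩)表示。"""
    def __init__(self, hidden=64, n_layers=3):
        super().__init__()
        dims = [2] + [hidden] * n_layers + [3]
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, xy):
        return torch.sigmoid(self.net(xy))

def fit(image, steps=2000, lr=1e-3):
    """image: [H, W, 3],取值在 [0, 1];返回过拟合该图像的 MLP。"""
    h, w, _ = image.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    target = image.reshape(-1, 3)
    model = CoordMLP()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return model
```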
【7】 Reverse image filtering using total derivative approximation and accelerated gradient descent 标题:基于全导数近似和加速梯度下降的图像逆滤波 链接:https://arxiv.org/abs/2112.04121
作者:Fernando J. Galetto,Guang Deng 摘要:在本文中,我们解决了一个新问题:逆转图像滤波器的作用,该滤波器可以是线性的,也可以是非线性的。假设滤波器的算法未知,并且滤波器仅作为黑盒可用。我们将此反问题表述为最小化一个基于局部图像块的代价函数,并使用全导数来近似梯度下降中所用的梯度以求解该问题。我们在傅里叶域中分析了影响收敛性和输出质量的因素。我们还研究了加速梯度下降算法在三种无梯度反向滤波器中的应用,包括本文提出的一种。我们给出了大量实验的结果,以评估该算法的复杂度和有效性。结果表明,该算法的性能优于现有的算法:(1)它的复杂度与最快的反向滤波器相同,但可以逆转更多的滤波器;(2)它可以逆转与非常复杂的反向滤波器相同的滤波器列表,但复杂度要小得多。 摘要:In this paper, we address a new problem of reversing the effect of an image filter, which can be linear or nonlinear. The assumption is that the algorithm of the filter is unknown and the filter is available as a black box. We formulate this inverse problem as minimizing a local patch-based cost function and use total derivative to approximate the gradient which is used in gradient descent to solve the problem. We analyze factors affecting the convergence and quality of the output in the Fourier domain. We also study the application of accelerated gradient descent algorithms in three gradient-free reverse filters, including the one proposed in this paper. We present results from extensive experiments to evaluate the complexity and effectiveness of the proposed algorithm. Results demonstrate that the proposed algorithm outperforms the state-of-the-art in that (1) it is at the same level of complexity as that of the fastest reverse filter, but it can reverse a larger number of filters, and (2) it can reverse the same list of filters as that of the very complex reverse filter, but its complexity is much smaller.
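作为参照,黑盒反向滤波里一个常见的定点迭代基线是 x_{k+1} = x_k + (y - f(x_k)),其中 f 为未知滤波器、y 为观测到的滤波结果;本文的方法则是最小化基于局部图像块的代价并用全导数近似其梯度。下面仅演示这个对比基线(非论文方法),滤波器以高斯模糊为例:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reverse_filter(y, black_box, iters=50):
    """用定点迭代近似求解 f(x) ≈ y 中的 x;black_box 即未知滤波器 f(示意)。"""
    x = y.copy()
    for _ in range(iters):
        x = x + (y - black_box(x))
    return x

img = np.random.rand(64, 64)
blurred = gaussian_filter(img, sigma=1.5)
restored = reverse_filter(blurred, lambda z: gaussian_filter(z, sigma=1.5))
print(np.abs(blurred - img).mean(), np.abs(restored - img).mean())   # 后者应有所减小
```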
机器翻译,仅供参考