
Computer Vision arXiv Daily Digest [9.13]

Author: arXiv每日学术速递 (WeChat official account)
Published: 2021-09-16 17:30:42

Update! The H5 page now supports collapsible abstracts for a better reading experience. Click "Read the original" to visit arxivdaily.com, which covers CS, Physics, Mathematics, Economics, Statistics, Finance, Biology and Electrical Engineering, and offers search, favorites and other features.

cs.CV: 32 papers in today's digest.

Transformer (1 paper)

【1】 Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering
Link: https://arxiv.org/abs/2109.04735

Authors: Min Peng, Chongyang Wang, Yuan Gao, Yu Shi, Xiang-Dong Zhou
Affiliations: Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; University College London; Shenzhen Institute of Artificial Intelligence and Robotics for Society
Note: Submitted to AAAI'22
Abstract: Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language understanding. While existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, the interaction between the question and the visual information for textual semantics extraction is frequently ignored. Targeting these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules, namely Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Under the guidance of such question-specific semantics, VI infers the visual clues from the local-to-global multi-level interactions between the question and the video. Within each module, we introduce a multimodal attention mechanism to aid the extraction of question-video interactions, with residual connections adopted for the information passing across different levels. Through extensive experiments on three VideoQA datasets, we demonstrate better performances of the proposed method in comparison with the state-of-the-arts.

Detection (3 papers)

【1】 Detection of GAN-synthesized street videos
Link: https://arxiv.org/abs/2109.04991

Authors: Omran Alamayreh, Mauro Barni
Affiliations: Department of Information Engineering and Mathematics, University of Siena, Via Roma, Siena, Italy
Note: 2021 29th European Association for Signal Processing (EURASIP)
Abstract: Research on the detection of AI-generated videos has focused almost exclusively on face videos, usually referred to as deepfakes. Manipulations like face swapping, face reenactment and expression manipulation have been the subject of an intense research with the development of a number of efficient tools to distinguish artificial videos from genuine ones. Much less attention has been paid to the detection of artificial non-facial videos. Yet, new tools for the generation of such kind of videos are being developed at a fast pace and will soon reach the quality level of deepfake videos. The goal of this paper is to investigate the detectability of a new kind of AI-generated videos framing driving street sequences (here referred to as DeepStreets videos), which, by their nature, can not be analysed with the same tools used for facial deepfakes. Specifically, we present a simple frame-based detector, achieving very good performance on state-of-the-art DeepStreets videos generated by the Vid2vid architecture. Noticeably, the detector retains very good performance on compressed videos, even when the compression level used during training does not match that used for the test videos.

【2】 Unsupervised Change Detection in Hyperspectral Images using Feature Fusion Deep Convolutional Autoencoders
Link: https://arxiv.org/abs/2109.04990

Authors: Debasrita Chakraborty, Ashish Ghosh
Note: 19 pages
Abstract: Binary change detection in bi-temporal co-registered hyperspectral images is a challenging task due to a large number of spectral bands present in the data. Researchers, therefore, try to handle it by reducing dimensions. The proposed work aims to build a novel feature extraction system using a feature fusion deep convolutional autoencoder for detecting changes between a pair of such bi-temporal co-registered hyperspectral images. The feature fusion considers features across successive levels and multiple receptive fields and therefore adds a competitive edge over the existing feature extraction methods. The change detection technique described is completely unsupervised and is much more elegant than other supervised or semi-supervised methods which require some amount of label information. Different methods have been applied to the extracted features to find the changes in the two images and it is found that the proposed method clearly outperformed the state of the art methods in unsupervised change detection for all the datasets.

【3】 ACFNet: Adaptively-Cooperative Fusion Network for RGB-D Salient Object Detection
Link: https://arxiv.org/abs/2109.04627

Authors: Jinchao Zhu, Feng Dong, Yan Sheng, Siyu Yan, Xianbang Meng
Abstract: The reasonable employment of RGB and depth data show great significance in promoting the development of computer vision tasks and robot-environment interaction. However, there are different advantages and disadvantages in the early and late fusion of the two types of data. Besides, due to the diversity of object information, using a single type of data in a specific scenario tends to result in semantic misleading. Based on the above considerations, we propose an adaptively-cooperative fusion network (ACFNet) with ResinRes structure for salient object detection. This structure is designed to flexibly utilize the advantages of feature fusion in early and late stages. Secondly, an adaptively-cooperative semantic guidance (ACG) scheme is designed to suppress inaccurate features in the guidance phase. Further, we proposed a type-based attention module (TAM) to optimize the network and enhance the multi-scale perception of different objects. For different objects, the features generated by different types of convolution are enhanced or suppressed by the gated mechanism for segmentation optimization. ACG and TAM optimize the transfer of feature streams according to their data attributes and convolution attributes, respectively. Sufficient experiments conducted on RGB-D SOD datasets illustrate that the proposed network performs favorably against 18 state-of-the-art algorithms.

Classification | Recognition (2 papers)

【1】 Face-NMS: A Core-set Selection Approach for Efficient Face Recognition
Link: https://arxiv.org/abs/2109.04698

Authors: Yunze Chen, Junjie Huang, Jiagang Zhu, Zheng Zhu, Tian Yang, Guan Huang, Dalong Du
Affiliations: XForwardAI Technology Co., Ltd, Beijing, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; Tsinghua University, Beijing, China
Note: Data efficient face recognition; core-set selection
Abstract: Recently, face recognition in the wild has achieved remarkable success and one key engine is the increasing size of training data. For example, the largest face dataset, WebFace42M contains about 2 million identities and 42 million faces. However, a massive number of faces raise the constraints in training time, computing resources, and memory cost. The current research on this problem mainly focuses on designing an efficient Fully-connected layer (FC) to reduce GPU memory consumption caused by a large number of identities. In this work, we relax these constraints by resolving the redundancy problem of the up-to-date face datasets caused by the greedily collecting operation (i.e. the core-set selection perspective). As the first attempt in this perspective on the face recognition problem, we find that existing methods are limited in both performance and efficiency. For superior cost-efficiency, we contribute a novel filtering strategy dubbed Face-NMS. Face-NMS works on feature space and simultaneously considers the local and global sparsity in generating core sets. In practice, Face-NMS is analogous to Non-Maximum Suppression (NMS) in the object detection community. It ranks the faces by their potential contribution to the overall sparsity and filters out the superfluous face in the pairs with high similarity for local sparsity. With respect to the efficiency aspect, Face-NMS accelerates the whole pipeline by applying a smaller but sufficient proxy dataset in training the proxy model. As a result, with Face-NMS, we successfully scale down the WebFace42M dataset to 60% while retaining its performance on the main benchmarks, offering a 40% resource-saving and 1.64 times acceleration. The code is publicly available for reference at https://github.com/HuangJunJie2017/Face-NMS.
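To make the NMS analogy concrete, here is a minimal NumPy sketch of greedy, NMS-style core-set selection in embedding space: samples are visited in order of a relevance score and a sample is dropped when it is too similar to one already kept. The scoring, similarity threshold, and toy data are placeholder assumptions, not the authors' released Face-NMS implementation.

```python
import numpy as np

def feature_space_nms(features, scores, sim_threshold=0.7):
    """Greedy NMS-style core-set selection in embedding space.

    features: (N, D) L2-normalised embeddings.
    scores:   (N,) ranking score (higher = visited earlier).
    Returns the indices of the kept samples.
    """
    order = np.argsort(-scores)                  # visit high-score samples first
    keep, kept_feats = [], []
    for idx in order:
        f = features[idx]
        if kept_feats:
            sims = np.stack(kept_feats) @ f      # cosine similarity to kept samples
            if sims.max() >= sim_threshold:      # too similar -> redundant, drop it
                continue
        keep.append(idx)
        kept_feats.append(f)
    return np.asarray(keep)

# toy usage on random unit vectors
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
core_set = feature_space_nms(feats, rng.random(1000))
print(f"kept {len(core_set)} of 1000 samples")
```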

【2】 Object recognition for robotics from tactile time series data utilising different neural network architectures
Link: https://arxiv.org/abs/2109.04573

Authors: Wolfgang Bottcher, Pedro Machado, Nikesh Lama, T. M. McGinnity
Affiliations: Computational Neurosciences and Cognitive Robotics Group, School of Science and Technology, Nottingham Trent University, Nottingham, UK; Intelligent Systems Research Centre, Ulster University, Northern Ireland, UK
Abstract: Robots need to exploit high-quality information on grasped objects to interact with the physical environment. Haptic data can therefore be used for supplementing the visual modality. This paper investigates the use of Convolutional Neural Networks (CNN) and Long-Short Term Memory (LSTM) neural network architectures for object classification on Spatio-temporal tactile grasping data. Furthermore, we compared these methods using data from two different fingertip sensors (namely the BioTac SP and WTS-FT) in the same physical setup, allowing for a realistic comparison across methods and sensors for the same tactile object classification dataset. Additionally, we propose a way to create more training examples from the recorded data. The results show that the proposed method improves the maximum accuracy from 82.4% (BioTac SP fingertips) and 90.7% (WTS-FT fingertips) with complete time-series data to about 94% for both sensor types.
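As a rough illustration of the LSTM branch of such a tactile classifier, here is a minimal PyTorch sketch; the number of taxels, sequence length, and class count are placeholder assumptions rather than the BioTac SP / WTS-FT specifics.

```python
import torch
import torch.nn as nn

class TactileLSTMClassifier(nn.Module):
    """Classify objects from tactile time series of shape (batch, time, taxels)."""
    def __init__(self, n_taxels=24, hidden=128, n_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_taxels, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (B, T, n_taxels)
        _, (h_n, _) = self.lstm(x)        # h_n: (num_layers, B, hidden)
        return self.head(h_n[-1])         # logits from the last layer's final state

# toy usage: 8 sequences, 50 time steps, 24 pressure taxels
model = TactileLSTMClassifier()
logits = model(torch.randn(8, 50, 24))
print(logits.shape)                       # torch.Size([8, 10])
```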

Segmentation | Semantics (1 paper)

【1】 S3G-ARM: Highly Compressive Visual Self-localization from Sequential Semantic Scene Graph Using Absolute and Relative Measurements
Link: https://arxiv.org/abs/2109.04569

Authors: Mitsuki Yoshida, Ryogo Yamamoto, Kanji Tanaka
Affiliations: Graduate School of Engineering, University of Fukui
Note: 6 pages, 7 figures, technical report
Abstract: In this paper, we address the problem of image sequence-based self-localization (ISS) from a new highly compressive scene representation called sequential semantic scene graph (S3G). Recent developments in deep graph convolutional neural networks (GCNs) have enabled a highly compressive visual place classifier (VPC) that can use a scene graph as the input modality. However, in such a highly compressive application, the amount of information lost in the image-to-graph mapping is significant and can damage the classification performance. To address this issue, we propose a pair of similarity-preserving mappings, image-to-nodes and image-to-edges, such that the nodes and edges act as absolute and relative features, respectively, that complement each other. Moreover, the proposed GCN-VPC is applied to a new task of viewpoint planning (VP) of the query image sequence, which contributes to further improvement in the VPC performance. Experiments using the public NCLT dataset validated the effectiveness of the proposed method.

Zero/Few-Shot | Transfer | Domain Adaptation | Adaptation (3 papers)

【1】 An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Link: https://arxiv.org/abs/2109.05014

Authors: Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
Affiliations: Microsoft Corporation
Abstract: Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3's power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples. We further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.
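The prompting step can be pictured with a small, self-contained sketch that packs a caption, optional tags, and a few in-context VQA examples into one text prompt. The template wording and the toy examples below are assumptions, and the actual GPT-3 API call is omitted.

```python
def build_vqa_prompt(caption, tags, question, in_context_examples):
    """Pack an image caption/tags and a few solved VQA examples into one text prompt.

    in_context_examples: list of dicts with keys 'caption', 'question', 'answer'.
    """
    header = "Please answer the question according to the context.\n\n"
    shots = []
    for ex in in_context_examples:
        shots.append(f"Context: {ex['caption']}\n"
                     f"Question: {ex['question']}\n"
                     f"Answer: {ex['answer']}\n")
    query = (f"Context: {caption}. Tags: {', '.join(tags)}\n"
             f"Question: {question}\n"
             f"Answer:")
    return header + "\n".join(shots) + "\n" + query

# toy usage with placeholder in-context examples
examples = [
    {"caption": "A man riding a horse on a beach",
     "question": "What animal is shown?", "answer": "horse"},
]
prompt = build_vqa_prompt(
    caption="A kitchen with a wooden table and a red kettle",
    tags=["kitchen", "kettle", "table"],
    question="What color is the kettle?",
    in_context_examples=examples,
)
print(prompt)
```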

【2】 LibFewShot: A Comprehensive Library for Few-shot Learning
Link: https://arxiv.org/abs/2109.04898

Authors: Wenbin Li, Chuanqi Dong, Pinzhuo Tian, Tiexin Qin, Xuesong Yang, Ziyi Wang, Jing Huo, Yinghuan Shi, Lei Wang, Yang Gao, Jiebo Luo
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; University of Wollongong, Australia; University of Rochester, USA
Note: 14 pages
Abstract: Few-shot learning, especially few-shot image classification, has received increasing attention and witnessed significant advances in recent years. Some recent studies implicitly show that many generic techniques or "tricks", such as data augmentation, pre-training, knowledge distillation, and self-supervision, may greatly boost the performance of a few-shot learning method. Moreover, different works may employ different software platforms, different training schedules, different backbone architectures and even different input image sizes, making fair comparisons difficult and practitioners struggle with reproducibility. To address these situations, we propose a comprehensive library for few-shot learning (LibFewShot) by re-implementing seventeen state-of-the-art few-shot learning methods in a unified framework with the same single codebase in PyTorch. Furthermore, based on LibFewShot, we provide comprehensive evaluations on multiple benchmark datasets with multiple backbone architectures to evaluate common pitfalls and effects of different training tricks. In addition, given the recent doubts on the necessity of meta- or episodic-training mechanism, our evaluation results show that such kind of mechanism is still necessary especially when combined with pre-training. We hope our work can not only lower the barriers for beginners to work on few-shot learning but also remove the effects of the nontrivial tricks to facilitate intrinsic research on few-shot learning. The source code is available from https://github.com/RL-VIG/LibFewShot.

【3】 TADA: Taxonomy Adaptive Domain Adaptation
Link: https://arxiv.org/abs/2109.04813

Authors: Rui Gong, Martin Danelljan, Dengxin Dai, Wenguan Wang, Danda Pani Paudel, Ajad Chhatkuli, Fisher Yu, Luc Van Gool
Affiliations: Computer Vision Lab, ETH Zurich; VISICS, KU Leuven; MPI for Informatics
Note: 15 pages, 5 figures, 6 tables
Abstract: Traditional domain adaptation addresses the task of adapting a model to a novel target domain under limited or no additional supervision. While tackling the input domain gap, the standard domain adaptation settings assume no domain change in the output space. In semantic prediction tasks, different datasets are often labeled according to different semantic taxonomies. In many real-world settings, the target domain task requires a different taxonomy than the one imposed by the source domain. We therefore introduce the more general taxonomy adaptive domain adaptation (TADA) problem, allowing for inconsistent taxonomies between the two domains. We further propose an approach that jointly addresses the image-level and label-level domain adaptation. On the label-level, we employ a bilateral mixed sampling strategy to augment the target domain, and a relabelling method to unify and align the label spaces. We address the image-level domain gap by proposing an uncertainty-rectified contrastive learning method, leading to more domain-invariant and class discriminative features. We extensively evaluate the effectiveness of our framework under different TADA settings: open taxonomy, coarse-to-fine taxonomy, and partially-overlapping taxonomy. Our framework outperforms previous state-of-the-art by a large margin, while capable of adapting to new target domain taxonomies.

Semi-/Weakly-/Un-supervised | Active Learning | Uncertainty (1 paper)

【1】 View Blind-spot as Inpainting: Self-Supervised Denoising with Mask Guided Residual Convolution
Link: https://arxiv.org/abs/2109.04970

Authors: Yuhongze Zhou, Liguang Zhou, Tin Lun Lam, Yangsheng Xu
Affiliations: McGill University; The Chinese University of Hong Kong, Shenzhen; Shenzhen Institute of Artificial Intelligence and Robotics for Society
Abstract: In recent years, self-supervised denoising methods have shown impressive performance, which circumvent painstaking collection procedure of noisy-clean image pairs in supervised denoising methods and boost denoising applicability in real world. One of well-known self-supervised denoising strategies is the blind-spot training scheme. However, a few works attempt to improve blind-spot based self-denoiser in the aspect of network architecture. In this paper, we take an intuitive view of blind-spot strategy and consider its process of using neighbor pixels to predict manipulated pixels as an inpainting process. Therefore, we propose a novel Mask Guided Residual Convolution (MGRConv) into common convolutional neural networks, e.g. U-Net, to promote blind-spot based denoising. Our MGRConv can be regarded as soft partial convolution and find a trade-off among partial convolution, learnable attention maps, and gated convolution. It enables dynamic mask learning with appropriate mask constrain. Different from partial convolution and gated convolution, it provides moderate freedom for network learning. It also avoids leveraging external learnable parameters for mask activation, unlike learnable attention maps. The experiments show that our proposed plug-and-play MGRConv can assist blind-spot based denoising network to reach promising results on both existing single-image based and dataset-based methods.
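The abstract positions MGRConv between partial and gated convolution. As background, here is a minimal PyTorch sketch of a partial-convolution-style masked convolution (mask-weighted convolution with coverage renormalisation and a soft mask update); it illustrates the family of operations being discussed, not the proposed MGRConv itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Module):
    """Partial-convolution-style masked convolution with soft masks.

    x:    (B, C, H, W) features; mask: (B, 1, H, W) validity in [0, 1] (0 = blind spot).
    """
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        # fixed all-ones kernel used to measure local mask coverage
        self.register_buffer("ones", torch.ones(1, 1, k, k))

    def forward(self, x, mask):
        out = self.conv(x * mask)                                    # convolve unmasked content only
        coverage = F.conv2d(mask, self.ones, padding=self.ones.shape[-1] // 2)
        out = out * (self.ones.numel() / coverage.clamp(min=1e-6))  # renormalise by valid fraction
        new_mask = (coverage / self.ones.numel()).clamp(0.0, 1.0)   # soft mask update
        return out, new_mask

# toy usage with random blind spots
layer = MaskedConv2d(3, 16)
x = torch.randn(2, 3, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.1).float()
y, m = layer(x, mask)
print(y.shape, m.shape)   # torch.Size([2, 16, 64, 64]) torch.Size([2, 1, 64, 64])
```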

Temporal | Action Recognition | Pose | Video | Motion Estimation (3 papers)

【1】 Spatio-Temporal Recurrent Networks for Event-Based Optical Flow Estimation
Link: https://arxiv.org/abs/2109.04871

Authors: Ziluo Ding, Rui Zhao, Jiyuan Zhang, Tianxiao Gao, Ruiqin Xiong, Zhaofei Yu, Tiejun Huang
Affiliations: Peking University
Abstract: Event camera has offered promising alternative for visual perception, especially in high speed and high dynamic range scenes. Recently, many deep learning methods have shown great success in providing model-free solutions to many event-based problems, such as optical flow estimation. However, existing deep learning methods did not address the importance of temporal information well from the perspective of architecture design and cannot effectively extract spatio-temporal features. Another line of research that utilizes Spiking Neural Network suffers from training issues for deeper architecture. To address these points, a novel input representation is proposed that captures the events temporal distribution for signal enhancement. Moreover, we introduce a spatio-temporal recurrent encoding-decoding neural network architecture for event-based optical flow estimation, which utilizes Convolutional Gated Recurrent Units to extract feature maps from a series of event images. Besides, our architecture allows some traditional frame-based core modules, such as correlation layer and iterative residual refine scheme, to be incorporated. The network is end-to-end trained with self-supervised learning on the Multi-Vehicle Stereo Event Camera dataset. We have shown that it outperforms all the existing state-of-the-art methods by a large margin.
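The exact input representation is not spelled out in the abstract; a common way to capture the temporal distribution of events is a voxel grid of per-bin signed event counts, sketched below in NumPy as a generic example. The sensor resolution and bin count are placeholders, not the paper's settings.

```python
import numpy as np

def events_to_voxel_grid(events, n_bins=5, height=260, width=346):
    """Bin events into a (n_bins, H, W) grid of signed event counts.

    events: (N, 4) array of [x, y, t, polarity] with polarity in {-1, +1}.
    """
    grid = np.zeros((n_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # normalise timestamps and assign each event to a temporal bin
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (n_bins - 1e-6)
    b = t_norm.astype(int)
    np.add.at(grid, (b, y, x), p)            # accumulate signed counts per bin/pixel
    return grid

# toy usage: 10k random events over a 100 ms window
rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, 346, 10_000),
               rng.integers(0, 260, 10_000),
               rng.uniform(0, 0.1, 10_000),
               rng.choice([-1.0, 1.0], 10_000)], axis=1)
print(events_to_voxel_grid(ev).shape)        # (5, 260, 346)
```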

【2】 EVOQUER: Enhancing Temporal Grounding with Video-Pivoted BackQuery Generation
Link: https://arxiv.org/abs/2109.04600

Authors: Yanjun Gao, Lulu Liu, Jason Wang, Xin Chen, Huayan Wang, Rui Zhang
Affiliations: Pennsylvania State University; Kwai Inc
Note: Accepted by Visually Grounded Interaction and Language (ViGIL) Workshop at NAACL 2021
Abstract: Temporal grounding aims to predict a time interval of a video clip corresponding to a natural language query input. In this work, we present EVOQUER, a temporal grounding framework incorporating an existing text-to-video grounding model and a video-assisted query generation network. Given a query and an untrimmed video, the temporal grounding model predicts the target interval, and the predicted video clip is fed into a video translation task by generating a simplified version of the input query. EVOQUER forms closed-loop learning by incorporating loss functions from both temporal grounding and query generation serving as feedback. Our experiments on two widely used datasets, Charades-STA and ActivityNet, show that EVOQUER achieves promising improvements by 1.05 and 1.31 at R@0.7. We also discuss how the query generation task could facilitate error analysis by explaining temporal grounding model behavior.

【3】 Automatic Portrait Video Matting via Context Motion Network
Link: https://arxiv.org/abs/2109.04598

Authors: Qiqi Hou, Charlie Wang
Affiliations: Portland State University; Intel
Abstract: Our automatic portrait video matting method does not require extra inputs. Most state-of-the-art matting methods rely on semantic segmentation methods to automatically generate the trimap. Their performance is compromised due to the lack of temporal information. Our method exploits semantic information as well as temporal information from optical flow and produces high-quality results.

GAN | Adversarial | Attacks | Generation (1 paper)

【1】 LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Link: https://arxiv.org/abs/2109.04993

Authors: Mohammad Abuzar Shaikh, Zhanghexuan Ji, Dana Moukheiber, Sargur Srihari, Mingchen Gao
Affiliations: Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY, USA
Note: 14 pages, 10 figures, 5 tables
Abstract: Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. The transformer-based models learn inter and intra-modal attention through a list of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA) will be assisted by two auxiliary tasks, GAN-based image synthesis and Image Captioning. We also propose a new evaluation metric measuring the similarity between the learnt visual and textual embedding. The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space.

Attention (1 paper)

【1】 Is Attention Better Than Matrix Decomposition?
Link: https://arxiv.org/abs/2109.04553

Authors: Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, Zhouchen Lin
Affiliations: Zhejiang Lab; Key Lab. of Machine Perception (MoE), School of EECS, Peking University; Tsinghua University; School of Data Science, Fudan University; Pazhou Lab
Note: ICLR 2021
Abstract: As an essential ingredient of modern deep learning, attention mechanism, especially self-attention, plays a vital role in the global correlation discovery. However, is hand-crafted attention irreplaceable when modeling the global context? Our intriguing finding is that self-attention is not better than the matrix decomposition (MD) model developed 20 years ago regarding the performance and computational cost for encoding the long-distance dependencies. We model the global context issue as a low-rank recovery problem and show that its optimization algorithms can help design global information blocks. This paper then proposes a series of Hamburgers, in which we employ the optimization algorithms for solving MDs to factorize the input representations into sub-matrices and reconstruct a low-rank embedding. Hamburgers with different MDs can perform favorably against the popular global context module self-attention when carefully coping with gradients back-propagated through MDs. Comprehensive experiments are conducted in the vision tasks where it is crucial to learn the global context, including semantic segmentation and image generation, demonstrating significant improvements over self-attention and its variants.
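To see how a matrix-decomposition block can stand in for self-attention as a global-context module, here is a minimal PyTorch sketch that runs a few NMF multiplicative updates on the flattened feature map and returns the low-rank reconstruction. It conveys the general idea only; the released Hamburger module additionally uses a carefully designed gradient through the optimization steps, which this sketch does not reproduce.

```python
import torch
import torch.nn.functional as F

def nmf_global_context(x, rank=16, n_iter=6, eps=1e-6):
    """Low-rank global context via a few NMF multiplicative updates.

    x: (B, C, H, W) non-negative features (e.g. after ReLU).
    Returns the rank-`rank` reconstruction with the same shape.
    """
    b, c, h, w = x.shape
    V = x.reshape(b, c, h * w).clamp(min=eps)               # (B, C, N)
    W = torch.rand(b, c, rank, device=x.device) + eps       # dictionary
    H = torch.rand(b, rank, h * w, device=x.device) + eps   # codes
    for _ in range(n_iter):                                 # multiplicative updates
        H = H * (W.transpose(1, 2) @ V) / (W.transpose(1, 2) @ W @ H + eps)
        W = W * (V @ H.transpose(1, 2)) / (W @ H @ H.transpose(1, 2) + eps)
    return (W @ H).reshape(b, c, h, w)                      # low-rank context

# toy usage: plug after a ReLU feature map
feat = F.relu(torch.randn(2, 64, 32, 32))
print(nmf_global_context(feat).shape)                       # torch.Size([2, 64, 32, 32])
```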

Distillation | Knowledge Extraction (1 paper)

【1】 Residual 3D Scene Flow Learning with Context-Aware Feature Extraction
Link: https://arxiv.org/abs/2109.04685

Authors: Guangming Wang, Yunzhe Hu, Xinrui Wu, Hesheng Wang
Affiliations: Department of Automation, Key Laboratory of System Control and Information Processing of Ministry of Education, Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Jiao Tong University
Note: 8 pages, 4 figures, under review
Abstract: Scene flow estimation is the task to predict the point-wise 3D displacement vector between two consecutive frames of point clouds, which has important application in fields such as service robots and autonomous driving. Although many previous works have explored greatly on scene flow estimation based on point clouds, we point out two problems that have not been noticed or well solved before: 1) Points of adjacent frames in repetitive patterns may be wrongly associated due to similar spatial structure in their neighbourhoods; 2) Scene flow between adjacent frames of point clouds with long-distance movement may be inaccurately estimated. To solve the first problem, we propose a novel context-aware set conv layer to exploit contextual structure information of Euclidean space and learn soft aggregation weights for local point features. Our design is inspired by human perception of contextual structure information during scene understanding. We incorporate the context-aware set conv layer in a context-aware point feature pyramid module of 3D point clouds for scene flow estimation. For the second problem, we propose an explicit residual flow learning structure in the residual flow refinement layer to cope with long-distance movement. The experiments and ablation study on FlyingThings3D and KITTI scene flow datasets demonstrate the effectiveness of each proposed component and that we solve problem of ambiguous inter-frame association and long-distance movement estimation. Quantitative results on both FlyingThings3D and KITTI scene flow datasets show that our method achieves state-of-the-art performance, surpassing all other previous works to the best of our knowledge by at least 25%.

3D | 3D Reconstruction (1 paper)

【1】 Mesh convolutional neural networks for wall shear stress estimation in 3D artery models
Link: https://arxiv.org/abs/2109.04797

Authors: Julian Suk, Pim de Haan, Phillip Lippe, Christoph Brune, Jelmer M. Wolterink
Affiliations: Department of Applied Mathematics & Technical Medical Centre, University of Twente, Enschede, The Netherlands; QUVA Lab, University of Amsterdam, Amsterdam, The Netherlands; Qualcomm AI Research, Qualcomm Technologies Netherlands B.V.
Note: (MICCAI 2021) Workshop on Statistical Atlases and Computational Modelling of the Heart (STACOM)
Abstract: Computational fluid dynamics (CFD) is a valuable tool for personalised, non-invasive evaluation of hemodynamics in arteries, but its complexity and time-consuming nature prohibit large-scale use in practice. Recently, the use of deep learning for rapid estimation of CFD parameters like wall shear stress (WSS) on surface meshes has been investigated. However, existing approaches typically depend on a hand-crafted re-parametrisation of the surface mesh to match convolutional neural network architectures. In this work, we propose to instead use mesh convolutional neural networks that directly operate on the same finite-element surface mesh as used in CFD. We train and evaluate our method on two datasets of synthetic coronary artery models with and without bifurcation, using a ground truth obtained from CFD simulation. We show that our flexible deep learning model can accurately predict 3D WSS vectors on this surface mesh. Our method processes new meshes in less than 5 [s], consistently achieves a normalised mean absolute error of ≤ 1.6 [%], and peaks at 90.5 [%] median approximation accuracy over the held-out test set, comparing favorably to previously published work. This shows the feasibility of CFD surrogate modelling using mesh convolutional neural networks for hemodynamic parameter estimation in artery models.

Other Neural Networks | Deep Learning | Models | Modeling (5 papers)

【1】 Saliency Guided Experience Packing for Replay in Continual Learning
Link: https://arxiv.org/abs/2109.04954

Authors: Gobinda Saha, Kaushik Roy
Affiliations: School of Electrical and Computer Engineering, Purdue University
Note: 13 pages, 3 figures
Abstract: Artificial learning systems aspire to mimic human intelligence by continually learning from a stream of tasks without forgetting past knowledge. One way to enable such learning is to store past experiences in the form of input examples in episodic memory and replay them when learning new tasks. However, performance of such method suffers as the size of the memory becomes smaller. In this paper, we propose a new approach for experience replay, where we select the past experiences by looking at the saliency maps which provide visual explanations for the model's decision. Guided by these saliency maps, we pack the memory with only the parts or patches of the input images important for the model's prediction. While learning a new task, we replay these memory patches with appropriate zero-padding to remind the model about its past decisions. We evaluate our algorithm on diverse image classification datasets and report better performance than the state-of-the-art approaches. With qualitative and quantitative analyses we show that our method captures richer summary of past experiences without any memory increase, and hence performs well with small episodic memory.
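A simplified sketch of the packing-and-replay idea: keep only the most salient fixed-size patch of each stored image together with its location, and paste it onto a zero canvas at replay time. The patch size and the exhaustive patch search below are illustrative assumptions, not the paper's exact selection procedure.

```python
import numpy as np

def pack_salient_patch(image, saliency, patch=12):
    """Keep only the most salient `patch` x `patch` crop of `image` plus its location.

    image:    (H, W, C) float array; saliency: (H, W) non-negative saliency map.
    """
    h, w = saliency.shape
    best, best_rc = -1.0, (0, 0)
    for r in range(0, h - patch + 1):
        for c in range(0, w - patch + 1):
            s = saliency[r:r + patch, c:c + patch].sum()
            if s > best:
                best, best_rc = s, (r, c)
    r, c = best_rc
    return image[r:r + patch, c:c + patch].copy(), best_rc

def replay_with_zero_padding(patch, loc, full_shape):
    """Paste a stored patch back onto a zero canvas for replay."""
    canvas = np.zeros(full_shape, dtype=patch.dtype)
    r, c = loc
    ph, pw = patch.shape[:2]
    canvas[r:r + ph, c:c + pw] = patch
    return canvas

# toy usage
img, sal = np.random.rand(32, 32, 3), np.random.rand(32, 32)
p, loc = pack_salient_patch(img, sal)
print(p.shape, loc, replay_with_zero_padding(p, loc, img.shape).shape)
```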

【2】 Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding
Link: https://arxiv.org/abs/2109.04872

Authors: Zhenzhi Wang, Limin Wang, Tao Wu, Tianhao Li, Gangshan Wu
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China
Note: 18 pages, 18 figures. 1st place solution to HC-STVG challenge of the 3rd PIC workshop (this http URL)
Abstract: Temporal grounding aims to temporally localize a video moment in the video whose semantics are related to a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation with a focus on designing complicated heads and fusion strategies. Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Dual Matching Network (DMN), to directly model the relations between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs from a dual matching scheme and mining negative pairs across different videos. These new negative samples could enhance the joint representation learning of two modalities via cross-modal pair discrimination to maximize their mutual information. Experiments show that DMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on DMN, we present a winner solution for STVG challenge of the 3rd PIC workshop. This suggests that metric-learning is still a promising method for temporal grounding via capturing the essential cross-modal correlation in a joint embedding space.
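The metric-learning view can be summarised by a standard symmetric InfoNCE loss over a joint embedding space, where every other moment in the batch (including moments from other videos) acts as a negative. This is a generic sketch; the paper's dual-matching negative construction is not reproduced here.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(query_emb, moment_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (query, moment) pairs.

    query_emb, moment_emb: (B, D); row i of each forms a positive pair,
    all other rows in the batch act as negatives (e.g. moments from other videos).
    """
    q = F.normalize(query_emb, dim=-1)
    m = F.normalize(moment_emb, dim=-1)
    logits = q @ m.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    loss_q2m = F.cross_entropy(logits, targets)        # query -> moment
    loss_m2q = F.cross_entropy(logits.t(), targets)    # moment -> query
    return 0.5 * (loss_q2m + loss_m2q)

# toy usage
loss = cross_modal_infonce(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```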

【3】 EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling
Link: https://arxiv.org/abs/2109.04699

Authors: Jue Wang, Haofan Wang, Jincan Deng, Weijia Wu, Debing Zhang
Affiliations: Zhongnan University of Economics and Law; Zhejiang University; Kuaishou Technology
Abstract: While large scale pre-training has achieved great achievements in bridging the gap between vision and language, it still faces several challenges. First, the cost for pre-training is expensive. Second, there is no efficient way to handle the data noise which degrades model performance. Third, previous methods only leverage limited image-text paired data, while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose an EfficientCLIP method via Ensemble Confident Learning to obtain a less noisy data subset. Extra rich non-paired single-modal text data is used for boosting the generalization of text branch. We achieve the state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 training resources compared to CLIP and WenLan, while showing excellent generalization to single-modal tasks, including text retrieval and text classification.

【4】 Efficiently Identifying Task Groupings for Multi-Task Learning
Link: https://arxiv.org/abs/2109.04617

Authors: Christopher Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, Chelsea Finn
Affiliations: Google Brain; Google Research; Stanford University
Abstract: Multi-task learning can leverage information learned by one task to benefit the training of other tasks. Despite this capacity, naively training all tasks together in one model often degrades performance, and exhaustively searching through combinations of task groupings can be prohibitively expensive. As a result, efficiently identifying the tasks that would benefit from co-training remains a challenging design question without a clear solution. In this paper, we suggest an approach to select which tasks should train together in multi-task learning models. Our method determines task groupings in a single training run by co-training all tasks together and quantifying the effect to which one task's gradient would affect another task's loss. On the large-scale Taskonomy computer vision dataset, we find this method can decrease test loss by 10.0% compared to simply training all tasks together while operating 11.6 times faster than a state-of-the-art task grouping method.
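A minimal sketch of the "lookahead" affinity measure described above: take one gradient step on the shared parameters using task i's loss and record how task j's loss changes. The toy model, loss functions, and learning rate below are placeholders, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def task_affinity(model, shared_params, batch, loss_fns, i, j, lr=1e-2):
    """How much a gradient step on task i's loss changes task j's loss.

    loss_fns[k](model, batch) must return a scalar loss for task k.
    Returns the relative decrease of task j's loss (positive = task i helps task j).
    """
    loss_j_before = loss_fns[j](model, batch).item()
    # gradient of task i's loss w.r.t. the shared parameters only
    grads = torch.autograd.grad(loss_fns[i](model, batch), shared_params)

    backup = [p.detach().clone() for p in shared_params]
    with torch.no_grad():
        for p, g in zip(shared_params, grads):       # lookahead step on shared weights
            p -= lr * g
        loss_j_after = loss_fns[j](model, batch).item()
        for p, b in zip(shared_params, backup):      # restore original weights
            p.copy_(b)
    return 1.0 - loss_j_after / max(loss_j_before, 1e-12)

# toy usage: a shared trunk with two task heads
trunk = torch.nn.Linear(8, 8)
heads = [torch.nn.Linear(8, 1), torch.nn.Linear(8, 1)]
x, y = torch.randn(32, 8), torch.randn(32, 1)
loss_fns = [lambda m, b, k=k: F.mse_loss(heads[k](m(b[0])), b[1]) for k in range(2)]
print(task_affinity(trunk, list(trunk.parameters()), (x, y), loss_fns, i=0, j=1))
```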

【5】 Automatic Displacement and Vibration Measurement in Laboratory Experiments with A Deep Learning Method
Link: https://arxiv.org/abs/2109.04960

Authors: Yongsheng Bai, Ramzi M. Abduallah, Halil Sezen, Alper Yilmaz
Affiliations: Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, Columbus, OH, USA
Abstract: This paper proposes a pipeline to automatically track and measure displacement and vibration of structural specimens during laboratory experiments. The latest Mask Regional Convolutional Neural Network (Mask R-CNN) can locate the targets and monitor their movement from videos recorded by a stationary camera. To improve precision and remove the noise, techniques such as Scale-invariant Feature Transform (SIFT) and various filters for signal processing are included. Experiments on three small-scale reinforced concrete beams and a shaking table test are utilized to verify the proposed method. Results show that the proposed deep learning method can achieve the goal to automatically and precisely measure the motion of tested structural members during laboratory experiments.
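A rough sketch of the measurement step: run an off-the-shelf Mask R-CNN on each frame, take the centroid of the best-scoring instance mask, and convert the centroid trajectory into physical displacement. The COCO-pretrained weights, score threshold, and pixel-to-millimetre scale are placeholder assumptions; the SIFT alignment and signal filters mentioned in the abstract are omitted.

```python
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def mask_centroid(frame, score_thresh=0.5):
    """Return the (x, y) centroid of the highest-scoring instance mask in a frame.

    frame: (3, H, W) float tensor in [0, 1].
    """
    with torch.no_grad():
        out = model([frame])[0]
    if len(out["scores"]) == 0 or out["scores"][0] < score_thresh:
        return None
    mask = out["masks"][0, 0] > 0.5                      # (H, W) boolean mask
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return xs.float().mean().item(), ys.float().mean().item()

def displacement_series(frames, mm_per_pixel=0.2):
    """Displacement (in mm) of the tracked target relative to the first detected frame."""
    centroids = [mask_centroid(f) for f in frames]
    ref = next((c for c in centroids if c is not None), None)
    result = []
    for c in centroids:
        if ref is None or c is None:
            result.append(None)                          # no detection in this frame
        else:
            result.append(((c[0] - ref[0]) * mm_per_pixel,
                           (c[1] - ref[1]) * mm_per_pixel))
    return result

# toy usage (real use: consecutive video frames of the specimen)
frames = [torch.rand(3, 240, 320) for _ in range(2)]
print(displacement_series(frames))
```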

Others (9 papers)

【1】 Panoptic Narrative Grounding
Link: https://arxiv.org/abs/2109.04988

Authors: C. González, N. Ayobi, I. Hernández, J. Hernández, J. Pont-Tuset, P. Arbeláez
Affiliations: Universidad de los Andes
Note: 10 pages, 6 figures, to appear at ICCV 2021 (Oral presentation)
Abstract: This paper proposes Panoptic Narrative Grounding, a spatially fine and general formulation of the natural language visual grounding problem. We establish an experimental framework for the study of this new task, including new ground truth and metrics, and we propose a strong baseline method to serve as stepping stone for future work. We exploit the intrinsic semantic richness in an image by including panoptic categories, and we approach visual grounding at a fine-grained level by using segmentations. In terms of ground truth, we propose an algorithm to automatically transfer Localized Narratives annotations to specific regions in the panoptic segmentations of the MS COCO dataset. To guarantee the quality of our annotations, we take advantage of the semantic structure contained in WordNet to exclusively incorporate noun phrases that are grounded to a meaningfully related panoptic segmentation region. The proposed baseline achieves a performance of 55.4 absolute Average Recall points. This result is a suitable foundation to push the envelope further in the development of methods for Panoptic Narrative Grounding.

【2】 Emerging AI Security Threats for Autonomous Cars -- Case Studies
Link: https://arxiv.org/abs/2109.04865

Authors: Shanthi Lekkala, Tanya Motwani, Manojkumar Parmar, Amit Phadke
Affiliations: Robert Bosch Engineering and Business Solutions Private Limited, Bengaluru, India
Note: 6 pages, 4 figures; Manuscript is accepted at ESCAR Europe 2021 conference
Abstract: Artificial Intelligence has made a significant contribution to autonomous vehicles, from object detection to path planning. However, AI models require a large amount of sensitive training data and are usually computationally intensive to build. The commercial value of such models motivates attackers to mount various attacks. Adversaries can launch model extraction attacks for monetization purposes or as a stepping-stone towards other attacks like model evasion. In specific cases, it even results in destroying brand reputation, differentiation, and value proposition. In addition, IP laws and AI-related legalities are still evolving and are not uniform across countries. We discuss model extraction attacks in detail with two use-cases and a generic kill-chain that can compromise autonomous cars. It is essential to investigate strategies to manage and mitigate the risk of model theft.

【3】 Temporally Coherent Person Matting Trained on Fake-Motion Dataset
Link: https://arxiv.org/abs/2109.04843

Authors: Ivan Molodetskikh, Mikhail Erofeev, Andrey Moskalenko, Dmitry Vatolin
Affiliations: Lomonosov Moscow State University, Moscow, Russia
Note: 13 pages, 5 figures
Abstract: We propose a novel neural-network-based method to perform matting of videos depicting people that does not require additional user input such as trimaps. Our architecture achieves temporal stability of the resulting alpha mattes by using motion-estimation-based smoothing of image-segmentation algorithm outputs, combined with convolutional-LSTM modules on U-Net skip connections. We also propose a fake-motion algorithm that generates training clips for the video-matting network given photos with ground-truth alpha mattes and background videos. We apply random motion to photos and their mattes to simulate movement one would find in real videos and composite the result with the background clips. It lets us train a deep neural network operating on videos in the absence of a large annotated video dataset and provides ground-truth training-clip foreground optical flow for use in loss functions.

【4】 Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization
Link: https://arxiv.org/abs/2109.04753

Authors: Sungho Yoon, Ayoung Kim
Affiliations: Department of Mechanical Engineering
Abstract: Along with feature points for image matching, line features provide additional constraints to solve visual geometric problems in robotics and computer vision (CV). Although recent convolutional neural network (CNN)-based line descriptors are promising for viewpoint changes or dynamic environments, we claim that the CNN architecture has innate disadvantages to abstract variable line length into the fixed-dimensional descriptor. In this paper, we effectively introduce Line-Transformers dealing with variable lines. Inspired by natural language processing (NLP) tasks where sentences can be understood and abstracted well in neural nets, we view a line segment as a sentence that contains points (words). By attending to well-describable points on a line dynamically, our descriptor performs excellently on variable line length. We also propose line signature networks sharing the line's geometric attributes to neighborhoods. Performing as group descriptors, the networks enhance line descriptors by understanding lines' relative geometries. Finally, we present the proposed line descriptor and matching in a Point and Line Localization (PL-Loc). We show that the visual localization with feature points can be improved using our line features. We validate the proposed method for homography estimation and visual localization.

【5】 PIP: Physical Interaction Prediction via Mental Imagery with Span Selection
Link: https://arxiv.org/abs/2109.04683

Authors: Jiafei Duan, Samson Yu, Soujanya Poria, Bihan Wen, Cheston Tan
Affiliations: Institute for Infocomm Research, A*STAR; Singapore University of Technology and Design; Nanyang Technological University of Singapore
Abstract: To align advanced artificial intelligence (AI) with human values and promote safe AI, it is important for AI to predict the outcome of physical interactions. Even with the ongoing debates on how humans predict the outcomes of physical interactions among objects in the real world, there are works attempting to tackle this task via cognitive-inspired AI approaches. However, there is still a lack of AI approaches that mimic the mental imagery humans use to predict physical interactions in the real world. In this work, we propose a novel PIP scheme: Physical Interaction Prediction via Mental Imagery with Span Selection. PIP utilizes a deep generative model to output future frames of physical interactions among objects before extracting crucial information for predicting physical interactions by focusing on salient frames using span selection. To evaluate our model, we propose a large-scale SPACE+ dataset of synthetic video frames, including three physical interaction events in a 3D environment. Our experiments show that PIP outperforms baselines and human performance in physical interaction prediction for both seen and unseen objects. Furthermore, PIP's span selection scheme can effectively identify the frames where physical interactions among objects occur within the generated frames, allowing for added interpretability.

【6】 Per Garment Capture and Synthesis for Real-time Virtual Try-on
Link: https://arxiv.org/abs/2109.04654

Authors: Toby Chong, I-Chao Shen, Nobuyuki Umetani, Takeo Igarashi
Affiliations: The University of Tokyo
Note: Accepted to UIST2021. Project page: this https URL
Abstract: Virtual try-on is a promising application of computer graphics and human computer interaction that can have a profound real-world impact especially during this pandemic. Existing image-based works try to synthesize a try-on image from a single image of a target garment, but it inherently limits the ability to react to possible interactions. It is difficult to reproduce the change of wrinkles caused by pose and body size change, as well as pulling and stretching of the garment by hand. In this paper, we propose an alternative per garment capture and synthesis workflow to handle such rich interactions by training the model with many systematically captured images. Our workflow is composed of two parts: garment capturing and clothed person image synthesis. We designed an actuated mannequin and an efficient capturing process that collects the detailed deformations of the target garments under diverse body sizes and poses. Furthermore, we proposed to use a custom-designed measurement garment, and we captured paired images of the measurement garment and the target garments. We then learn a mapping between the measurement garment and the target garments using deep image-to-image translation. The customer can then try on the target garments interactively during online shopping.

【7】 CrowdDriven: A New Challenging Dataset for Outdoor Visual Localization
Link: https://arxiv.org/abs/2109.04527

Authors: Ara Jafarzadeh, Manuel López Antequera, Pau Gargallo, Yubin Kuang, Carl Toft, Fredrik Kahl, Torsten Sattler
Affiliations: Chalmers University of Technology; Facebook; Czech Technical University in Prague
Abstract: Visual localization is the problem of estimating the position and orientation from which a given image (or a sequence of images) is taken in a known scene. It is an important part of a wide range of computer vision and robotics applications, from self-driving cars to augmented/virtual reality systems. Visual localization techniques should work reliably and robustly under a wide range of conditions, including seasonal, weather, illumination and man-made changes. Recent benchmarking efforts model this by providing images under different conditions, and the community has made rapid progress on these datasets since their inception. However, they are limited to a few geographical regions and often recorded with a single device. We propose a new benchmark for visual localization in outdoor scenes, using crowd-sourced data to cover a wide range of geographical regions and camera devices with a focus on the failure cases of current algorithms. Experiments with state-of-the-art localization approaches show that our dataset is very challenging, with all evaluated methods failing on its hardest parts. As part of the dataset release, we provide the tooling used to generate it, enabling efficient and effective 2D correspondence annotation to obtain reference poses.

【8】 Resolving gas bubbles ascending in liquid metal from low-SNR neutron radiography images
Link: https://arxiv.org/abs/2109.04883

Authors: Mihails Birjukovs, Pavel Trtik, Anders Kaestner, Jan Hovind, Martins Klevs, Knud Thomsen, Andris Jakovics
Affiliations: Institute of Numerical Modelling, University of Latvia, Riga, Latvia; Research with Neutrons and Muons, Paul Scherrer Institut, Villigen, Switzerland
Abstract: We demonstrate a new image processing methodology for resolving gas bubbles travelling through liquid metal from dynamic neutron radiography images with intrinsically low signal-to-noise ratio. Image pre-processing, denoising and bubble segmentation are described in detail, with practical recommendations. Experimental validation is presented - stationary and moving reference bodies with neutron-transparent cavities are radiographed with imaging conditions similar to the cases with bubbles in liquid metal. The new methods are applied to our experimental data from previous and recent imaging campaigns, and the performance of the methods proposed in this paper is compared against our previously developed methods. Significant improvements are observed as well as the capacity to reliably extract physically meaningful information from measurements performed under highly adverse imaging conditions. The showcased image processing solution and separate elements thereof are readily extendable beyond the present application, and have been made open-source.

【9】 ReconfigISP: Reconfigurable Camera Image Processing Pipeline
Link: https://arxiv.org/abs/2109.04760

Authors: Ke Yu, Zexian Li, Yue Peng, Chen Change Loy, Jinwei Gu
Affiliations: SenseTime Research and Tetras.AI; CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong; Beihang University; S-Lab, Nanyang Technological University; Shanghai AI Laboratory
Note: ICCV 2021
Abstract: Image Signal Processor (ISP) is a crucial component in digital cameras that transforms sensor signals into images for us to perceive and understand. Existing ISP designs always adopt a fixed architecture, e.g., several sequential modules connected in a rigid order. Such a fixed ISP architecture may be suboptimal for real-world applications, where camera sensors, scenes and tasks are diverse. In this study, we propose a novel Reconfigurable ISP (ReconfigISP) whose architecture and parameters can be automatically tailored to specific data and tasks. In particular, we implement several ISP modules, and enable backpropagation for each module by training a differentiable proxy, hence allowing us to leverage the popular differentiable neural architecture search and effectively search for the optimal ISP architecture. A proxy tuning mechanism is adopted to maintain the accuracy of proxy networks in all cases. Extensive experiments conducted on image restoration and object detection, with different sensors, light conditions and efficiency constraints, validate the effectiveness of ReconfigISP. Only hundreds of parameters need tuning for every task.
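The differentiable-search idea can be sketched with a DARTS-style mixed operation: each pipeline stage mixes several candidate modules with softmax-weighted architecture scores, so both module weights and architecture scores receive gradients. The toy candidate modules below are placeholders; ReconfigISP additionally trains differentiable proxies for non-differentiable ISP blocks, which is not shown here.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """One pipeline stage: a softmax-weighted mixture of candidate modules."""
    def __init__(self, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))  # architecture scores

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# three toy candidate "ISP modules" per stage (identity, 3x3 conv, 5x5 conv)
def candidates():
    return [nn.Identity(),
            nn.Conv2d(3, 3, 3, padding=1),
            nn.Conv2d(3, 3, 5, padding=2)]

pipeline = nn.Sequential(MixedOp(candidates()), MixedOp(candidates()))
x = torch.rand(1, 3, 64, 64)           # stand-in for a demosaicked raw image
out = pipeline(x)
out.mean().backward()                  # gradients flow to both weights and alphas
print(out.shape, pipeline[0].alpha.grad)
```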
