
Computer Vision Academic Digest [7.21]

Author: arXiv每日学术速递 (WeChat official account)
Published: 2021-07-27 11:09:23
This article is included in the column: arXiv每日学术速递

Visit www.arxivdaily.com for digests with abstracts, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, favorites, posting, and other features!

cs.CV: 45 papers in total today

Transformer (1 paper)

【1】 Generative Video Transformer: Can Objects be the Words?

Authors: Yi-Fu Wu, Jaesik Yoon, Sungjin Ahn
Affiliations: Department of Computer Science, Rutgers University; SAP Labs; Rutgers Center for Cognitive Science
Comments: Published in ICML 2021
Link: https://arxiv.org/abs/2107.09240
Abstract: Transformers have been successful for many natural language processing tasks. However, applying transformers to the video domain for tasks such as long-term video generation and scene understanding has remained elusive due to the high computational complexity and the lack of natural tokenization. In this paper, we propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer. By factoring the video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple interacting objects in a scene and generate future frames of the video. Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU. We compare our model with previous RNN-based approaches as well as other possible video transformer baselines. We demonstrate OCVT performs well when compared to baselines in generating future frames. OCVT also develops useful representations for video reasoning, achieving state-of-the-art performance on the CATER task.

Detection (6 papers)

【1】 QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Authors: Jie Lei, Tamara L. Berg, Mohit Bansal
Affiliations: Department of Computer Science, University of North Carolina at Chapel Hill
Comments: 17 pages, 11 figures, 5 tables
Link: https://arxiv.org/abs/2107.09609
Abstract: Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr

【2】 Image-Hashing-Based Anomaly Detection for Privacy-Preserving Online Proctoring

Authors: Waheeb Yaqub, Manoranjan Mohanty, Basem Suleiman
Affiliations: CHAI Lab, School of Computer Science, The University of Sydney, Sydney, Australia; Center for Forensic Science, University of Technology Sydney
Link: https://arxiv.org/abs/2107.09373
Abstract: Online proctoring has become a necessity in online teaching. Video-based crowd-sourced online proctoring solutions are being used, where an exam-taking student's video is monitored by third parties, leading to privacy concerns. In this paper, we propose a privacy-preserving online proctoring system. The proposed image-hashing-based system can detect the student's excessive face and body movement (i.e., anomalies) that results when the student tries to cheat in the exam. The detection can be done even if the student's face is blurred or masked in video frames. Experiments with an in-house dataset show the usability of the proposed system.
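
To make the hashing idea concrete, below is a minimal Python sketch (not the paper's actual pipeline): a toy average hash is computed per grayscale frame, and a frame is flagged when the Hamming distance to the previous frame's hash exceeds a threshold. The hash size, the 12-bit threshold, and the frame format are all illustrative assumptions.

```python
import numpy as np

def average_hash(frame: np.ndarray, size: int = 8) -> np.ndarray:
    """Downsample a grayscale frame and threshold at its mean (a toy perceptual hash)."""
    h, w = frame.shape
    ys = np.arange(size) * h // size  # nearest-neighbour row indices
    xs = np.arange(size) * w // size  # nearest-neighbour column indices
    small = frame[np.ix_(ys, xs)].astype(np.float32)
    return (small > small.mean()).flatten()  # 64-bit boolean hash

def movement_anomalies(frames, threshold: int = 12):
    """Flag frames whose hash differs from the previous frame's by > threshold bits."""
    flags, prev = [], None
    for frame in frames:
        cur = average_hash(frame)
        if prev is not None:
            dist = int(np.count_nonzero(cur != prev))  # Hamming distance
            flags.append(dist > threshold)
        prev = cur
    return flags
```

Because the hash is computed on heavily downsampled content, the same comparison works even when faces are blurred or masked, which is the privacy-preserving property the paper exploits.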

【3】 Cell Detection from Imperfect Annotation by Pseudo Label Selection Using P-classification

Authors: Kazuma Fujii, Suehiro Daiki, Nishimura Kazuya, Bise Ryoma
Affiliations: Kyushu University, Fukuoka, Japan; RIKEN AIP, Tokyo, Japan
Comments: 10 pages, 3 figures, accepted at MICCAI 2021
Link: https://arxiv.org/abs/2107.09289
Abstract: Cell detection is an essential task in cell image analysis. Recent deep learning-based detection methods have achieved very promising results. In general, these methods require exhaustively annotating the cells in an entire image. If some of the cells are not annotated (imperfect annotation), the detection performance significantly degrades due to noisy labels. This often occurs in real collaborations with biologists and even in public datasets. Our proposed method takes a pseudo labeling approach for cell detection from imperfectly annotated data. A detection convolutional neural network (CNN) trained using such data with missing labels often produces over-detection. We treat partially labeled cells as positive samples and the detected positions except for the labeled cells as unlabeled samples. Then we select reliable pseudo labels from the unlabeled data using recent machine learning techniques: positive-and-unlabeled (PU) learning and P-classification. Experiments using microscopy images for five different conditions demonstrate the effectiveness of the proposed method.

【4】 S2Looking: A Satellite Side-Looking Dataset for Building Change Detection

Authors: Li Shen, Yao Lu, Hao Chen, Hao Wei, Donghai Xie, Jiabao Yue, Rui Chen, Yue Zhang, Ao Zhang, Shouye Lv, Bitao Jiang
Affiliations: Beijing Institute of Remote Sensing, Beijing, China; Beihang University; Tianjin University; Capital Normal University
Link: https://arxiv.org/abs/2107.09244
Abstract: Collecting large-scale annotated satellite imagery datasets is essential for deep-learning-based global building change surveillance. In particular, the scroll imaging mode of optical satellites enables larger observation ranges and shorter revisit periods, facilitating efficient global surveillance. However, the images in recent satellite change detection datasets are mainly captured at near-nadir viewing angles. In this paper, we introduce S2Looking, a building change detection dataset that contains large-scale side-looking satellite images captured at varying off-nadir angles. Our S2Looking dataset consists of 5000 registered bitemporal image pairs (size of 1024*1024, 0.5 ~ 0.8 m/pixel) of rural areas throughout the world and more than 65,920 annotated change instances. We provide two label maps to separately indicate the newly built and demolished building regions for each sample in the dataset. We establish a benchmark task based on this dataset, i.e., identifying the pixel-level building changes in the bi-temporal images. We test several state-of-the-art methods on both the S2Looking dataset and the (near-nadir) LEVIR-CD+ dataset. The experimental results show that recent change detection methods exhibit much poorer performance on S2Looking than on LEVIR-CD+. The proposed S2Looking dataset presents three main challenges: 1) large viewing angle changes, 2) large illumination variances and 3) various complex scene characteristics encountered in rural areas. Our proposed dataset may promote the development of algorithms for satellite image change detection and registration under conditions of large off-nadir angles. The dataset is available at https://github.com/AnonymousForACMMM/.

【5】 A Comparison of Supervised and Unsupervised Deep Learning Methods for Anomaly Detection in Images

Authors: Vincent Wilmet, Sauraj Verma, Tabea Redl, Håkon Sandaker, Zhenning Li
Affiliations: CentraleSupélec, Gif-sur-Yvette, France
Comments: 8 pages, for FML
Link: https://arxiv.org/abs/2107.09204
Abstract: Anomaly detection in images plays a significant role for many applications across all industries, such as disease diagnosis in healthcare or quality assurance in manufacturing. Manual inspection of images, when extended over a monotonously repetitive period of time, is very time-consuming and can lead to anomalies being overlooked. Artificial neural networks have proven themselves very successful on simple, repetitive tasks, in some cases even outperforming humans. Therefore, in this paper we investigate different methods of deep learning, including supervised and unsupervised learning, for anomaly detection applied to a quality assurance use case. We utilize the MVTec anomaly dataset and develop three different models (a CNN for supervised anomaly detection, KD-CAE for autoencoder anomaly detection, and NI-CAE for noise-induced anomaly detection), plus a DCGAN for generating reconstructed images. Through experiments, we found that KD-CAE performs better on the anomaly datasets compared to CNN and NI-CAE, with NI-CAE performing the best on the Transistor dataset. We also implemented a DCGAN for the creation of new training data, but due to computational limitations and the difficulty of extrapolating the mechanics of AnoGAN, we restricted ourselves to the generation of GAN-based images. We conclude that unsupervised methods are more powerful for anomaly detection in images, especially in a setting where only a small amount of anomalous data is available, or the data is unlabeled.
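
As a rough illustration of the autoencoder route described above, the sketch below scores anomalies by per-image reconstruction error. The tiny architecture is a placeholder standing in for KD-CAE/NI-CAE, not the paper's actual models.

```python
import torch
import torch.nn as nn

class TinyCAE(nn.Module):
    """A toy convolutional autoencoder; trained on normal images only."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

@torch.no_grad()
def anomaly_score(model: TinyCAE, images: torch.Tensor) -> torch.Tensor:
    """Per-image mean squared reconstruction error; high values suggest anomalies,
    since the autoencoder only learned to reconstruct normal samples."""
    recon = model(images)
    return ((images - recon) ** 2).mean(dim=(1, 2, 3))
```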

【6】 Confidence Aware Neural Networks for Skin Cancer Detection

Authors: Donya Khaledyan, AmirReza Tajally, Reza Sarkhosh, Afshar Shamsi, Hamzeh Asgharnezhad, Abbas Khosravi, Saeid Nahavandi
Affiliations: Beheshti University, Tehran, Iran; Department of Industrial Engineering, University of Tehran, Tehran, Iran; Department of Electrical and Computer Engineering, Isfahan University of Technology (IUT), Isfahan, Iran; Individual Researcher, Tehran, Iran
Comments: 21 pages, 7 figures, 2 tables
Link: https://arxiv.org/abs/2107.09118
Abstract: Deep learning (DL) models have received particular attention in medical imaging due to their promising pattern recognition capabilities. However, Deep Neural Networks (DNNs) require a huge amount of data, and because of the lack of sufficient data in this field, transfer learning can be a great solution. DNNs used for disease diagnosis meticulously concentrate on improving the accuracy of predictions without providing a figure about their confidence of predictions. Knowing how much a DNN model is confident in a computer-aided diagnosis model is necessary for gaining clinicians' confidence and trust in DL-based solutions. To address this issue, this work presents three different methods for quantifying uncertainties for skin cancer detection from images. It also comprehensively evaluates and compares the performance of these DNNs using novel uncertainty-related metrics. The obtained results reveal that the predictive uncertainty estimation methods are capable of flagging risky and erroneous predictions with a high uncertainty estimate. We also demonstrate that ensemble approaches are more reliable in capturing uncertainties through inference.
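
One of the simplest ways to obtain the kind of uncertainty estimate discussed above is predictive entropy over an ensemble. The sketch below assumes M independently trained classifiers whose logits have been stacked; the choice of flagging threshold is left to the user and is not taken from the paper.

```python
import torch

@torch.no_grad()
def ensemble_uncertainty(logits_per_member: torch.Tensor):
    """logits_per_member: (M, B, C) logits from M ensemble members.
    Returns the mean prediction and the predictive entropy per sample."""
    probs = torch.softmax(logits_per_member, dim=-1)   # (M, B, C)
    mean_probs = probs.mean(dim=0)                     # (B, C) ensemble prediction
    entropy = -(mean_probs * torch.log(mean_probs + 1e-12)).sum(dim=-1)  # (B,)
    return mean_probs, entropy

# Predictions whose entropy exceeds a chosen threshold can be flagged for clinician review.
```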

Classification | Recognition (8 papers)

【1】 Saliency for free: Saliency prediction as a side-effect of object recognition

Authors: Carola Figueroa-Flores, David Berga, Joost van der Weijer, Bogdan Raducanu
Affiliations: Computer Vision Center, Edifici "O", Campus UAB, Bellaterra (Barcelona), Spain; Department of Computer Science and Information Technology, Universidad del Bío-Bío, Chile
Link: https://arxiv.org/abs/2107.09628
Abstract: Saliency is the perceptual capacity of our visual system to focus our attention (i.e. gaze) on relevant objects. Neural networks for saliency estimation require ground truth saliency maps for training, which are usually obtained via eye-tracking experiments. In the current paper, we demonstrate that saliency maps can be generated as a side-effect of training an object recognition deep neural network that is endowed with a saliency branch. Such a network does not require any ground-truth saliency maps for training. Extensive experiments carried out on both real and synthetic saliency datasets demonstrate that our approach is able to generate accurate saliency maps, achieving competitive results on both synthetic and real datasets when compared to methods that do require ground truth data.

【2】 SynthTIGER: Synthetic Text Image GEneratoR Towards Better Text Recognition Models

Authors: Moonbin Yim, Yoonsik Kim, Han-Cheol Cho, Sungrae Park
Affiliations: CLOVA AI Research, NAVER Corporation; Upstage AI Research
Comments: Accepted at ICDAR 2021, 16 pages, 6 figures
Link: https://arxiv.org/abs/2107.09313
Abstract: For successful scene text recognition (STR) models, synthetic text image generators have alleviated the lack of annotated text images from the real world. Specifically, they generate multiple text images with diverse backgrounds, font styles, and text shapes and enable STR models to learn visual patterns that might not be accessible from manually annotated data. In this paper, we introduce a new synthetic text image generator, SynthTIGER, by analyzing techniques used for text image synthesis and integrating effective ones under a single algorithm. Moreover, we propose two techniques that alleviate the long-tail problem in length and character distributions of training data. In our experiments, SynthTIGER achieves better STR performance than the combination of synthetic datasets, MJSynth (MJ) and SynthText (ST). Our ablation study demonstrates the benefits of using sub-components of SynthTIGER and the guideline on generating synthetic text images for STR models. Our implementation is publicly available at https://github.com/clovaai/synthtiger.

【3】 Locality-aware Channel-wise Dropout for Occluded Face Recognition

Authors: Mingjie He, Jie Zhang, Shiguang Shan, Xiao Liu, Zhongqin Wu, Xilin Chen
Affiliations: Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology; University of Chinese Academy of Sciences
Link: https://arxiv.org/abs/2107.09270
Abstract: Face recognition remains a challenging task in unconstrained scenarios, especially when faces are partially occluded. To improve the robustness against occlusion, augmenting the training images with artificial occlusions has proved to be a useful approach. However, these artificial occlusions are commonly generated by adding a black rectangle or several object templates including sunglasses, scarves and phones, which cannot simulate realistic occlusions well. In this paper, based on the argument that an occlusion essentially damages a group of neurons, we propose a novel and elegant occlusion-simulation method that drops the activations of a group of neurons in an elaborately selected channel. Specifically, we first employ a spatial regularization to encourage each feature channel to respond to local and different face regions. In this way, the activations affected by an occlusion in a local region are more likely to be located in a single feature channel. Then, the locality-aware channel-wise dropout (LCD) is designed to simulate the occlusion by dropping out the entire feature channel. Furthermore, by randomly dropping out several feature channels, our method can simulate occlusions of larger areas. The proposed LCD can encourage its succeeding layers to minimize the intra-class feature variance caused by occlusions, thus leading to improved robustness against occlusion. In addition, we design an auxiliary spatial attention module by learning a channel-wise attention vector to reweight the feature channels, which improves the contributions of non-occluded regions. Extensive experiments on various benchmarks show that the proposed method outperforms state-of-the-art methods with a remarkable improvement.
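
A minimal sketch of the core channel-wise dropout operation: entire feature channels are zeroed to mimic occlusion damage. Uniform random channel selection is a simplifying assumption here; the paper's LCD selects channels in a locality-aware manner after spatial regularization.

```python
import torch

def channelwise_dropout(features: torch.Tensor, drop_ratio: float = 0.1) -> torch.Tensor:
    """Zero out a random subset of entire feature channels to simulate occlusion.
    features: (B, C, H, W). Channel selection here is uniform; the paper's LCD
    instead picks channels in a locality-aware way after spatial regularisation."""
    b, c, _, _ = features.shape
    keep = (torch.rand(b, c, 1, 1, device=features.device) >= drop_ratio).float()
    return features * keep / (1.0 - drop_ratio)  # rescale as in standard dropout
```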

【4】 Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision

Authors: Yifan Zhang, Bryan Hooi, Lanqing Hong, Jiashi Feng
Affiliations: National University of Singapore; Huawei Noah's Ark Lab
Link: https://arxiv.org/abs/2107.09249
Abstract: Existing long-tailed recognition methods, aiming to train class-balanced models from long-tailed data, generally assume the models would be evaluated on a uniform test class distribution. However, the practical test class distribution often violates such an assumption (e.g., being long-tailed or even inversely long-tailed), which would lead existing methods to fail in real-world applications. In this work, we study a more practical task setting, called test-agnostic long-tailed recognition, where the training class distribution is long-tailed while the test class distribution is unknown and can be skewed arbitrarily. In addition to the issue of class imbalance, this task poses another challenge: the class distribution shift between the training and test samples is unidentified. To address this task, we propose a new method, called Test-time Aggregating Diverse Experts (TADE), that presents two solution strategies: (1) a novel skill-diverse expert learning strategy that trains diverse experts to excel at handling different test distributions from a single long-tailed training distribution; (2) a novel test-time expert aggregation strategy that leverages self-supervision to aggregate multiple experts for handling various test distributions. Moreover, we theoretically show that our method has a provable ability to simulate unknown test class distributions. Promising results on both vanilla and test-agnostic long-tailed recognition verify the effectiveness of TADE. Code is available at https://github.com/Vanint/TADE-AgnosticLT.
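
A heavily simplified sketch of the aggregation step: expert logits are combined with non-negative weights. In TADE the weights are learned at test time via self-supervision; here they are simply given as inputs, which is an assumption made purely for illustration.

```python
import torch

@torch.no_grad()
def aggregate_experts(expert_logits: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """expert_logits: (E, B, C) from E skill-diverse experts; weights: (E,) non-negative
    aggregation weights. TADE learns these weights at test time via self-supervision;
    this sketch just applies them."""
    weights = weights / weights.sum()  # normalise to a convex combination
    return (weights.view(-1, 1, 1) * expert_logits).sum(dim=0)  # (B, C) fused logits
```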

【5】 Boosting few-shot classification with view-learnable contrastive learning

Authors: Xu Luo, Yuxuan Chen, Liangjian Wen, Lili Pan, Zenglin Xu
Affiliations: University of Electronic Science and Technology of China, Chengdu, China; Harbin Institute of Technology Shenzhen, Shenzhen, China; Pengcheng Lab, Shenzhen, China
Comments: 8 pages, 4 figures
Link: https://arxiv.org/abs/2107.09242
Abstract: The goal of few-shot classification is to classify new categories with few labeled examples within each class. Metric-based meta-learning methods currently show excellent performance in handling few-shot classification problems. However, it is very hard for previous methods to discriminate the fine-grained sub-categories in the embedding space without fine-grained labels. This may lead to unsatisfactory generalization to fine-grained sub-categories, and thus affects model interpretation. To tackle this problem, we introduce the contrastive loss into few-shot classification for learning latent fine-grained structure in the embedding space. Furthermore, to overcome the drawbacks of random image transformation used in current contrastive learning in producing noisy and inaccurate image pairs (i.e., views), we develop a learning-to-learn algorithm to automatically generate different views of the same image. Extensive experiments on standard few-shot learning benchmarks demonstrate the superiority of our method.

【6】 Understanding Gender and Racial Disparities in Image Recognition Models

Authors: Rohan Mahadev, Anindya Chakravarti
Affiliations: Department of Computer Science, New York University, New York, NY
Link: https://arxiv.org/abs/2107.09211
Abstract: Large-scale image classification models trained on top of popular datasets such as Imagenet have been shown to have a distributional skew which leads to disparities in prediction accuracies across different subsections of population demographics. Many approaches have been proposed to address this distributional skew using methods that alter the model before, during, and after training. We investigate one such approach, which uses a multi-label softmax loss with cross-entropy as the loss function instead of binary cross-entropy on a multi-label classification problem on the Inclusive Images dataset, a subset of the OpenImages V6 dataset. We use the MR2 dataset, which contains images of people with self-identified gender and race attributes, to evaluate the fairness in the model outcomes, and try to interpret the mistakes by looking at model activations and suggest possible fixes.
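
One plausible reading of the loss described above, sketched below: the multi-hot label vector is normalised into a target distribution and matched against a softmax over the logits, instead of applying per-class binary cross-entropy. The exact formulation used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def multilabel_softmax_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between a softmax over the logits and a target distribution
    obtained by normalising the multi-hot label vector.
    logits: (B, C); targets: (B, C) multi-hot, each row with at least one positive."""
    target_dist = targets / targets.sum(dim=1, keepdim=True)  # labels as a distribution
    log_probs = F.log_softmax(logits, dim=1)
    return -(target_dist * log_probs).sum(dim=1).mean()
```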

【7】 Examining the Human Perceptibility of Black-Box Adversarial Attacks on Face Recognition

Authors: Benjamin Spetter-Goldstein, Nataniel Ruiz, Sarah Adel Bargal
Affiliations: Department of Computer Science, Boston University
Comments: 5 pages, 5 figures, submitted to AdvML @ ICML 2021
Link: https://arxiv.org/abs/2107.09126
Abstract: The modern open internet contains billions of public images of human faces across the web, especially on social media websites used by half the world's population. In this context, Face Recognition (FR) systems have the potential to match faces to specific names and identities, creating glaring privacy concerns. Adversarial attacks are a promising way to grant users privacy from FR systems by disrupting their capability to recognize faces. Yet, such attacks can be perceptible to human observers, especially under the more challenging black-box threat model. In the literature, the justification for the imperceptibility of such attacks hinges on bounding metrics such as $\ell_p$ norms. However, there is not much research on how these norms match up with human perception. Through examining and measuring both the effectiveness of recent black-box attacks in the face recognition setting and their corresponding human perceptibility through survey data, we demonstrate the trade-offs in perceptibility that occur as attacks become more aggressive. We also show how the $\ell_2$ norm and other metrics do not correlate with human perceptibility in a linear fashion, thus making these norms suboptimal at measuring adversarial attack perceptibility.
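
For reference, the bounding metrics in question are straightforward to compute; the sketch below evaluates the $\ell_2$ and $\ell_\infty$ norms of a perturbation, the quantities the paper argues correlate only weakly with human perceptibility.

```python
import torch

def perturbation_norms(x_adv: torch.Tensor, x: torch.Tensor):
    """Common bounding metrics for an adversarial perturbation delta = x_adv - x.
    x_adv, x: (B, C, H, W) image batches."""
    delta = (x_adv - x).flatten(start_dim=1)
    l2 = delta.norm(p=2, dim=1)            # Euclidean magnitude per image
    linf = delta.abs().max(dim=1).values   # worst-case per-pixel change per image
    return l2, linf
```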

【8】 DeepSMILE: Self-supervised heterogeneity-aware multiple instance learning for DNA damage response defect classification directly from H&E whole-slide images

Authors: Yoni Schirris, Efstratios Gavves, Iris Nederlof, Hugo Mark Horlings, Jonas Teuwen
Affiliations: Netherlands Cancer Institute, Amsterdam, the Netherlands; University of Amsterdam, Amsterdam, the Netherlands
Comments: 16 pages, 5 figures, 2 tables
Link: https://arxiv.org/abs/2107.09405
Abstract: We propose a deep learning-based weak label learning method for analysing whole slide images (WSIs) of Hematoxylin and Eosin (H&E) stained tumor cells not requiring pixel-level or tile-level annotations, using Self-supervised pre-training and heterogeneity-aware deep Multiple Instance LEarning (DeepSMILE). We apply DeepSMILE to the task of Homologous recombination deficiency (HRD) and microsatellite instability (MSI) prediction. We utilize contrastive self-supervised learning to pre-train a feature extractor on histopathology tiles of cancer tissue. Additionally, we use variability-aware deep multiple instance learning to learn the tile feature aggregation function while modeling tumor heterogeneity. Compared to state-of-the-art genomic label classification methods, DeepSMILE improves classification performance for HRD from $70.43\pm4.10\%$ to $83.79\pm1.25\%$ AUC and MSI from $78.56\pm6.24\%$ to $90.32\pm3.58\%$ AUC in a multi-center breast and colorectal cancer dataset, respectively. These improvements suggest we can improve genomic label classification performance without collecting larger datasets. In the future, this may reduce the need for expensive genome sequencing techniques, provide personalized therapy recommendations based on widely available WSIs of cancer tissue, and improve patient care with quicker treatment decisions - also in medical centers without access to genome sequencing resources.

Segmentation | Semantics (7 papers)

【1】 DSP: Dual Soft-Paste for Unsupervised Domain Adaptive Semantic Segmentation

Authors: Li Gao, Jing Zhang, Lefei Zhang, Dacheng Tao
Affiliations: Wuhan University, Wuhan, China; The University of Sydney, Sydney, Australia
Comments: Accepted by ACM MM 2021
Link: https://arxiv.org/abs/2107.09600
Abstract: Unsupervised domain adaptation (UDA) for semantic segmentation aims to adapt a segmentation model trained on the labeled source domain to the unlabeled target domain. Existing methods try to learn domain invariant features while suffering from large domain gaps that make it difficult to correctly align discrepant features, especially in the initial training phase. To address this issue, we propose a novel Dual Soft-Paste (DSP) method in this paper. Specifically, DSP selects some classes from a source domain image using a long-tail class first sampling strategy and softly pastes the corresponding image patch on both the source and target training images with a fusion weight. Technically, we adopt the mean teacher framework for domain adaptation, where the pasted source and target images go through the student network while the original target image goes through the teacher network. Output-level alignment is carried out by aligning the probability maps of the target fused image from both networks using a weighted cross-entropy loss. In addition, feature-level alignment is carried out by aligning the feature maps of the source and target images from the student network using a weighted maximum mean discrepancy loss. DSP facilitates the model learning domain-invariant features from the intermediate domains, leading to faster convergence and better performance. Experiments on two challenging benchmarks demonstrate the superiority of DSP over state-of-the-art methods. Code is available at https://github.com/GaoLii/DSP.
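
A minimal sketch of the soft-paste operation itself: a source-domain patch is alpha-blended into an image with a fusion weight. The long-tail-first class sampling, label mixing, and mean-teacher training are omitted, and the coordinates and alpha value here are assumptions.

```python
import torch

def soft_paste(target_img: torch.Tensor, source_patch: torch.Tensor,
               top: int, left: int, alpha: float = 0.5) -> torch.Tensor:
    """Alpha-blend a source-domain patch onto an image at (top, left).
    target_img: (C, H, W); source_patch: (C, h, w); alpha is the fusion weight.
    DSP applies this to both source and target training images."""
    out = target_img.clone()
    _, h, w = source_patch.shape
    region = out[:, top:top + h, left:left + w]
    out[:, top:top + h, left:left + w] = alpha * source_patch + (1 - alpha) * region
    return out
```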

【2】 Critic Guided Segmentation of Rewarding Objects in First-Person Views

Authors: Andrew Melnik, Augustin Harter, Christian Limberg, Krishan Rana, Niko Suenderhauf, Helge Ritter
Affiliations: CITEC, Bielefeld University, Germany; Centre for Robotics, Queensland University of Technology (QUT), Brisbane, Australia
Link: https://arxiv.org/abs/2107.09540
Abstract: This work discusses a learning approach to mask rewarding objects in images using sparse reward signals from an imitation learning dataset. For that, we train an Hourglass network using only feedback from a critic model. The Hourglass network learns to produce a mask to decrease the critic's score of a high-score image and increase the critic's score of a low-score image by swapping the masked areas between these two images. We trained the model on an imitation learning dataset from the NeurIPS 2020 MineRL Competition Track, where our model learned to mask rewarding objects in a complex interactive 3D environment with a sparse reward signal. This approach was part of the first-place winning solution in this competition. Video demonstration and code: https://rebrand.ly/critic-guided-segmentation
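
The region-swapping step can be sketched in a few lines; training would then query the critic on both outputs. The mask is assumed to come from the Hourglass network as a soft map in [0, 1].

```python
import torch

def swap_masked_regions(high: torch.Tensor, low: torch.Tensor, mask: torch.Tensor):
    """Swap the masked area between a high-critic-score image and a low-score one.
    high, low: (C, H, W); mask: (1, H, W) soft mask in [0, 1] from the Hourglass
    network. Training pushes the critic's score down for `high_out` (reward object
    removed) and up for `low_out` (reward object inserted)."""
    high_out = high * (1 - mask) + low * mask
    low_out = low * (1 - mask) + high * mask
    return high_out, low_out
```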

【3】 Attention-Guided NIR Image Colorization via Adaptive Fusion of Semantic and Texture Clues

Authors: Xingxing Yang, Jie Chen, Zaifeng Yang, Zhenghua Chen
Affiliations: Department of Computer Science, Hong Kong Baptist University, Hong Kong; Department of Electronics and Photonics, Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
Link: https://arxiv.org/abs/2107.09237
Abstract: Near infrared (NIR) imaging has been widely applied in low-light imaging scenarios; however, it is difficult for human and algorithms to perceive the real scene in the colorless NIR domain. While Generative Adversarial Network (GAN) has been widely employed in various image colorization tasks, it is challenging for a direct mapping mechanism, such as a conventional GAN, to transform an image from the NIR to the RGB domain with correct semantic reasoning, well-preserved textures, and vivid color combinations concurrently. In this work, we propose a novel Attention-based NIR image colorization framework via Adaptive Fusion of Semantic and Texture clues, aiming at achieving these goals within the same framework. The tasks of texture transfer and semantic reasoning are carried out in two separate network blocks. Specifically, the Texture Transfer Block (TTB) aims at extracting texture features from the NIR image's Laplacian component and transferring them for subsequent color fusion. The Semantic Reasoning Block (SRB) extracts semantic clues and maps the NIR pixel values to the RGB domain. Finally, a Fusion Attention Block (FAB) is proposed to adaptively fuse the features from the two branches and generate an optimized colorization result. In order to enhance the network's learning capacity in semantic reasoning as well as mapping precision in texture transfer, we have proposed the Residual Coordinate Attention Block (RCAB), which incorporates coordinate attention into a residual learning framework, enabling the network to capture long-range dependencies along the channel direction while precise positional information is preserved along spatial directions. RCAB is also incorporated into FAB to facilitate accurate texture alignment during fusion. Both quantitative and qualitative evaluations show that the proposed method outperforms state-of-the-art NIR image colorization methods.

【4】 SynthSeg: Domain Randomisation for Segmentation of Brain MRI Scans of any Contrast and Resolution

Authors: Benjamin Billot, Douglas N. Greve, Oula Puonti, Axel Thielscher, Koen Van Leemput, Bruce Fischl, Adrian V. Dalca, Juan Eugenio Iglesias
Affiliations: Massachusetts General Hospital and Harvard Medical School; Copenhagen University Hospital Amager and Hvidovre, Denmark; Department of Health Technology, Technical University of Denmark; Computer Science and Artificial Intelligence Laboratory
Comments: 20 pages, 11 figures, currently under review
Link: https://arxiv.org/abs/2107.09559
Abstract: Despite advances in data augmentation and transfer learning, convolutional neural networks (CNNs) have difficulties generalising to unseen target domains. When applied to segmentation of brain MRI scans, CNNs are highly sensitive to changes in resolution and contrast: even within the same MR modality, decreases in performance can be observed across datasets. We introduce SynthSeg, the first segmentation CNN agnostic to brain MRI scans of any contrast and resolution. SynthSeg is trained with synthetic data sampled from a generative model inspired by Bayesian segmentation. Crucially, we adopt a domain randomisation strategy where we fully randomise the generation parameters to maximise the variability of the training data. Consequently, SynthSeg can segment preprocessed and unpreprocessed real scans of any target domain, without retraining or fine-tuning. Because SynthSeg only requires segmentations to be trained (no images), it can learn from label maps obtained automatically from existing datasets of different populations (e.g., with atrophy and lesions), thus achieving robustness to a wide range of morphological variability. We demonstrate SynthSeg on 5,500 scans of 6 modalities and 10 resolutions, where it exhibits unparalleled generalisation compared to supervised CNNs, test time adaptation, and Bayesian segmentation. The code and trained model are available at https://github.com/BBillot/SynthSeg.

【5】 Automated Segmentation and Volume Measurement of Intracranial Carotid Artery Calcification on Non-Contrast CT

Authors: Gerda Bortsova, Daniel Bos, Florian Dubost, Meike W. Vernooij, M. Kamran Ikram, Gijs van Tulder, Marleen de Bruijne
Affiliations: Biomedical Imaging Group Rotterdam, Department of Radiology and Nuclear Medicine, Erasmus MC, Rotterdam, The Netherlands; Department of Epidemiology, Erasmus MC, Rotterdam
Comments: Accepted for publication in Radiology: Artificial Intelligence (this https URL), which is published by the Radiological Society of North America (RSNA)
Link: https://arxiv.org/abs/2107.09442
Abstract: Purpose: To evaluate a fully-automated deep-learning-based method for assessment of intracranial carotid artery calcification (ICAC). Methods: Two observers manually delineated ICAC in non-contrast CT scans of 2,319 participants (mean age 69 (SD 7) years; 1154 women) of the Rotterdam Study, prospectively collected between 2003 and 2006. These data were used to retrospectively develop and validate a deep-learning-based method for automated ICAC delineation and volume measurement. To evaluate the method, we compared manual and automatic assessment (computed using ten-fold cross-validation) with respect to 1) the agreement with an independent observer's assessment (available in a random subset of 47 scans); 2) the accuracy in delineating ICAC as judged via blinded visual comparison by an expert; 3) the association with first stroke incidence from the scan date until 2012. All method performance metrics were computed using 10-fold cross-validation. Results: The automated delineation of ICAC reached sensitivity of 83.8% and positive predictive value (PPV) of 88%. The intraclass correlation between automatic and manual ICAC volume measures was 0.98 (95% CI: 0.97, 0.98; computed in the entire dataset). Measured between the assessments of independent observers, sensitivity was 73.9%, PPV was 89.5%, and intraclass correlation was 0.91 (95% CI: 0.84, 0.95; computed in the 47-scan subset). In the blinded visual comparisons, automatic delineations were more accurate than manual ones (p-value = 0.01). The association of ICAC volume with incident stroke was similarly strong for both automated (hazard ratio 1.38; 95% CI: 1.12, 1.75) and manually measured volumes (hazard ratio 1.48; 95% CI: 1.20, 1.87). Conclusions: The developed model was capable of automated segmentation and volume quantification of ICAC with accuracy comparable to human experts.
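
The volume measurement itself reduces to voxel counting once a binary segmentation is available, as the hypothetical helper below shows; the voxel spacing is assumed to come from the CT header.

```python
import numpy as np

def calcification_volume_mm3(mask: np.ndarray, spacing_mm: tuple) -> float:
    """Volume of a binary 3D segmentation: voxel count times voxel volume.
    mask: (D, H, W) binary array; spacing_mm: (dz, dy, dx) voxel spacing in mm."""
    voxel_volume = float(np.prod(spacing_mm))
    return float(mask.astype(bool).sum()) * voxel_volume
```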

【6】 Protecting Semantic Segmentation Models by Using Block-wise Image Encryption with Secret Key from Unauthorized Access

Authors: Hiroki Ito, MaungMaung AprilPyone, Hitoshi Kiya
Affiliations: Tokyo Metropolitan University, Japan
Comments: To appear in 2021 International Workshop on Smart Info-Media Systems in Asia (SISA 2021)
Link: https://arxiv.org/abs/2107.09362
Abstract: Since production-level trained deep neural networks (DNNs) are of a great business value, protecting such DNN models against copyright infringement and unauthorized access is in a rising demand. However, conventional model protection methods focused only on the image classification task, and these protection methods were never applied to semantic segmentation although it has an increasing number of applications. In this paper, we propose to protect semantic segmentation models from unauthorized access by utilizing block-wise transformation with a secret key for the first time. Protected models are trained by using transformed images. Experiment results show that the proposed protection method allows rightful users with the correct key to access the model at full capacity while degrading the performance for unauthorized users. However, protected models show a slight drop in segmentation performance compared to non-protected models.
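
As one example of a block-wise transformation with a secret key (the paper evaluates its own set of transformations), the sketch below permutes non-overlapping image blocks with a key-seeded RNG; only users applying the same key-derived permutation at inference time get full model accuracy.

```python
import numpy as np

def blockwise_shuffle(img: np.ndarray, key: int, block: int = 16) -> np.ndarray:
    """Permute non-overlapping image blocks with a key-seeded RNG.
    img: (H, W, C) with H and W divisible by `block`; `key` is the secret key."""
    h, w, c = img.shape
    gh, gw = h // block, w // block
    # Split into a (gh*gw, block, block, C) stack of blocks.
    blocks = img.reshape(gh, block, gw, block, c).swapaxes(1, 2)
    blocks = blocks.reshape(gh * gw, block, block, c)
    perm = np.random.default_rng(key).permutation(gh * gw)  # key-dependent order
    shuffled = blocks[perm]
    # Reassemble the shuffled blocks into an image.
    return shuffled.reshape(gh, gw, block, block, c).swapaxes(1, 2).reshape(h, w, c)
```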

【7】 Convolutional module for heart localization and segmentation in MRI

Authors: Daniel Lima, Catharine Graves, Marco Gutierrez, Bruno Brandoli, Jose Rodrigues-Jr
Affiliations: ICMC-USP, Institute of Mathematical and Computer Sciences, University of São Paulo; FMUSP, Heart Institute, University of São Paulo; Dalhousie University
Comments: Submitted to CMIG
Link: https://arxiv.org/abs/2107.09134
Abstract: Magnetic resonance imaging (MRI) is a widely known medical imaging technique used to assess the heart function. Deep learning (DL) models perform several tasks in cardiac MRI (CMR) images with good efficacy, such as segmentation, estimation, and detection of diseases. Many DL models based on convolutional neural networks (CNN) were improved by detecting regions-of-interest (ROI) either automatically or by hand. In this paper we describe Visual-Motion-Focus (VMF), a module that detects the heart motion in the 4D MRI sequence, and highlights ROIs by focusing a Radial Basis Function (RBF) on the estimated motion field. We experimented and evaluated VMF on three CMR datasets, observing that the proposed ROIs cover 99.7% of data labels (Recall score), improved the CNN segmentation (mean Dice score) by 1.7 (p < .001) after the ROI extraction, and improved the overall training speed by 2.5 times (+150%).
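
A rough sketch of the motion-focus idea under simplifying assumptions: motion is approximated by frame differencing (rather than the paper's estimation over the 4D sequence), and a Gaussian RBF is centred on the motion centroid to weight the region of interest. The sigma value is an arbitrary placeholder.

```python
import numpy as np

def motion_focus(prev_frame: np.ndarray, frame: np.ndarray, sigma: float = 30.0):
    """Approximate a motion field by frame differencing and centre a Gaussian RBF
    on the motion centroid. Frames are (H, W) grayscale arrays."""
    motion = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    total = motion.sum() + 1e-8
    ys, xs = np.indices(motion.shape)
    cy = (ys * motion).sum() / total  # motion centroid (row)
    cx = (xs * motion).sum() / total  # motion centroid (column)
    rbf = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return rbf  # multiply with the frame to highlight the moving (cardiac) region
```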

Zero/Few-Shot | Transfer | Domain Adaptation | Self-Adaptation (2 papers)

【1】 Self-Supervised Domain Adaptation for Diabetic Retinopathy Grading using Vessel Image Reconstruction

Authors: Duy M. H. Nguyen, Truong T. N. Mai, Ngoc T. T. Than, Alexander Prange, Daniel Sonntag
Affiliations: German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus, Saarbrücken, Germany; Department of Multimedia Engineering, Dongguk University, South Korea; Byers Eye Institute, Stanford University, United States
Link: https://arxiv.org/abs/2107.09372
Abstract: This paper investigates the problem of domain adaptation for diabetic retinopathy (DR) grading. We learn invariant target-domain features by defining a novel self-supervised task based on retinal vessel image reconstructions, inspired by medical domain knowledge. Then, a benchmark of current state-of-the-art unsupervised domain adaptation methods on the DR problem is provided. It can be shown that our approach outperforms existing domain adaptation strategies. Furthermore, when utilizing entire training data in the target domain, we are able to compete with several state-of-the-art approaches in final classification accuracy just by applying standard network architectures and using image-level labels.

【2】 Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Authors: Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, Xin Yu
Affiliations: Virtual Human Group, Netease Fuxi AI Lab, China; University of Technology Sydney
Link: https://arxiv.org/abs/2107.09293
Abstract: We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the appearance of a speaker during large head motions while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). In this way, the predicted head poses act as the low-frequency holistic movements of a talking head, thus allowing our latter network to focus on detailed facial movement generation. To depict the entire image motions arising from audio, we exploit a keypoint based dense motion field representation. Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image. As this keypoint based representation models the motions of facial regions, head, and backgrounds integrally, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network is employed to render photo-realistic talking-head videos from the estimated keypoint based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds and outperforms the state-of-the-art.

Semi-/Weakly-/Unsupervised | Active Learning | Uncertainty (1 paper)

【1】 ReSSL: Relational Self-Supervised Learning with Weak Augmentation

Authors: Mingkai Zheng, Shan You, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, Chang Xu
Affiliations: SenseTime Research; Department of Automation, Tsinghua University; Institute for Artificial Intelligence, Tsinghua University (THUAI); Beijing National Research Center for Information Science and Technology (BNRist); The Chinese University of Hong Kong
Link: https://arxiv.org/abs/2107.09282
Abstract: Self-supervised Learning (SSL), including the mainstream contrastive learning, has achieved great success in learning visual representations without data annotations. However, most methods mainly focus on instance-level information (i.e., different augmented images of the same instance should have the same feature or cluster into the same class), and there is a lack of attention to the relationships between different instances. In this paper, we introduce a novel SSL paradigm, which we term the relational self-supervised learning (ReSSL) framework, that learns representations by modeling the relationship between different instances. Specifically, our proposed method employs the sharpened distribution of pairwise similarities among different instances as the relation metric, which is thus utilized to match the feature embeddings of different augmentations. Moreover, to boost the performance, we argue that weak augmentations matter for representing a more reliable relation, and we leverage a momentum strategy for practical efficiency. Experimental results show that our proposed ReSSL significantly outperforms the previous state-of-the-art algorithms in terms of both performance and training efficiency. Code is available at https://github.com/KyleZheng1997/ReSSL.
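
The relation-matching loss can be sketched as follows: the weakly augmented view's similarity distribution over a memory bank, sharpened with a lower temperature, serves as the target for the strongly augmented view. This follows the published formulation, but the specific temperature values below are assumptions.

```python
import torch
import torch.nn.functional as F

def ressl_relation_loss(z_weak: torch.Tensor, z_strong: torch.Tensor,
                        memory: torch.Tensor, t_teacher: float = 0.04,
                        t_student: float = 0.1) -> torch.Tensor:
    """Match the similarity distribution of a strongly augmented view to the
    sharpened distribution of its weakly augmented counterpart.
    z_weak, z_strong: (B, D) L2-normalised embeddings; memory: (K, D) normalised
    memory-bank embeddings. The lower teacher temperature sharpens the target."""
    sim_weak = z_weak @ memory.t()      # (B, K) pairwise similarities (weak view)
    sim_strong = z_strong @ memory.t()  # (B, K) pairwise similarities (strong view)
    target = F.softmax(sim_weak / t_teacher, dim=1).detach()  # sharpened relation
    log_pred = F.log_softmax(sim_strong / t_student, dim=1)
    return -(target * log_pred).sum(dim=1).mean()  # cross-entropy between relations
```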

Temporal | Action Recognition | Pose | Video | Motion Estimation (2 papers)

【1】 Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos

Authors: Olga Zatsarynna, Yazan Abu Farha, Juergen Gall
Affiliations: University of Bonn, Germany
Comments: CVPR Precognition Workshop
Link: https://arxiv.org/abs/2107.09504
Abstract: Anticipating human actions is an important task that needs to be addressed for the development of reliable intelligent agents, such as self-driving cars or robot assistants. While the ability to make future predictions with high accuracy is crucial for designing the anticipation approaches, the speed at which the inference is performed is not less important. Methods that are accurate but not sufficiently fast would introduce a high latency into the decision process. This, in turn, increases the reaction time of the underlying system, which poses a problem for domains such as autonomous driving, where reaction time is crucial. In this work, we propose a simple and effective multi-modal architecture based on temporal convolutions. Our approach stacks a hierarchy of temporal convolutional layers and does not rely on recurrent layers to ensure a fast prediction. We further introduce a multi-modal fusion mechanism that captures the pairwise interactions between RGB, flow, and object modalities. Results on two large-scale datasets of egocentric videos, EPIC-Kitchens-55 and EPIC-Kitchens-100, show that our approach achieves comparable performance to the state-of-the-art approaches while being significantly faster.

【2】 FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos

Authors: Sanchita Ghose, John J. Prevost
Affiliations: Department of Electrical and Computer Engineering, The University of Texas at San Antonio
Comments: This article is under review in IEEE Transactions on Multimedia. It contains 12 pages, 6 figures, 4 tables
Link: https://arxiv.org/abs/2107.09262
Abstract: Deep-learning-based visual-to-sound generation systems particularly need to account for the synchronicity of visual and audio features over time. In this research, we introduce the novel task of guiding a class-conditioned generative adversarial network with the temporal visual information of a video input for visual-to-sound generation, adapting the synchronicity traits between the audio-visual modalities. Our proposed FoleyGAN model is capable of conditioning on action sequences of visual events to generate visually aligned, realistic soundtracks. We expand our previously proposed Automatic Foley dataset to train FoleyGAN and evaluate our synthesized sound through a human survey that shows noteworthy (on average 81%) audio-visual synchronicity performance. In statistical experiments, our approach also outperforms other baseline models on audio-visual datasets.

Medical (3 papers)

【1】 Towards Privacy-preserving Explanations in Medical Image Analysis

Authors: H. Montenegro, W. Silva, J. S. Cardoso
Affiliations: University of Porto
Comments: 7 pages, 5 figures, accepted at Workshop on Interpretable ML in Healthcare at ICML 2021
Link: https://arxiv.org/abs/2107.09652
Abstract: The use of Deep Learning in the medical field is hindered by the lack of interpretability. Case-based interpretability strategies can provide intuitive explanations for deep learning models' decisions, thus, enhancing trust. However, the resulting explanations threaten patient privacy, motivating the development of privacy-preserving methods compatible with the specifics of medical data. In this work, we analyze existing privacy-preserving methods and their respective capacity to anonymize medical data while preserving disease-related semantic features. We find that the PPRL-VGAN deep learning method was the best at preserving the disease-related semantic features while guaranteeing a high level of privacy among the compared state-of-the-art methods. Nevertheless, we emphasize the need to improve privacy-preserving methods for medical imaging, as we identified relevant drawbacks in all existing privacy-preserving approaches.

【2】 Medical Imaging with Deep Learning for COVID-19 Diagnosis: A Comprehensive Review

Authors: Subrato Bharati, Prajoy Podder, M. Rubaiyat Hossain Mondal, V. B. Surya Prasath
Affiliations: Institute of Information and Communication Technology, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh; Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, USA
Link: https://arxiv.org/abs/2107.09602
Abstract: The outbreak of novel coronavirus disease (COVID-19) has claimed millions of lives and has affected all aspects of human life. This paper focuses on the application of deep learning (DL) models to medical imaging and drug discovery for managing COVID-19 disease. In this article, we detail various medical-imaging-based studies, such as X-rays and computed tomography (CT) images, along with DL methods for classifying COVID-19-affected cases versus pneumonia. The applications of DL techniques to medical images are further described in terms of image localization, segmentation, registration, and classification leading to COVID-19 detection. The reviews of recent papers indicate that the highest classification accuracy of 99.80% is obtained when the InstaCovNet-19 DL method is applied to an X-ray dataset of 361 COVID-19 patients, 362 pneumonia patients and 365 normal people. Furthermore, the best classification accuracy of 99.054% can be achieved when the EDL_COVID DL method is applied to a CT image dataset of 7500 samples where COVID-19 patients, lung tumor patients and normal people are equal in number. Moreover, we illustrate potential DL techniques for drug and vaccine discovery in combating the coronavirus. Finally, we address a number of problems, concerns and future research directions relevant to DL applications for COVID-19.

【3】 A Review of Generative Adversarial Networks in Cancer Imaging: New Applications, New Solutions 标题:生成性对抗网络在肿瘤影像中的新应用、新解决方案综述

作者:Richard Osuala,Kaisar Kushibar,Lidia Garrucho,Akis Linardos,Zuzanna Szafranowska,Stefan Klein,Ben Glocker,Oliver Diaz,Karim Lekadir 机构:Artificial Intelligence in Medicine Lab (BCN-AIM), University of Barcelona, Spain, Biomedical Imaging Group Rotterdam, Department of Radiology & Nuclear Medicine, Erasmus MC, Rotterdam, The Netherlands 备注:64 pages, v1, preprint submitted to Elsevier, Oliver Diaz and Karim Lekadir contributed equally to this work 链接:https://arxiv.org/abs/2107.09543 摘要:尽管有技术和医学上的进步,基于成像数据的癌症检测、解释和治疗仍然面临巨大挑战。这些挑战包括观察者间的高度变异性、小尺寸病灶检测的困难、结节解读与恶性程度判定、肿瘤间和肿瘤内的异质性、类别不平衡、分割不准确以及治疗效果的不确定性。近年来,生成对抗网络(Generative Adversarial Networks,GANs)在计算机视觉和医学成像领域的发展,为增强癌症检测和分析能力提供了基础。在这篇综述中,我们评估了GANs在解决癌症成像若干关键挑战方面的潜力,包括数据匮乏和不平衡、领域和数据集偏移、数据访问和隐私、数据标注和量化,以及癌症检测、肿瘤刻画和治疗计划。我们对GANs应用于癌症影像的现有文献进行了批判性评述,并就解决这些挑战的未来研究方向提出了建议。我们分析和讨论了163篇在癌症成像背景下应用对抗训练技术的论文,并阐述了它们的方法、优势和局限性。通过这项工作,我们努力弥合临床癌症影像界的需求与人工智能界当前及未来GANs研究之间的鸿沟。 摘要:Despite technological and medical advances, the detection, interpretation, and treatment of cancer based on imaging data continue to pose significant challenges. These include high inter-observer variability, difficulty of small-sized lesion detection, nodule interpretation and malignancy determination, inter- and intra-tumour heterogeneity, class imbalance, segmentation inaccuracies, and treatment effect uncertainty. The recent advancements in Generative Adversarial Networks (GANs) in computer vision as well as in medical imaging may provide a basis for enhanced capabilities in cancer detection and analysis. In this review, we assess the potential of GANs to address a number of key challenges of cancer imaging, including data scarcity and imbalance, domain and dataset shifts, data access and privacy, data annotation and quantification, as well as cancer detection, tumour profiling and treatment planning. We provide a critical appraisal of the existing literature of GANs applied to cancer imagery, together with suggestions on future research directions to address these challenges. We analyse and discuss 163 papers that apply adversarial training techniques in the context of cancer imaging and elaborate their methodologies, advantages and limitations. With this work, we strive to bridge the gap between the needs of the clinical cancer imaging community and the current and prospective research on GANs in the artificial intelligence community.

GAN|对抗|攻击|生成相关(2篇)

【1】 RankSRGAN: Super Resolution Generative Adversarial Networks with Learning to Rank 标题:RankSRGAN:学习排序的超分辨率生成对抗网络

作者:Wenlong Zhang,Yihao Liu,Chao Dong,Yu Qiao 机构:Shenzhen Institutes of Advanced Technology 备注:IEEE PAMI accepted. arXiv admin note: substantial text overlap with arXiv:1908.06382 链接:https://arxiv.org/abs/2107.09427 摘要:生成对抗网络(GAN)已经证明了在单图像超分辨率(SISR)中恢复真实细节的潜力。为了进一步提高超分辨结果的视觉质量,PIRM2018-SR Challenge采用了PI、NIQE和Ma等感知指标来评估感知质量。然而,现有方法无法直接优化这些不可微的感知指标,而它们已被证明与人类评分高度相关。为了解决这个问题,我们提出了带Ranker的超分辨率生成对抗网络(RankSRGAN),沿不同感知指标的方向优化生成器。具体地说,我们首先训练一个能够学习感知指标行为的Ranker,然后引入一种新的rank-content损失来优化感知质量。最吸引人的部分是,该方法可以结合不同SR方法的优点,产生更好的结果。此外,我们将该方法扩展到多个Ranker,为生成器提供多维约束。大量实验表明,RankSRGAN在视觉上达到了令人满意的效果,并且在感知指标和质量方面达到了最先进的水平。项目页面:https://wenlongzhang0517.github.io/Projects/RankSRGAN 摘要:Generative Adversarial Networks (GAN) have demonstrated the potential to recover realistic details for single image super-resolution (SISR). To further improve the visual quality of super-resolved results, PIRM2018-SR Challenge employed perceptual metrics to assess the perceptual quality, such as PI, NIQE, and Ma. However, existing methods cannot directly optimize these indifferentiable perceptual metrics, which are shown to be highly correlated with human ratings. To address the problem, we propose Super-Resolution Generative Adversarial Networks with Ranker (RankSRGAN) to optimize generator in the direction of different perceptual metrics. Specifically, we first train a Ranker which can learn the behaviour of perceptual metrics and then introduce a novel rank-content loss to optimize the perceptual quality. The most appealing part is that the proposed method can combine the strengths of different SR methods to generate better results. Furthermore, we extend our method to multiple Rankers to provide multi-dimension constraints for the generator. Extensive experiments show that RankSRGAN achieves visually pleasing results and reaches state-of-the-art performance in perceptual metrics and quality. Project page: https://wenlongzhang0517.github.io/Projects/RankSRGAN
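
下面给出一个最小的PyTorch示意(仅为说明思路,并非论文官方实现;Ranker的网络结构、margin取值以及"感知分数越低越好"的约定均为假设),展示如何先用margin ranking损失训练Ranker,再把Ranker的打分作为生成器一侧的rank-content损失:

import torch
import torch.nn as nn

ranker = nn.Sequential(  # 假设的简化Ranker结构
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
)
rank_criterion = nn.MarginRankingLoss(margin=0.5)

def ranker_step(img_better, img_worse):
    # 约定感知分数越低越好(如NIQE);y=-1要求ranker(img_better) < ranker(img_worse)
    s1, s2 = ranker(img_better), ranker(img_worse)
    y = -torch.ones_like(s1)
    return rank_criterion(s1, s2, y)

def rank_content_loss(sr_img):
    # 生成器一侧:把Ranker打分过sigmoid后作为损失,推动输出沿感知指标变好的方向移动
    return torch.sigmoid(ranker(sr_img)).mean()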

【2】 Discriminator-Free Generative Adversarial Attack 标题:无鉴别器生成性对抗性攻击

作者:Shaohao Lu,Yuqiao Xian,Ke Yan,Yi Hu,Xing Sun,Xiaowei Guo,Feiyue Huang,Wei-Shi Zheng 机构:School of Computer Science and Engineering, Guangzhou, China, Tencent Youtu Lab, Shanghai, China 备注:9 pages, 6 figures, 4 tables 链接:https://arxiv.org/abs/2107.09225 摘要:深度神经网络很容易受到对抗样本的影响(图1):在图像中加入不易察觉的扰动,即可使基于DNN的系统失效。现有的对抗攻击方法大多基于梯度,存在推理延迟和GPU显存负载方面的问题。基于生成的对抗攻击可以摆脱这一局限,一些相关工作提出了基于GAN的攻击方法;但由于GAN训练难以收敛,得到的对抗样本要么攻击能力差,要么视觉质量差。在这项工作中,我们发现对于基于生成的对抗攻击,判别器可能并非必需,并提出了基于对称显著性的自动编码器(SSAE)来生成扰动,它由显著性图模块和特征的角度-范数解耦模块组成。该方法的优点在于不依赖判别器,并利用生成的显著性图更多地关注与标签相关的区域。在各种任务、数据集和模型上的大量实验表明,SSAE生成的对抗样本不仅能使广泛使用的模型失效,而且具有良好的视觉质量。代码见 https://github.com/BravoLu/SSAE。 摘要:The Deep Neural Networks are vulnerable to adversarial examples (Figure 1), making the DNNs-based systems collapse by adding the inconspicuous perturbations to the images. Most of the existing works for adversarial attack are gradient-based and suffer from the latency efficiencies and the load on GPU memory. The generative-based adversarial attacks can get rid of this limitation, and some relative works propose the approaches based on GAN. However, suffering from the difficulty of the convergence of training a GAN, the adversarial examples have either bad attack ability or bad visual quality. In this work, we find that the discriminator could be not necessary for generative-based adversarial attack, and propose the Symmetric Saliency-based Auto-Encoder (SSAE) to generate the perturbations, which is composed of the saliency map module and the angle-norm disentanglement of the features module. The advantage of our proposed method lies in that it is not depending on discriminator, and uses the generative saliency map to pay more attention to label-relevant regions. The extensive experiments among the various tasks, datasets, and models demonstrate that the adversarial examples generated by SSAE not only make the widely-used models collapse, but also achieves good visual quality. The code is available at https://github.com/BravoLu/SSAE.
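
作为理解"角度-范数解耦"思想的一个粗略示意(并非论文公式;权重lam与正则的具体形式均为假设),可以在特征空间中最小化对抗特征与原始特征的余弦相似度、同时约束范数,并用显著性图对扰动加权:

import torch
import torch.nn.functional as F

def angle_norm_attack_loss(feat_adv, feat_clean, lam=0.1):
    # 角度项:最小化余弦相似度,使对抗特征在方向上偏离原始特征
    angle = F.cosine_similarity(feat_adv, feat_clean, dim=1).mean()
    # 范数项:约束对抗特征的范数不坍缩(正则形式为假设)
    norm_reg = (feat_adv.norm(dim=1) - feat_clean.norm(dim=1)).abs().mean()
    return angle + lam * norm_reg

def saliency_weighted_perturbation(perturbation, saliency_map):
    # 用生成的显著性图加权扰动,使扰动集中于与标签相关的区域
    return perturbation * saliency_map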

人脸|人群计数(1篇)

【1】 DeepSocNav: Social Navigation by Imitating Human Behaviors 标题:DeepSocNav:模仿人类行为的社交导航

作者:Juan Pablo de Vicente,Alvaro Soto 备注:6 pages, Accepted paper at the RSS Workshop on Social Robot Navigation 2021 链接:https://arxiv.org/abs/2107.09170 摘要:当前用于训练社会行为的数据集通常借用自监控应用,从鸟瞰视角捕捉视觉数据。这就忽略了只有通过场景的第一人称视角才能捕捉到的宝贵关系和视觉线索。在这项工作中,我们提出了一种策略,利用现有游戏引擎(如Unity)的能力,将已有的鸟瞰视角数据集转换为第一人称视角,特别是深度视图。使用这种策略,我们能够生成大量可用于预训练社交导航模型的合成数据。为了验证我们的想法,我们提出了DeepSocNav,这是一个基于深度学习的模型,利用所提方法生成的合成数据。此外,DeepSocNav还包括一个作为辅助任务的自监督策略,即预测智能体将面对的下一帧深度图。我们的实验表明,所提出的模型能够在社交导航得分方面优于相关基线。 摘要:Current datasets to train social behaviors are usually borrowed from surveillance applications that capture visual data from a bird's-eye perspective. This leaves aside precious relationships and visual cues that could be captured through a first-person view of a scene. In this work, we propose a strategy to exploit the power of current game engines, such as Unity, to transform pre-existing bird's-eye view datasets into a first-person view, in particular, a depth view. Using this strategy, we are able to generate large volumes of synthetic data that can be used to pre-train a social navigation model. To test our ideas, we present DeepSocNav, a deep learning based model that takes advantage of the proposed approach to generate synthetic data. Furthermore, DeepSocNav includes a self-supervised strategy that is included as an auxiliary task. This consists of predicting the next depth frame that the agent will face. Our experiments show the benefits of the proposed model that is able to outperform relevant baselines in terms of social navigation scores.
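
文中的自监督辅助任务(预测智能体将面对的下一帧深度图)可以用如下PyTorch草图来理解(网络结构与特征维度均为假设,仅示意训练信号的构造方式):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NextDepthHead(nn.Module):
    # 假设的辅助任务头:由当前观测的特征向量解码出下一帧深度图
    def __init__(self, feat_dim=256):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),  # 输出32x32深度图
        )

    def forward(self, feat):
        return self.decoder(feat)

def auxiliary_loss(pred_next_depth, true_next_depth):
    # 自监督信号:预测深度与真实下一帧深度的L1误差,叠加到导航主损失上
    return F.l1_loss(pred_next_depth, true_next_depth)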

蒸馏|知识提取(1篇)

【1】 Follow Your Path: a Progressive Method for Knowledge Distillation 标题:走自己的路:一种渐进式的知识提炼方法

作者:Wenxian Shi,Yuxuan Song,Hao Zhou,Bohan Li,Lei Li 机构:Bytedance AI Lab, Shanghai, China 备注:Accepted by ECML-PKDD 2021 链接:https://arxiv.org/abs/2107.09305 摘要:深度神经网络通常具有大量的参数,这给在内存和计算能力有限的应用场景中部署带来了挑战。知识提炼是从大模型中导出紧凑模型的一种方法。然而,已有研究发现,一个已收敛的大型教师模型对紧凑学生网络的学习构成很强的约束,可能使优化陷入较差的局部最优。本文提出了一种新的模型无关方法ProKT,它将教师模型的监督信号投影到学生的参数空间中。这种投影通过近似镜像下降技术、把训练目标分解为局部中间目标来实现。所提方法对优化过程中的异常情况不太敏感,从而可能收敛到更好的局部最优。在图像和文本数据集上的实验表明,与现有的知识提炼方法相比,ProKT始终取得更优的性能。 摘要:Deep neural networks often have a huge number of parameters, which posts challenges in deployment in application scenarios with limited memory and computation capacity. Knowledge distillation is one approach to derive compact models from bigger ones. However, it has been observed that a converged heavy teacher model is strongly constrained for learning a compact student network and could make the optimization subject to poor local optima. In this paper, we propose ProKT, a new model-agnostic method by projecting the supervision signals of a teacher model into the student's parameter space. Such projection is implemented by decomposing the training objective into local intermediate targets with an approximate mirror descent technique. The proposed method could be less sensitive with the quirks during optimization which could result in a better local optimum. Experiments on both image and text datasets show that our proposed ProKT consistently achieves superior performance compared to other existing knowledge distillation methods.
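
下面是一个高度简化的示意(并非论文的精确公式;插值系数alpha与对数空间几何插值的具体形式均为假设),说明"把训练目标分解为局部中间目标"的一种可能做法:在学生当前分布与教师分布之间做插值,作为学生每一步的蒸馏目标:

import torch
import torch.nn.functional as F

def intermediate_target(student_logits, teacher_logits, alpha=0.3):
    # 在学生当前分布与教师分布之间做对数空间插值并重新归一化,
    # 作为这一步的局部中间目标(近似一步镜像下降;alpha为假设的步长)
    log_mix = (1 - alpha) * F.log_softmax(student_logits.detach(), dim=-1) \
              + alpha * F.log_softmax(teacher_logits, dim=-1)
    return F.softmax(log_mix, dim=-1)

def prokt_style_loss(student_logits, teacher_logits):
    target = intermediate_target(student_logits, teacher_logits)
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                    reduction="batchmean")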

视觉解释|视频理解VQA|caption等(1篇)

【1】 Separating Skills and Concepts for Novel Visual Question Answering 标题:新颖视觉问答系统中概念与技巧的分离

作者:Spencer Whitehead,Hui Wu,Heng Ji,Rogerio Feris,Kate Saenko 机构:UIUC, MIT-IBM Watson AI Lab, IBM Research, Boston University 备注:Paper at CVPR 2021. 14 pages, 7 figures 链接:https://arxiv.org/abs/2107.09106 摘要:非分布数据的泛化一直是可视化问答(VQA)模型的一个难题。为了衡量对新问题的泛化能力,我们建议将问题分解为"技能"和"概念":"技能"是诸如计数或属性识别之类的视觉任务,作用于问题中提到的"概念"(如物体和人)。VQA方法应该能够以新颖的方式组合技能和概念,而不管具体的组合是否在训练中出现过;但我们证明,现有模型在处理新组合方面仍有很大改进空间。我们提出了一种学习组合技能和概念的新方法:通过学习有视觉落地(grounded)的概念表示,并将技能编码与概念编码解耦,在模型内部隐式地分离这两个因素。我们通过一个新的对比学习过程来强化这些属性,该过程不依赖外部标注,可以从未标注的图像-问题对中学习。实验证明了该方法在提高组合与落地(grounding)性能方面的有效性。 摘要:Generalization to out-of-distribution data has been a problem for Visual Question Answering (VQA) models. To measure generalization to novel questions, we propose to separate them into "skills" and "concepts". "Skills" are visual tasks, such as counting or attribute recognition, and are applied to "concepts" mentioned in the question, such as objects and people. VQA methods should be able to compose skills and concepts in novel ways, regardless of whether the specific composition has been seen in training, yet we demonstrate that existing models have much to improve upon towards handling new compositions. We present a novel method for learning to compose skills and concepts that separates these two factors implicitly within a model by learning grounded concept representations and disentangling the encoding of skills from that of concepts. We enforce these properties with a novel contrastive learning procedure that does not rely on external annotations and can be learned from unlabeled image-question pairs. Experiments demonstrate the effectiveness of our approach for improving compositional and grounding performance.
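
论文中的对比学习过程依赖技能/概念正负样本对的具体构造,这里仅给出一个通用的InfoNCE式对比损失草图(温度tau等均为假设),以说明"无需外部标注、从样本对中学习"这类目标的一般形式:

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.07):
    # anchor/positive: [B, D];negatives: [K, D]
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    pos = (a * p).sum(-1, keepdim=True) / tau   # [B, 1]
    neg = a @ n.t() / tau                       # [B, K]
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(a), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, labels)      # 正样本固定在第0列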

3D|3D重建等相关(1篇)

【1】 Active 3D Shape Reconstruction from Vision and Touch 标题:基于视觉和触觉的主动三维形状重建

作者:Edward J. Smith,David Meger,Luis Pineda,Roberto Calandra,Jitendra Malik,Adriana Romero,Michal Drozdzal 机构: Facebook AI Research, McGill University, University of California, Berkeley 链接:https://arxiv.org/abs/2107.09584 摘要:人类通过主动探索物体,联合使用视觉和触觉,建立对世界的三维理解。然而,在三维形状重建中,最近的进展大多依赖于感官数据有限(如RGB图像、深度图或触觉读数)的静态数据集,对形状的主动探索在很大程度上仍未被研究。在主动触觉三维重建中,目标是主动选择能最大程度提高形状重建精度的触觉读数。然而,基于深度学习的主动触摸模型的发展在很大程度上受限于形状探索框架的缺乏。本文针对这一问题,介绍了一个由以下部分组成的系统:1)利用高空间分辨率的基于视觉的触觉传感器、用于主动触摸三维物体的触觉模拟器;2)依赖触觉或视觉-触觉信号的基于网格的三维形状重建模型;以及3)一组以触觉或视觉-触觉先验指导形状探索的数据驱动解决方案。我们的框架使得在已学习的物体理解模型之上开发首个完全数据驱动的主动触摸解决方案成为可能。我们的实验表明了此类解决方案在三维形状理解任务中的优势:我们的模型始终优于自然基线。我们提供我们的框架作为工具,以促进这一方向的未来研究。 摘要:Humans build 3D understandings of the world through active object exploration, using jointly their senses of vision and touch. However, in 3D shape reconstruction, most recent progress has relied on static datasets of limited sensory data such as RGB images, depth maps or haptic readings, leaving the active exploration of the shape largely unexplored. In active touch sensing for 3D reconstruction, the goal is to actively select the tactile readings that maximize the improvement in shape reconstruction accuracy. However, the development of deep learning-based active touch models is largely limited by the lack of frameworks for shape exploration. In this paper, we focus on this problem and introduce a system composed of: 1) a haptic simulator leveraging high spatial resolution vision-based tactile sensors for active touching of 3D objects; 2) a mesh-based 3D shape reconstruction model that relies on tactile or visuotactile signals; and 3) a set of data-driven solutions with either tactile or visuotactile priors to guide the shape exploration. Our framework enables the development of the first fully data-driven solutions to active touch on top of learned models for object understanding. Our experiments show the benefits of such solutions in the task of 3D shape understanding where our models consistently outperform natural baselines. We provide our framework as a tool to foster future research in this direction.

其他神经网络|深度学习|模型|建模(6篇)

【1】 Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning 标题:深度度量学习中非分布移位下泛化的刻画

作者:Timo Milbich,Karsten Roth,Samarth Sinha,Ludwig Schmidt,Marzyeh Ghassemi,Björn Ommer 机构:IWR, Heidelberg University, University of Tübingen, University of Washington, MIT, University of Toronto, Vector 链接:https://arxiv.org/abs/2107.09562 摘要:深度度量学习(Deep Metric Learning,DML)的目的是寻找适合Zero-Shot迁移到先验未知测试分布的表示。然而,通用的评估协议只测试单一、固定的数据划分,其中训练类和测试类是随机分配的。更现实的评估应该考虑具有不同程度和难度的广泛分布偏移。在这项工作中,我们系统地构造了难度递增的训练-测试划分,并提出了ooDML基准来刻画DML在分布外偏移下的泛化能力。ooDML旨在考察模型在更具挑战性、更多样化的训练到测试分布偏移下的泛化性能。基于我们的新基准,我们对最新的DML方法进行了深入的实证分析。我们发现,虽然泛化性能通常随难度增加而持续下降,但随着分布偏移的增大,一些方法在保持性能方面表现更好。最后,我们提出few-shot DML,作为一种持续改进泛化、以应对ooDML中出现的未知测试偏移的有效方法。此处提供代码:https://github.com/Confusezius/Characterizing_Generalization_in_DeepMetricLearning. 摘要:Deep Metric Learning (DML) aims to find representations suitable for zero-shot transfer to a priori unknown test distributions. However, common evaluation protocols only test a single, fixed data split in which train and test classes are assigned randomly. More realistic evaluations should consider a broad spectrum of distribution shifts with potentially varying degree and difficulty. In this work, we systematically construct train-test splits of increasing difficulty and present the ooDML benchmark to characterize generalization under out-of-distribution shifts in DML. ooDML is designed to probe the generalization performance on much more challenging, diverse train-to-test distribution shifts. Based on our new benchmark, we conduct a thorough empirical analysis of state-of-the-art DML methods. We find that while generalization tends to consistently degrade with difficulty, some methods are better at retaining performance as the distribution shift increases. Finally, we propose few-shot DML as an efficient way to consistently improve generalization in response to unknown test shifts presented in ooDML. Code available here: https://github.com/Confusezius/Characterizing_Generalization_in_DeepMetricLearning.
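
"构造难度递增的训练-测试划分"的一种可能实现思路如下(仅为示意,并非该基准的官方构造方式;用类中心距离近似分布偏移难度是此处的假设):

import numpy as np

def split_by_difficulty(class_embeddings, frac_test=0.5, hardest=True):
    # class_embeddings: {类名: [N_c, D] 的样本嵌入数组}
    names = list(class_embeddings)
    centroids = np.stack([class_embeddings[c].mean(axis=0) for c in names])
    # 用每个类中心到其余类中心的平均距离近似该类的"偏移难度"
    dists = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    mean_dist = dists.mean(axis=1)
    order = np.argsort(mean_dist)
    if hardest:
        order = order[::-1]        # 距离最远的类留作测试 => 更难的划分
    n_test = int(len(names) * frac_test)
    test = {names[i] for i in order[:n_test]}
    train = [c for c in names if c not in test]
    return train, sorted(test)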

【2】 Data Hiding with Deep Learning: A Survey Unifying Digital Watermarking and Steganography 标题:基于深度学习的数据隐藏:数字水印与隐写术相结合的研究综述

作者:Olivia Byrnes,Wendy La,Hu Wang,Congbo Ma,Minhui Xue,Qi Wu 机构: The University of Adelaide 链接:https://arxiv.org/abs/2107.09287 摘要:数据隐藏是将信息嵌入到抗噪信号(如音频、视频或图像)中的过程。数字水印是一种数据隐藏的形式,它将识别数据牢固地嵌入其中,以抵抗篡改,并用于识别媒体的原始所有者。隐写术,另一种形式的数据隐藏,嵌入数据的目的是安全和保密的通信。本文综述了用于数字水印和隐写术的数据隐藏深度学习技术的最新进展,并根据模型结构和噪声注入方法对其进行了分类。全面总结了用于训练这些数据隐藏模型的目标函数、评估指标和数据集。最后,我们提出并讨论了深层数据隐藏技术未来可能的研究方向。 摘要:Data hiding is the process of embedding information into a noise-tolerant signal such as a piece of audio, video, or image. Digital watermarking is a form of data hiding where identifying data is robustly embedded so that it can resist tampering and be used to identify the original owners of the media. Steganography, another form of data hiding, embeds data for the purpose of secure and secret communication. This survey summarises recent developments in deep learning techniques for data hiding for the purposes of watermarking and steganography, categorising them based on model architectures and noise injection methods. The objective functions, evaluation metrics, and datasets used for training these data hiding models are comprehensively summarised. Finally, we propose and discuss possible future directions for research into deep data hiding techniques.

【3】 Accelerating deep neural networks for efficient scene understanding in automotive cyber-physical systems 标题:在汽车网络物理系统中加速深度神经网络以实现有效的场景理解

作者:Stavros Nousias,Erion-Vasilis Pikoulis,Christos Mavrokefalidis,Aris S. Lalos 机构:Industrial Systems Institute, Athena Research Center, Patras Science Park, Greece, Computer Engineering and Informatics Dept., University of Patras, Greece 链接:https://arxiv.org/abs/2107.09101 摘要:在过去的几十年中,汽车网络物理系统(ACPS)引起了人们极大的兴趣,而这些系统中最关键的操作之一就是对环境的感知。深度学习,特别是深度神经网络(DNNs)的使用,在从视觉数据分析和理解复杂的动态场景方面提供了令人印象深刻的结果。这些感知系统的预测视界很短,通常必须实时进行推理,强调需要利用模型压缩和加速(MCA)技术将原来的大型预训练网络转化为新的小型模型。我们在这项工作中的目标是调查最佳实践,以适当地应用新的重量分担技术,优化可用的变量和训练程序,以显著加快广泛采用的DNNs。在目标检测和跟踪实验中使用各种最先进的DNN模型进行了广泛的评估研究,提供了应用权重共享技术后出现的错误类型的详细信息,导致显著的加速增益,而精度损失可以忽略不计。 摘要:Automotive Cyber-Physical Systems (ACPS) have attracted a significant amount of interest in the past few decades, while one of the most critical operations in these systems is the perception of the environment. Deep learning and, especially, the use of Deep Neural Networks (DNNs) provides impressive results in analyzing and understanding complex and dynamic scenes from visual data. The prediction horizons for those perception systems are very short and inference must often be performed in real time, stressing the need of transforming the original large pre-trained networks into new smaller models, by utilizing Model Compression and Acceleration (MCA) techniques. Our goal in this work is to investigate best practices for appropriately applying novel weight sharing techniques, optimizing the available variables and the training procedures towards the significant acceleration of widely adopted DNNs. Extensive evaluation studies carried out using various state-of-the-art DNN models in object detection and tracking experiments, provide details about the type of errors that manifest after the application of weight sharing techniques, resulting in significant acceleration gains with negligible accuracy losses.
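
权重共享的一种常见做法是对层权重做k-means聚类,用簇中心(码本)替代原权重,只存储码本和索引,从而压缩模型。下面是一个通用草图(并非论文所用的具体技术,簇数等超参为假设):

import numpy as np
from sklearn.cluster import KMeans

def share_weights(weight, n_clusters=32):
    # 对一层权重做k-means聚类:每个权重由其簇中心替代,
    # 只需存储码本(簇中心)和每个权重的簇索引
    flat = weight.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=4).fit(flat)
    codebook = km.cluster_centers_.ravel()
    indices = km.labels_
    shared = codebook[indices].reshape(weight.shape)
    return shared, codebook, indices

# 用法示意:w_shared, codebook, idx = share_weights(conv_weight_numpy)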

【4】 Learning a Sensor-invariant Embedding of Satellite Data: A Case Study for Lake Ice Monitoring 标题:学习传感器不变的卫星数据嵌入方法--以湖冰监测为例

作者:Manu Tom,Yuchang Jiang,Emmanuel Baltsavias,Konrad Schindler 机构: Swiss Federal Institute of Technology, Swiss Federal Institute of Aquatic Science and Technology, University of Zurich 链接:https://arxiv.org/abs/2107.09092 摘要:融合不同传感器获取的卫星图像一直是地球观测领域的一个长期挑战,特别是在光学和合成孔径雷达(SAR)图像等不同模态之间。在这里,我们从表征学习的角度探讨不同传感器图像的联合分析:我们建议在深度神经网络中学习一个联合的、传感器不变的嵌入(特征表示)。我们的应用问题是高山湖泊冰情的监测。为了达到瑞士全球气候观测系统(GCOS)办公室的时间分辨率要求,我们结合了三种图像源:Sentinel-1 SAR(S1-SAR)、Terra MODIS和Suomi-NPP VIIRS。光学域和SAR域之间以及传感器分辨率之间的巨大差距,使这成为传感器融合问题中一个具有挑战性的实例。我们的方法可以归类为以数据驱动方式学习的特征级融合。所提出的网络结构为每个图像传感器提供独立的编码分支,这些分支汇入单一的潜在嵌入,即由所有输入共享的公共特征表示,使得后续处理步骤无论使用哪种类型的输入图像都能提供可比的输出。通过融合卫星数据,我们以<1.5天的时间分辨率绘制了湖冰图。该网络生成了空间上明确的湖冰图,像素级精度>91.3%(mIoU得分>60.7%),并且在不同湖泊和冬季之间具有良好的泛化能力。此外,它在确定目标湖泊的重要冻结(ice-on)和解冻(ice-off)日期方面达到了新的技术水平,在许多情况下满足全球气候观测系统的要求。 摘要:Fusing satellite imagery acquired with different sensors has been a long-standing challenge of Earth observation, particularly across different modalities such as optical and Synthetic Aperture Radar (SAR) images. Here, we explore the joint analysis of imagery from different sensors in the light of representation learning: we propose to learn a joint, sensor-invariant embedding (feature representation) within a deep neural network. Our application problem is the monitoring of lake ice on Alpine lakes. To reach the temporal resolution requirement of the Swiss Global Climate Observing System (GCOS) office, we combine three image sources: Sentinel-1 SAR (S1-SAR), Terra MODIS and Suomi-NPP VIIRS. The large gaps between the optical and SAR domains and between the sensor resolutions make this a challenging instance of the sensor fusion problem. Our approach can be classified as a feature-level fusion that is learnt in a data-driven manner. The proposed network architecture has separate encoding branches for each image sensor, which feed into a single latent embedding. I.e., a common feature representation shared by all inputs, such that subsequent processing steps deliver comparable output irrespective of which sort of input image was used. By fusing satellite data, we map lake ice at a temporal resolution of <1.5 days. The network produces spatially explicit lake ice maps with pixel-wise accuracies >91.3% (respectively, mIoU scores >60.7%) and generalises well across different lakes and winters. Moreover, it sets a new state-of-the-art for determining the important ice-on and ice-off dates for the target lakes, in many cases meeting the GCOS requirement.
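
文中"每个传感器一条独立编码分支、汇入单一共享潜在嵌入"的结构,可以用如下PyTorch草图来理解(通道数、潜在维度与分支命名均为假设,仅示意架构):

import torch
import torch.nn as nn

def make_encoder(in_ch, latent_dim=128):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, latent_dim),
    )

class SensorInvariantNet(nn.Module):
    # 每个传感器一个编码分支,全部映射到同一潜在空间,共享同一个输出头
    def __init__(self, latent_dim=128, n_classes=2):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "s1_sar": make_encoder(1, latent_dim),   # 输入通道数为假设
            "modis": make_encoder(3, latent_dim),
            "viirs": make_encoder(3, latent_dim),
        })
        self.head = nn.Linear(latent_dim, n_classes)

    def forward(self, x, sensor):
        z = self.encoders[sensor](x)   # 传感器专属编码
        return self.head(z)            # 后续输出与输入传感器类型无关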

【5】 OSLO: On-the-Sphere Learning for Omnidirectional images and its application to 360-degree image compression 标题:OSLO:全方位图像的On-the-Sphere学习及其在360度图像压缩中的应用

作者:Navid Mahmoudian Bidgoli,Roberto G. de A. Azevedo,Thomas Maugey,Aline Roumy,Pascal Frossard 链接:https://arxiv.org/abs/2107.09179 摘要:最先进的二维图像压缩方案依赖于卷积神经网络(CNNs)的能力。尽管cnn为2D图像压缩提供了很好的前景,但将这种模型扩展到全方位图像并不是一件容易的事。首先,全方位图像具有特定的空间和统计特性,目前的CNN模型无法完全捕捉到这些特性。第二,构成CNN架构的基本数学运算,例如平移和采样,在球体上没有很好的定义。本文研究了全向图像表示模型的学习,提出利用HEALPix均匀采样的性质重新定义全向图像深度学习模型的数学工具。特别地,我们:i)在球面上定义了一种新的卷积运算,保持了经典二维卷积运算的高表达性和低复杂度;ii)将标准CNN技术(如步幅、迭代聚合和像素洗牌)适应于球形域;然后iii)将我们的新框架应用到全方位图像压缩任务中。我们的实验表明,我们提出的球面上的解决方案导致了更好的压缩增益,可以节省13.7%的比特率相比,类似的学习模型适用于等矩形图像。此外,与基于图卷积网络的学习模型相比,我们的解决方案支持更具表现力的滤波器,能够保持高频率,并提供更好的压缩图像感知质量。这些结果证明了该框架的有效性,为其他全方位视觉任务在球面流形上的有效实现开辟了新的研究空间。 摘要:State-of-the-art 2D image compression schemes rely on the power of convolutional neural networks (CNNs). Although CNNs offer promising perspectives for 2D image compression, extending such models to omnidirectional images is not straightforward. First, omnidirectional images have specific spatial and statistical properties that can not be fully captured by current CNN models. Second, basic mathematical operations composing a CNN architecture, e.g., translation and sampling, are not well-defined on the sphere. In this paper, we study the learning of representation models for omnidirectional images and propose to use the properties of HEALPix uniform sampling of the sphere to redefine the mathematical tools used in deep learning models for omnidirectional images. In particular, we: i) propose the definition of a new convolution operation on the sphere that keeps the high expressiveness and the low complexity of a classical 2D convolution; ii) adapt standard CNN techniques such as stride, iterative aggregation, and pixel shuffling to the spherical domain; and then iii) apply our new framework to the task of omnidirectional image compression. Our experiments show that our proposed on-the-sphere solution leads to a better compression gain that can save 13.7% of the bit rate compared to similar learned models applied to equirectangular images. Also, compared to learning models based on graph convolutional networks, our solution supports more expressive filters that can preserve high frequencies and provide a better perceptual quality of the compressed images. Such results demonstrate the efficiency of the proposed framework, which opens new research venues for other omnidirectional vision tasks to be effectively implemented on the sphere manifold.
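
基于HEALPix均匀采样做球面卷积的基本思路,可以用healpy的邻域查询粗略示意:每个像素与其8个邻居(加上自身)做共享权重的加权求和。下面的草图只是单通道、未优化的演示(并非论文定义的卷积算子):

import numpy as np
import healpy as hp

def healpix_conv(signal, weights, nside):
    # signal: 长度为npix的球面标量场;weights: 长度为9(自身+8个邻居)
    npix = hp.nside2npix(nside)
    out = np.zeros(npix)
    for p in range(npix):
        neigh = hp.get_all_neighbours(nside, p)   # 8个邻居,缺失处为-1
        vals = [signal[p]] + [signal[n] if n >= 0 else 0.0 for n in neigh]
        out[p] = np.dot(weights, vals)
    return out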

【6】 Quality and Complexity Assessment of Learning-Based Image Compression Solutions 标题:基于学习的图像压缩方案的质量和复杂度评估

作者:João Dick,Brunno Abreu,Mateus Grellert,Sergio Bampi 机构: Informatics Institute (PGMICRO), Federal University of Rio Grande do Sul, Porto Alegre, Brazil, Graduate Program in Computer Science, Federal University of Santa Catarina, Florian´opolis, Brazil 备注:Paper accepted at ICIP2021 链接:https://arxiv.org/abs/2107.09136 摘要:本文分析了当前最先进的基于学习的图像压缩技术。我们使用KODAK数据集,从视觉质量指标和处理时间两方面比较了TensorFlow Compression包中8种可用模型,并将结果与Better Portable Graphics(BPG)和JPEG2000编解码器进行了比较。结果表明,与最快的基于学习的模型相比,JPEG2000的执行时间最少,压缩加速比为1.46倍,解压加速比为30倍。然而,基于学习的模型在质量方面优于JPEG2000,特别是在较低码率下。我们的研究结果还表明,BPG在PSNR方面更有效,但在其他质量指标上学习模型更好,有时甚至更快。结果表明,基于学习的压缩方法有望成为未来的主流压缩方案。 摘要:This work presents an analysis of state-of-the-art learning-based image compression techniques. We compare 8 models available in the Tensorflow Compression package in terms of visual quality metrics and processing time, using the KODAK data set. The results are compared with the Better Portable Graphics (BPG) and the JPEG2000 codecs. Results show that JPEG2000 has the lowest execution times compared with the fastest learning-based model, with a speedup of 1.46x in compression and 30x in decompression. However, the learning-based models achieved improvements over JPEG2000 in terms of quality, specially for lower bitrates. Our findings also show that BPG is more efficient in terms of PSNR, but the learning models are better for other quality metrics, and sometimes even faster. The results indicate that learning-based techniques are promising solutions towards a future mainstream compression method.
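
复现这类评测的骨架大致如下(codec_encode、codec_decode为假设的编码/解码回调接口,PSNR按8比特图像计算),用于统计每个编解码器的码率、质量与耗时:

import time
import numpy as np

def psnr(ref, test, max_val=255.0):
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

def benchmark(codec_encode, codec_decode, images):
    # codec_encode(img)->bytes 与 codec_decode(bytes)->img 为假设的接口
    enc_t = dec_t = 0.0
    results = []
    for img in images:
        t0 = time.perf_counter(); bitstream = codec_encode(img)
        t1 = time.perf_counter(); recon = codec_decode(bitstream)
        t2 = time.perf_counter()
        enc_t += t1 - t0; dec_t += t2 - t1
        bpp = 8 * len(bitstream) / (img.shape[0] * img.shape[1])  # 每像素比特数
        results.append((bpp, psnr(img, recon)))
    return enc_t, dec_t, results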

其他(3篇)

【1】 Built-in Elastic Transformations for Improved Robustness 标题:内置弹性变换,提高了健壮性

作者:Sadaf Gulshad,Ivan Sosnovik,Arnold Smeulders 机构:UvA-Bosch Delta Lab, University of Amsterdam, The Netherlands 链接:https://arxiv.org/abs/2107.09391 摘要:我们致力于在神经视觉分类器的卷积中建立鲁棒性,特别是针对弹性变形、遮挡和高斯噪声等自然扰动。现有的CNN在干净图像上表现出色,但无法应对自然出现的扰动。本文从弹性扰动出发,它近似刻画了物体的(局部)视点变化。我们提出弹性增强卷积(EAConv),将滤波器参数化为固定的弹性扰动基函数与可训练权重的组合,从而把未见过的视点整合进CNN。我们在CIFAR-10和STL-10数据集上表明,在不进行任何数据增强的情况下,我们的方法对未见过的遮挡和高斯扰动的总体鲁棒性有所提升,甚至在干净图像上的性能也略有提高。 摘要:We focus on building robustness in the convolutions of neural visual classifiers, especially against natural perturbations like elastic deformations, occlusions and Gaussian noise. Existing CNNs show outstanding performance on clean images, but fail to tackle naturally occurring perturbations. In this paper, we start from elastic perturbations, which approximate (local) view-point changes of the object. We present elastically-augmented convolutions (EAConv) by parameterizing filters as a combination of fixed elastically-perturbed bases functions and trainable weights for the purpose of integrating unseen viewpoints in the CNN. We show on CIFAR-10 and STL-10 datasets that the general robustness of our method on unseen occlusion and Gaussian perturbations improves, while even improving the performance on clean images slightly without performing any data augmentation.
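
"滤波器 = 固定基函数 x 可训练系数"这一参数化方式可以用如下PyTorch草图说明(这里用随机固定基代替论文中的弹性扰动基,基的个数等为假设,仅演示参数化与前向计算):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasisConv2d(nn.Module):
    # 滤波器 = 固定基(不参与训练)与可训练系数的线性组合;
    # 随机固定基仅作占位,论文中应换成弹性扰动后的基函数
    def __init__(self, in_ch, out_ch, k=3, n_basis=8):
        super().__init__()
        self.register_buffer("basis", torch.randn(n_basis, k, k))
        self.coeff = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, n_basis))

    def forward(self, x):
        # weight[o, i] = sum_b coeff[o, i, b] * basis[b]
        weight = torch.einsum("oib,bkl->oikl", self.coeff, self.basis)
        return F.conv2d(x, weight, padding=weight.shape[-1] // 2)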

【2】 Monocular Visual Analysis for Electronic Line Calling of Tennis Games 标题:网球比赛电子司线的单目视觉分析

作者:Yuanzhou Chen,Shaobo Cai,Yuxin Wang,Junchi Yan 机构: YK Pao School, Shanghai, China, Shanghai Jiao Tong University, Shanghai, China 备注:12 pages, 6 figures 链接:https://arxiv.org/abs/2107.09255 摘要:电子司线(Electronic Line Calling,ELC)是一种基于双目视觉技术的网球比赛辅助裁判系统。虽然ELC已被广泛使用,但仍存在安装维护复杂、成本高等问题。我们提出了一种基于单目视觉技术的ELC方法,步骤如下。首先,定位网球的轨迹:我们提出了一种结合背景减除和颜色区域过滤的多级网球定位方法。然后,我们提出了一种通过最小化不确定点的拟合损失来预测弹跳点的方法。最后,根据弹跳点与球场边线在二维图像中的相对位置,判断球的弹跳点是否出界。我们采集并标注了394个样本,准确率为99.4%;在11个包含弹跳点的样本中,准确率为81.8%。实验结果表明,我们的方法用单目视觉判断球是否出界是可行的,并显著降低了双目视觉ELC系统的复杂安装和成本。 摘要:Electronic Line Calling is an auxiliary referee system used for tennis matches based on binocular vision technology. While ELC has been widely used, there are still many problems, such as complex installation and maintenance, high cost and etc. We propose a monocular vision technology based ELC method. The method has the following steps. First, locate the tennis ball's trajectory. We propose a multistage tennis ball positioning approach combining background subtraction and color area filtering. Then we propose a bouncing point prediction method by minimizing the fitting loss of the uncertain point. Finally, we find out whether the bouncing point of the ball is out of bounds or not according to the relative position between the bouncing point and the court side line in the two dimensional image. We collected and tagged 394 samples with an accuracy rate of 99.4%, and 81.8% of the 11 samples with bouncing points.The experimental results show that our method is feasible to judge if a ball is out of the court with monocular vision and significantly reduce complex installation and costs of ELC system with binocular vision.
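
文中"背景减除+颜色区域过滤"的多级定位流程,可用OpenCV粗略示意如下(HSV颜色阈值为假设值,需按场地光照标定;并非论文的完整实现):

import cv2

bg_sub = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

def locate_ball(frame_bgr):
    # 第一级:背景减除得到运动前景
    motion_mask = bg_sub.apply(frame_bgr)
    # 第二级:HSV空间过滤网球的黄绿色区域(阈值为假设值)
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    color_mask = cv2.inRange(hsv, (25, 80, 80), (45, 255, 255))
    mask = cv2.bitwise_and(motion_mask, color_mask)
    cnts, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not cnts:
        return None
    c = max(cnts, key=cv2.contourArea)            # 取最大连通域作为候选球
    (x, y), r = cv2.minEnclosingCircle(c)
    return (x, y), r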

【3】 LAPNet: Non-rigid Registration derived in k-space for Magnetic Resonance Imaging 标题:LAPNet:基于k空间的磁共振成像非刚性配准

作者:Thomas Küstner,Jiazhen Pan,Haikun Qi,Gastao Cruz,Christopher Gilliam,Thierry Blu,Bin Yang,Sergios Gatidis,René Botnar,Claudia Prieto 链接:https://arxiv.org/abs/2107.09060 摘要:生理运动(如心脏和呼吸运动)在磁共振(MR)图像采集过程中可能会导致图像伪影。已有运动校正技术被提出用于补偿胸部扫描中的这类运动,它们依赖于从欠采样运动分辨重建中获得准确的运动估计。一个特别的兴趣和挑战在于从欠采样的运动分辨数据中导出可靠的非刚性运动场。运动估计通常通过扩散、参数样条或光流方法在图像空间中进行。然而,基于图像的配准可能会因欠采样运动分辨重建残留的混叠伪影而受损。在这项工作中,我们描述了一种直接在采样傅里叶空间(即k空间)中执行非刚性配准的形式化方法。我们提出了一种基于深度学习的方法,从欠采样的k空间数据中快速准确地进行非刚性配准。其基本工作原理源于局部全通(LAP)技术,这是最近引入的一种基于光流的配准方法。在40名疑似肝或肺转移患者和25名健康受试者的队列中,我们将所提出的LAPNet与传统的和基于深度学习的图像域配准方法进行了比较,并在完全采样和高度加速(两种欠采样策略)的3D呼吸运动分辨MR图像上进行了测试。在不同的采样轨迹和加速因子下,所提出的LAPNet都提供了一致且优于基于图像方法的性能。 摘要:Physiological motion, such as cardiac and respiratory motion, during Magnetic Resonance (MR) image acquisition can cause image artifacts. Motion correction techniques have been proposed to compensate for these types of motion during thoracic scans, relying on accurate motion estimation from undersampled motion-resolved reconstruction. A particular interest and challenge lie in the derivation of reliable non-rigid motion fields from the undersampled motion-resolved data. Motion estimation is usually formulated in image space via diffusion, parametric-spline, or optical flow methods. However, image-based registration can be impaired by remaining aliasing artifacts due to the undersampled motion-resolved reconstruction. In this work, we describe a formalism to perform non-rigid registration directly in the sampled Fourier space, i.e. k-space. We propose a deep-learning based approach to perform fast and accurate non-rigid registration from the undersampled k-space data. The basic working principle originates from the Local All-Pass (LAP) technique, a recently introduced optical flow-based registration. The proposed LAPNet is compared against traditional and deep learning image-based registrations and tested on fully-sampled and highly-accelerated (with two undersampling strategies) 3D respiratory motion-resolved MR images in a cohort of 40 patients with suspected liver or lung metastases and 25 healthy subjects. The proposed LAPNet provided consistent and superior performance to image-based approaches throughout different sampling trajectories and acceleration factors.
