
Computer Vision arXiv Daily Digest [7.9]

By the WeChat official account arXiv每日学术速递 · Published 2021-07-27 10:44:44
From the column: arXiv每日学术速递

cs.CV: 54 papers today.

Transformer (1 paper)

【1】 Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers

Authors: Ruihan Yang, Minghao Zhang, Nicklas Hansen, Huazhe Xu, Xiaolong Wang
Affiliations: UC San Diego, Tsinghua University, UC Berkeley
Note: Project page with videos at https://RchalYang.github.io/LocoTransformer
Link: https://arxiv.org/abs/2107.03996
Abstract: We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method for quadrupedal locomotion that leverages a Transformer-based model for fusing proprioceptive states and visual observations. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We show that our method obtains significant improvements over policies with only proprioceptive state inputs, and that Transformer-based models further improve generalization across environments. Our project page with videos is at https://RchalYang.github.io/LocoTransformer .
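
As a rough illustration of the fusion mechanism the abstract describes, the sketch below feeds one proprioceptive token and a grid of depth-patch tokens through a shared Transformer encoder and pools the result into action logits. This is a minimal sketch under assumed dimensions (a 93-d proprioceptive vector, 64x64 depth images, all layer sizes), not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal sketch: fuse a proprioceptive token with depth-patch tokens."""
    def __init__(self, proprio_dim=93, embed_dim=128, n_heads=4, n_layers=2, n_actions=12):
        super().__init__()
        self.proprio_proj = nn.Linear(proprio_dim, embed_dim)                   # one proprioceptive token
        self.depth_encoder = nn.Conv2d(1, embed_dim, kernel_size=8, stride=8)   # 8x8 patches -> visual tokens
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.policy_head = nn.Linear(embed_dim, n_actions)

    def forward(self, proprio, depth):
        # proprio: (B, proprio_dim), depth: (B, 1, 64, 64)
        p_tok = self.proprio_proj(proprio).unsqueeze(1)               # (B, 1, D)
        v_tok = self.depth_encoder(depth).flatten(2).transpose(1, 2)  # (B, 64, D)
        tokens = torch.cat([p_tok, v_tok], dim=1)                     # joint token sequence
        fused = self.transformer(tokens)                              # cross-modal attention
        return self.policy_head(fused.mean(dim=1))                    # pooled -> action logits

actions = CrossModalFusion()(torch.randn(2, 93), torch.randn(2, 1, 64, 64))
print(actions.shape)  # torch.Size([2, 12])
```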

Detection (4 papers)

【1】 Multi-Modality Task Cascade for 3D Object Detection

Authors: Jinhyung Park, Xinshuo Weng, Yunze Man, Kris Kitani
Affiliations: Carnegie Mellon University
Link: https://arxiv.org/abs/2107.04013
Abstract: Point clouds and RGB images are naturally complementary modalities for 3D visual understanding - the former provides sparse but accurate locations of points on objects, while the latter contains dense color and texture information. Despite this potential for close sensor fusion, many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data. This separated training scheme results in potentially sub-optimal performance and prevents 3D tasks from being used to benefit 2D tasks that are often useful on their own. To provide a more integrated approach, we propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions, which are then used to further refine the 3D boxes. We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance. Moreover, to prevent the 3D module from over-relying on the overfitted 2D predictions, we propose a dual-head 2D segmentation training and inference scheme, allowing the 2nd 3D module to learn to interpret imperfect 2D segmentation predictions. Evaluating our model on the challenging SUN RGB-D dataset, we improve upon state-of-the-art results of both single modality and fusion networks by a large margin ($\textbf{+3.8}$ mAP@0.5). Code will be released $\href{https://github.com/Divadi/MTC_RCNN}{\text{here.}}$

【2】 Optimizing Data Processing in Space for Object Detection in Satellite Imagery

Authors: Martina Lofqvist, José Cano
Affiliations: School of Computing Science, University of Glasgow, Lilybank Gardens, Glasgow, United Kingdom
Note: Published as a workshop paper at SmallSat 2021 - The 35th Annual Small Satellite Conference. 9 pages, 10 figures. arXiv admin note: text overlap with arXiv:2007.11089
Link: https://arxiv.org/abs/2107.03774
Abstract: There is a proliferation in the number of satellites launched each year, resulting in downlinking of terabytes of data each day. The data received by ground stations is often unprocessed, making this an expensive process considering the large data sizes and that not all of the data is useful. This, coupled with the increasing demand for real-time data processing, has led to a growing need for on-orbit processing solutions. In this work, we investigate the performance of CNN-based object detectors on constrained devices by applying different image compression techniques to satellite data. We examine the capabilities of the NVIDIA Jetson Nano and NVIDIA Jetson AGX Xavier; low-power, high-performance computers, with integrated GPUs, small enough to fit on-board a nanosatellite. We take a closer look at object detection networks, including the Single Shot MultiBox Detector (SSD) and Region-based Fully Convolutional Network (R-FCN) models that are pre-trained on DOTA - a Large Scale Dataset for Object Detection in Aerial Images. The performance is measured in terms of execution time, memory consumption, and accuracy, and is compared against a baseline containing a server with two powerful GPUs. The results show that by applying image compression techniques, we are able to improve the execution time and memory consumption, achieving a fully runnable dataset. A lossless compression technique achieves roughly a 10% reduction in execution time and about a 3% reduction in memory consumption, with no impact on the accuracy. A lossy compression technique improves the execution time by up to 144% and reduces the memory consumption by as much as 97%; however, it has a significant impact on accuracy, varying depending on the compression ratio. Thus the application and ratio of these compression techniques may differ depending on the required level of accuracy for a particular task.
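
The measurement loop described here is easy to reproduce in miniature: compress an image at several lossy quality levels (plus a lossless reference) and time the decode step a detector would incur. A hedged sketch, where the input file name is hypothetical and PNG/JPEG stand in for whatever codecs the paper actually benchmarks:

```python
import io, time
from PIL import Image

def compression_report(path, qualities=(95, 75, 50, 25)):
    """Sketch: measure size and decode time of an image under lossy compression."""
    img = Image.open(path).convert("RGB")
    raw = io.BytesIO()
    img.save(raw, format="PNG")                          # lossless reference
    print(f"lossless PNG: {raw.tell() / 1024:.1f} KiB")
    for q in qualities:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=q)          # lossy compression at quality q
        buf.seek(0)
        t0 = time.perf_counter()
        Image.open(buf).load()                           # decode, as a detector input would
        dt = (time.perf_counter() - t0) * 1000
        print(f"JPEG q={q:>2}: {buf.getbuffer().nbytes / 1024:.1f} KiB, decode {dt:.2f} ms")

compression_report("satellite_tile.jpg")                 # hypothetical input file
```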

【3】 Multi-frame Collaboration for Effective Endoscopic Video Polyp Detection via Spatial-Temporal Feature Transformation

Authors: Lingyun Wu, Zhiqiang Hu, Yuanfeng Ji, Ping Luo, Shaoting Zhang
Affiliations: SenseTime Research, The University of Hong Kong
Note: Accepted by MICCAI 2021
Link: https://arxiv.org/abs/2107.03609
Abstract: Precise localization of polyp is crucial for early cancer screening in gastrointestinal endoscopy. Videos given by endoscopy bring both richer contextual information as well as more challenges than still images. The camera-moving situation, instead of the common camera-fixed-object-moving one, leads to significant background variation between frames. Severe internal artifacts (e.g. water flow in the human body, specular reflection by tissues) can make the quality of adjacent frames vary considerably. These factors hinder a video-based model to effectively aggregate features from neighborhood frames and give better predictions. In this paper, we present Spatial-Temporal Feature Transformation (STFT), a multi-frame collaborative framework to address these issues. Spatially, STFT mitigates inter-frame variations in the camera-moving situation with feature alignment by proposal-guided deformable convolutions. Temporally, STFT proposes a channel-aware attention module to simultaneously estimate the quality and correlation of adjacent frames for adaptive feature aggregation. Empirical studies and superior results demonstrate the effectiveness and stability of our method. For example, STFT improves the still image baseline FCOS by 10.6% and 20.6% on the comprehensive F1-score of the polyp localization task in CVC-Clinic and ASUMayo datasets, respectively, and outperforms the state-of-the-art video-based method by 3.6% and 8.0%, respectively. Code is available at https://github.com/lingyunwu14/STFT .
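
The channel-aware attention module is only described at a high level, so the following is a speculative sketch of the general idea - gating neighbour-frame features channel-by-channel according to their estimated quality/correlation with the target frame before aggregation - and not STFT's actual module; the gating design and all shapes are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAwareAggregation(nn.Module):
    """Sketch: weight neighbour-frame features by learned channel gates before fusing."""
    def __init__(self, channels=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.ReLU(),
                                  nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, target, neighbours):
        # target: (B, C, H, W); neighbours: (B, T, C, H, W) from adjacent frames
        t_vec = target.mean(dim=(2, 3))                      # global channel descriptor
        fused = target
        for i in range(neighbours.size(1)):
            n = neighbours[:, i]
            n_vec = n.mean(dim=(2, 3))
            w = self.gate(torch.cat([t_vec, n_vec], dim=1))  # per-channel quality/correlation gate
            fused = fused + w[:, :, None, None] * n          # adaptive aggregation
        return fused / (1 + neighbours.size(1))

out = ChannelAwareAggregation()(torch.randn(1, 256, 16, 16), torch.randn(1, 2, 256, 16, 16))
print(out.shape)  # torch.Size([1, 256, 16, 16])
```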

【4】 A hybrid deep learning framework for Covid-19 detection via 3D Chest CT Images

Authors: Shuang Liang
Affiliations: University of Science and Technology Beijing
Note: 5 pages, 1 figure, 2 tables
Link: https://arxiv.org/abs/2107.03904
Abstract: In this paper, we present a hybrid deep learning framework named CTNet which combines convolutional neural network and transformer together for the detection of COVID-19 via 3D chest CT images. It consists of a CNN feature extractor module with SE attention to extract sufficient features from CT scans, together with a transformer model to model the discriminative features of the 3D CT scans. Compared to previous works, CTNet provides an effective and efficient method to perform COVID-19 diagnosis via 3D CT scans with a data resampling strategy. State-of-the-art results on a large public benchmark, the COV19-CT-DB database, are achieved by the proposed CTNet, surpassing the baseline approach proposed together with the dataset.
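
Squeeze-and-Excitation (SE) attention, which the abstract names as part of the CNN extractor, is a standard block; a minimal PyTorch version is sketched below, with the conventional reduction ratio of 16 assumed.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: recalibrate channels by globally pooled gates."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                     # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                # squeeze: global average pooling
        w = self.fc(w)[:, :, None, None]      # excitation: per-channel gates
        return x * w                          # recalibrated feature maps

print(SEBlock(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```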

Classification & Recognition (8 papers)

【1】 Malware Classification Using Deep Boosted Learning

Authors: Muhammad Asam, Saddam Hussain Khan, Tauseef Jamal, Umme Zahoora, Asifullah Khan
Affiliations: Pattern Recognition Lab, Department of Computer & Information Sciences, PIEAS; PIEAS Artificial Intelligence Center (PAIC), PIEAS, Islamabad, Pakistan; Center for Mathematical Sciences, PIEAS, Nilore, Islamabad, Pakistan
Link: https://arxiv.org/abs/2107.04008
Abstract: Malicious activities in cyberspace have gone further than simply hacking machines and spreading viruses. It has become a challenge for a nation's survival and hence has evolved to cyber warfare. Malware is a key component of cyber-crime, and its analysis is the first line of defence against attack. This work proposes a novel deep boosted hybrid learning-based malware classification framework, named Deep boosted Feature Space-based Malware classification (DFS-MC). In the proposed framework, the discrimination power is enhanced by fusing the feature spaces of the best performing customized CNN architecture models, and classification is performed by an SVM. The discrimination capacity of the proposed classification framework is assessed by comparing it against the standard customized CNNs. The customized CNN models are implemented in two ways: softmax classifier and deep hybrid learning-based malware classification. In the hybrid learning, deep features are extracted from customized CNN architectures and fed into a conventional machine learning classifier to improve the classification performance. We also introduced the concept of transfer learning in a customized CNN architecture based malware classification framework through fine-tuning. The performance of the proposed malware classification approaches is validated on the MalImg malware dataset using the hold-out cross-validation technique. Experimental comparisons were conducted by employing innovative, customized CNNs, trained from scratch and fine-tuned using transfer learning. The proposed classification framework DFS-MC showed improved results: Accuracy: 98.61%, F-score: 0.96, Precision: 0.96, and Recall: 0.96.
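
The "deep hybrid learning" step - deep features fed into a conventional classifier - can be sketched in a few lines. Here a stock ResNet-18 and random tensors stand in for the paper's customized CNNs and the MalImg images, and the RBF-SVM settings are assumptions, so this shows the pattern rather than the authors' pipeline:

```python
import torch
import torchvision.models as models
from sklearn.svm import SVC

# Hypothetical tensors standing in for malware images rendered as 224x224 RGB arrays.
train_x, train_y = torch.randn(100, 3, 224, 224), torch.randint(0, 25, (100,))

backbone = models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()             # expose the 512-d penultimate features
backbone.eval()

with torch.no_grad():
    feats = backbone(train_x).numpy()         # deep feature space

clf = SVC(kernel="rbf").fit(feats, train_y.numpy())  # conventional classifier on deep features
print(clf.predict(feats[:5]))
```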

【2】 EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification

Authors: Subhranil Bagchi, Deepti R. Bathula
Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Ropar, India
Note: Preprint and supplementary material. 17 pages, 13 figures and 4 tables
Link: https://arxiv.org/abs/2107.03983
Abstract: Different categories of visual stimuli activate different responses in the human brain. These signals can be captured with EEG for utilization in applications such as Brain-Computer Interface (BCI). However, accurate classification of single-trial data is challenging due to low signal-to-noise ratio of EEG. This work introduces an EEG-ConvTransformer network that is based on multi-headed self-attention. Unlike other transformers, the model incorporates self-attention to capture inter-region interactions. It further extends to adjunct convolutional filters with multi-head attention as a single module to learn temporal patterns. Experimental results demonstrate that EEG-ConvTransformer achieves improved classification accuracy over the state-of-the-art techniques across five different visual stimuli classification tasks. Finally, quantitative analysis of inter-head diversity also shows low similarity in representational subspaces, emphasizing the implicit diversity of multi-head attention.

【3】 Image Resolution Susceptibility of Face Recognition Models

Authors: Martin Knoche, Stefan Hörmann, Gerhard Rigoll
Affiliations: Chair of Human-Machine Communication, Technical University of Munich, Germany
Note: 19 pages, 15 figures, 2 tables
Link: https://arxiv.org/abs/2107.03769
Abstract: Face recognition approaches often rely on equal image resolution for verifying faces on two images. However, in practical applications, those image resolutions are usually not in the same range due to different image capture mechanisms or sources. In this work, we first analyze the impact of image resolutions on the face verification performance with a state-of-the-art face recognition model. For images, synthetically reduced to $5\, \times 5\, \mathrm{px}$ resolution, the verification performance drops from $99.23\%$ down to almost $55\%$. Especially for cross-resolution image pairs (one high- and one low-resolution image), the verification accuracy decreases even further. We investigate this behavior more in-depth by looking at the feature distances for every 2-image test pair. To tackle this problem, we propose the following two methods: 1) Train a state-of-the-art face-recognition model straightforwardly with $50\%$ low-resolution images directly within each batch. 2) Train a siamese-network structure and add a cosine distance feature loss between high- and low-resolution features. Both methods show an improvement for cross-resolution scenarios and can increase the accuracy at very low resolution to approximately $70\%$. However, a disadvantage is that a specific model needs to be trained for every resolution-pair ...
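
The two ingredients of the proposed training schemes - synthetic resolution degradation and a cosine distance loss between high- and low-resolution embeddings - can be sketched as follows; the 112x112 input size and the flattened-image stand-ins for encoder features are assumptions.

```python
import torch
import torch.nn.functional as F

def make_low_res(x, size=5):
    """Synthetically degrade a face crop to size x size px (e.g. 5x5), then upsample back."""
    lr = F.interpolate(x, size=(size, size), mode="bilinear", align_corners=False)
    return F.interpolate(lr, size=x.shape[-2:], mode="bilinear", align_corners=False)

def cosine_feature_loss(f_hr, f_lr):
    """Pull high- and low-resolution embeddings of the same face together."""
    return (1 - F.cosine_similarity(f_hr, f_lr, dim=1)).mean()

x = torch.randn(4, 3, 112, 112)                           # hypothetical face batch
f_hr, f_lr = x.flatten(1), make_low_res(x).flatten(1)     # stand-ins for encoder outputs
print(cosine_feature_loss(f_hr, f_lr))
```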

【4】 Investigate the Essence of Long-Tailed Recognition from a Unified Perspective

Authors: Lei Liu, Li Liu
Affiliations: Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen
Link: https://arxiv.org/abs/2107.03758
Abstract: As the data scale grows, deep recognition models often suffer from long-tailed data distributions due to the heavy imbalanced sample number across categories. Indeed, real-world data usually exhibit some similarity relation among different categories (e.g., pigeons and sparrows), called category similarity in this work. It is doubly difficult when the imbalance occurs between such categories with similar appearances. However, existing solutions mainly focus on the sample number to re-balance data distribution. In this work, we systematically investigate the essence of the long-tailed problem from a unified perspective. Specifically, we demonstrate that long-tailed recognition suffers from both sample number and category similarity. Intuitively, using a toy example, we first show that sample number is not the unique influence factor for performance dropping of long-tailed recognition. Theoretically, we demonstrate that (1) category similarity, as an inevitable factor, would also influence the model learning under long-tailed distribution via similar samples, (2) using more discriminative representation methods (e.g., self-supervised learning) for similarity reduction, the classifier bias can be further alleviated with greatly improved performance. Extensive experiments on several long-tailed datasets verify the rationality of our theoretical analysis, and show that based on existing state-of-the-arts (SOTAs), the performance could be further improved by similarity reduction. Our investigations highlight the essence behind the long-tailed problem, and claim several feasible directions for future work.

【5】 Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning

Authors: Luis Lucas, David Tomas, Jose Garcia-Rodriguez
Affiliations: Institute of Informatics Research, University of Alicante, Alicante, Spain
Link: https://arxiv.org/abs/2107.03751
Abstract: One of the main issues related to unsupervised machine learning is the cost of processing and extracting useful information from large datasets. In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture in multimodal environments (image and text) from social media. For this purpose, we used the InstaNY100K dataset and proposed a validation approach based on sampling techniques. Our experiments, based on image classification tasks according to the labels of the Places dataset, are performed by first considering only the visual part, and then adding the associated texts as support. The results obtained demonstrated that trained neural networks such as CLIP can be successfully applied to image classification with little fine-tuning, and considering the associated texts to the images can help to improve the accuracy depending on the goal. The results demonstrated what seems to be a promising research direction.
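
Zero-shot classification with CLIP follows the standard OpenAI CLIP API; the sketch below applies it to a hypothetical social-media image with made-up Places-style labels. Only the usage pattern is standard; the prompts and file name are assumptions.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["beach", "restaurant", "park", "museum"]      # hypothetical Places-style labels
text = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)
image = preprocess(Image.open("instagram_post.jpg")).unsqueeze(0).to(device)  # hypothetical post

with torch.no_grad():
    logits_per_image, _ = model(image, text)            # image-text similarity scores
    probs = logits_per_image.softmax(dim=-1)

print(labels[probs.argmax().item()])                    # zero-shot predicted label
```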

【6】 An Embedded Iris Recognition System Optimization using Dynamically Reconfigurable Decoder with LDPC Codes

Authors: Longyu Ma, Chiu-Wing Sham, Chun Yan Lo, Xinchao Zhong
Affiliations: School of Computer Science, The University of Auckland
Note: 8 pages, 6 figures
Link: https://arxiv.org/abs/2107.03688
Abstract: Extracting and analyzing iris textures for biometric recognition has been extensively studied. As iris recognition transitions from lab technology to nation-scale applications, most systems are facing high complexity in either time or space, leading to unfitness for embedded devices. In this paper, the proposed design includes a minimal set of computer vision modules and a multi-mode QC-LDPC decoder which can alleviate variability and noise caused by iris acquisition and follow-up processing. Several classes of QC-LDPC code from IEEE 802.16 are tested for the validity of accuracy improvement. Some of the codes mentioned above are used for further QC-LDPC decoder quantization, validation and comparison to each other. We show that we can apply Dynamic Partial Reconfiguration technology to implement the multi-mode QC-LDPC decoder for the iris recognition system. The results show that the implementation is power-efficient and good for edge applications.

【7】 Rethinking of Pedestrian Attribute Recognition: A Reliable Evaluation under Zero-Shot Pedestrian Identity Setting

Authors: Jian Jia, Houjing Huang, Xiaotang Chen, Kaiqi Huang
Affiliations: Institute of Automation
Note: 13 pages, 6 figures, journal version of arXiv:2005.11909, submitted to TIP
Link: https://arxiv.org/abs/2107.03576
Abstract: Pedestrian attribute recognition aims to assign multiple attributes to one pedestrian image captured by a video surveillance camera. Although numerous methods are proposed and make tremendous progress, we argue that it is time to step back and analyze the status quo of the area. We review and rethink the recent progress from three perspectives. First, given that there is no explicit and complete definition of pedestrian attribute recognition, we formally define and distinguish pedestrian attribute recognition from other similar tasks. Second, based on the proposed definition, we expose the limitations of the existing datasets, which violate the academic norm and are inconsistent with the essential requirement of practical industry application. Thus, we propose two datasets, PETA\textsubscript{$ZS$} and RAP\textsubscript{$ZS$}, constructed following the zero-shot settings on pedestrian identity. In addition, we also introduce several realistic criteria for future pedestrian attribute dataset construction. Finally, we reimplement existing state-of-the-art methods and introduce a strong baseline method to give reliable evaluations and fair comparisons. Experiments are conducted on four existing datasets and two proposed datasets to measure progress on pedestrian attribute recognition.
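
The zero-shot identity setting the authors propose boils down to splitting by pedestrian identity rather than by image, so no identity appears in both the train and test sets. A minimal sketch with a toy sample format:

```python
import random

def zero_shot_identity_split(samples, train_ratio=0.7, seed=0):
    """Split attribute samples so no pedestrian identity appears in both sets.
    samples: list of (image_path, identity_id, attribute_vector) tuples."""
    ids = sorted({pid for _, pid, _ in samples})
    random.Random(seed).shuffle(ids)
    train_ids = set(ids[:int(train_ratio * len(ids))])
    train = [s for s in samples if s[1] in train_ids]
    test = [s for s in samples if s[1] not in train_ids]
    return train, test

data = [(f"img_{i}.jpg", i % 50, [0, 1]) for i in range(500)]   # toy data: 50 identities
train, test = zero_shot_identity_split(data)
assert not {s[1] for s in train} & {s[1] for s in test}         # identities are disjoint
```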

【8】 An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild

Authors: Panagiotis Antoniadis, Ioannis Pikoulis, Panagiotis P. Filntisis, Petros Maragos
Affiliations: School of ECE, National Technical University of Athens, Athens, Greece
Note: 6 pages, 1 figure, 2 tables, submitted to the 2nd Affective Behavior Analysis in-the-wild (ABAW2) Competition. arXiv admin note: text overlap with arXiv:2105.07484
Link: https://arxiv.org/abs/2107.03465
Abstract: In this work we tackle the task of video-based audio-visual emotion recognition, within the premises of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Standard methodologies that rely solely on the extraction of facial features often fall short of accurate emotion prediction in cases where the aforementioned source of affective information is inaccessible due to head/body orientation, low resolution and poor illumination. We aspire to alleviate this problem by leveraging bodily as well as contextual features, as part of a broader emotion recognition framework. A standard CNN-RNN cascade constitutes the backbone of our proposed model for sequence-to-sequence (seq2seq) learning. Apart from learning through the RGB input modality, we construct an aural stream which operates on sequences of extracted mel-spectrograms. Our extensive experiments on the challenging and newly assembled Affect-in-the-wild-2 (Aff-Wild2) dataset verify the superiority of our methods over existing approaches, while by properly incorporating all of the aforementioned modules in a network ensemble, we manage to surpass the previous best published recognition scores, in the official validation set. All the code was implemented using PyTorch (https://pytorch.org/) and is publicly available at https://github.com/PanosAntoniadis/NTUA-ABAW2021 .

Segmentation & Semantics (5 papers)

【1】 SCSS-Net: Superpoint Constrained Semi-supervised Segmentation Network for 3D Indoor Scenes

Authors: Shuang Deng, Qiulei Dong, Bo Liu
Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences
Link: https://arxiv.org/abs/2107.03601
Abstract: Many existing deep neural networks (DNNs) for 3D point cloud semantic segmentation require a large amount of fully labeled training data. However, manually assigning point-level labels on the complex scenes is time-consuming. While unlabeled point clouds can be easily obtained from sensors or reconstruction, we propose a superpoint constrained semi-supervised segmentation network for 3D point clouds, named as SCSS-Net. Specifically, we use the pseudo labels predicted from unlabeled point clouds for self-training, and the superpoints produced by geometry-based and color-based Region Growing algorithms are combined to modify and delete pseudo labels with low confidence. Additionally, we propose an edge prediction module to constrain the features from edge points of geometry and color. A superpoint feature aggregation module and superpoint feature consistency loss functions are introduced to smooth the point features in each superpoint. Extensive experimental results on two 3D public indoor datasets demonstrate that our method can achieve better performance than some state-of-the-art point cloud segmentation networks and some popular semi-supervised segmentation methods with few labeled scenes.
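
How superpoints might modify and delete low-confidence pseudo-labels is sketched below as majority voting within each region; this is a hedged reading of the abstract, not SCSS-Net's actual rule, and the thresholds are assumptions.

```python
import numpy as np

def refine_pseudo_labels(pseudo, conf, superpoint_id, conf_thresh=0.9, purity=0.8):
    """Sketch: smooth per-point pseudo-labels inside each superpoint, drop unreliable ones.
    pseudo/conf: (N,) predicted labels and confidences; superpoint_id: (N,) region index."""
    refined = np.full_like(pseudo, -1)                # -1 = ignored during self-training
    for sp in np.unique(superpoint_id):
        mask = superpoint_id == sp
        keep = conf[mask] > conf_thresh               # high-confidence points only
        if keep.sum() == 0:
            continue
        labels, counts = np.unique(pseudo[mask][keep], return_counts=True)
        if counts.max() / keep.sum() >= purity:       # superpoint is label-consistent
            refined[mask] = labels[counts.argmax()]   # assign majority label to the region
    return refined

refined = refine_pseudo_labels(np.array([0, 0, 1, 2, 2, 2]),
                               np.array([.95, .92, .40, .99, .97, .96]),
                               np.array([0, 0, 0, 1, 1, 1]))
print(refined)  # [0 0 0 2 2 2]
```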

【2】 Comparing ML based Segmentation Models on Jet Fire Radiation Zone

Authors: Carmina Pérez-Guerrero, Adriana Palacios, Gilberto Ochoa-Ruiz, Christian Mata, Miguel Gonzalez-Mendoza, Luis Eduardo Falcón-Morales
Affiliations: Tecnológico de Monterrey, School of Engineering and Sciences, Mexico; Universidad de las Americas Puebla, Department of Chemical, Food and Environmental Engineering, Puebla, Mexico; Universitat Politecnica de Catalunya, EEBE, Eduard Maristany
Note: This pre-print is currently under review for the Mexican Conference on AI (MICAI 2021)
Link: https://arxiv.org/abs/2107.03461
Abstract: Risk assessment is relevant in any workplace, however there is a degree of unpredictability when dealing with flammable or hazardous materials so that detection of fire accidents by itself may not be enough. An example of this is the impingement of jet fires, where the heat fluxes of the flame could reach nearby equipment and dramatically increase the probability of a domino effect with catastrophic results. Because of this, the characterization of such fire accidents is important from a risk management point of view. One such characterization would be the segmentation of different radiation zones within the flame, so this paper presents an exploratory research regarding several traditional computer vision and Deep Learning segmentation approaches to solve this specific problem. A data set of propane jet fires is used to train and evaluate the different approaches, and given the difference in the distribution of the zones and background of the images, different loss functions that seek to alleviate data imbalance are also explored. Additionally, different metrics are correlated to a manual ranking performed by experts to make an evaluation that closely resembles the experts' criteria. The Hausdorff Distance and Adjusted Rand Index were the metrics with the highest correlation, and the best results were obtained from the UNet architecture with a Weighted Cross-Entropy Loss. These results can be used in future research to extract more geometric information from the segmentation masks or could even be implemented on other types of fire accidents.

【3】 Atlas-Based Segmentation of Intracochlear Anatomy in Metal Artifact Affected CT Images of the Ear with Co-trained Deep Neural Networks

Authors: Jianing Wang, Dingjie Su, Yubo Fan, Srijata Chakravorti, Jack H. Noble, Benoit M. Dawant
Affiliations: Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN, USA
Note: 10 pages, 5 figures
Link: https://arxiv.org/abs/2107.03987
Abstract: We propose an atlas-based method to segment the intracochlear anatomy (ICA) in the post-implantation CT (Post-CT) images of cochlear implant (CI) recipients that preserves the point-to-point correspondence between the meshes in the atlas and the segmented volumes. To solve this problem, which is challenging because of the strong artifacts produced by the implant, we use a pair of co-trained deep networks that generate dense deformation fields (DDFs) in opposite directions. One network is tasked with registering an atlas image to the Post-CT images and the other network is tasked with registering the Post-CT images to the atlas image. The networks are trained using loss functions based on voxel-wise labels, image content, fiducial registration error, and a cycle-consistency constraint. The segmentation of the ICA in the Post-CT images is subsequently obtained by transferring the predefined segmentation meshes of the ICA in the atlas image to the Post-CT images using the corresponding DDFs generated by the trained registration networks. Our model can learn the underlying geometric features of the ICA even though they are obscured by the metal artifacts. We show that our end-to-end network produces results that are comparable to the current state of the art (SOTA) that relies on a two-step approach that first uses conditional generative adversarial networks to synthesize artifact-free images from the Post-CT images and then uses an active shape model-based method to segment the ICA in the synthetic images. Our method requires a fraction of the time needed by the SOTA, which is important for end-user acceptance.

【4】 Joint Motion Correction and Super Resolution for Cardiac Segmentation via Latent Optimisation

Authors: Shuo Wang, Chen Qin, Nicolo Savioli, Chen Chen, Declan O'Regan, Stuart Cook, Yike Guo, Daniel Rueckert, Wenjia Bai
Affiliations: Digital Medical Research Center, Fudan University, China; Shanghai Key Laboratory of MICCAI, China; Data Science Institute, Imperial College London, UK; Department of Computing, Imperial College London, UK
Note: The paper is early accepted to MICCAI 2021. The codes are available at this https URL
Link: https://arxiv.org/abs/2107.03887
Abstract: In cardiac magnetic resonance (CMR) imaging, a 3D high-resolution segmentation of the heart is essential for detailed description of its anatomical structures. However, due to the limit of acquisition duration and respiratory/cardiac motion, stacks of multi-slice 2D images are acquired in clinical routine. The segmentation of these images provides a low-resolution representation of cardiac anatomy, which may contain artefacts caused by motion. Here we propose a novel latent optimisation framework that jointly performs motion correction and super resolution for cardiac image segmentations. Given a low-resolution segmentation as input, the framework accounts for inter-slice motion in cardiac MR imaging and super-resolves the input into a high-resolution segmentation consistent with input. A multi-view loss is incorporated to leverage information from both short-axis view and long-axis view of cardiac imaging. To solve the inverse problem, iterative optimisation is performed in a latent space, which ensures the anatomical plausibility. This alleviates the need of paired low-resolution and high-resolution images for supervised learning. Experiments on two cardiac MR datasets show that the proposed framework achieves high performance, comparable to state-of-the-art super-resolution approaches and with better cross-domain generalisability and anatomical plausibility.
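
The latent-optimisation idea - searching the latent space of a pretrained decoder for a high-resolution segmentation consistent with the low-resolution input - can be sketched as a data-consistency loop. The toy decoder, the trilinear downsampling as the forward model, and the omission of motion correction and the multi-view loss are all simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def latent_optimise(decoder, low_res_seg, z_dim=64, steps=200, lr=0.1):
    """Optimise a latent code so the decoder's high-res segmentation, once
    downsampled, matches the observed low-res segmentation."""
    z = torch.zeros(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        hr = decoder(z)                                              # (1, C, D, H, W) logits
        lr_pred = F.interpolate(hr, size=low_res_seg.shape[2:],
                                mode="trilinear", align_corners=False)
        loss = F.cross_entropy(lr_pred, low_res_seg.argmax(dim=1))   # data-consistency term
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return decoder(z)

# Toy stand-in for a pretrained anatomical decoder; 4 classes, 8x16x16 output volume.
decoder = nn.Sequential(nn.Linear(64, 4 * 8 * 16 * 16), nn.Unflatten(1, (4, 8, 16, 16)))
low_res = torch.randn(1, 4, 2, 16, 16)        # 2 thick slices, as acquired clinically
print(latent_optimise(decoder, low_res, steps=20).shape)  # torch.Size([1, 4, 8, 16, 16])
```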

【5】 Modality Completion via Gaussian Process Prior Variational Autoencoders for Multi-Modal Glioma Segmentation

Authors: Mohammad Hamghalam, Alejandro F. Frangi, Baiying Lei, Amber L. Simpson
Affiliations: School of Computing, Queen's University, Kingston, ON, Canada; Department of Electrical, Biomedical, and Mechatronics Engineering, Qazvin Branch, Azad University, Qazvin, Iran
Note: Accepted in MICCAI 2021
Link: https://arxiv.org/abs/2107.03442
Abstract: In large studies involving multi protocol Magnetic Resonance Imaging (MRI), one or more sub-modalities may be missing for a given patient owing to poor quality (e.g. imaging artifacts), failed acquisitions, or interrupted imaging examinations. In some cases, certain protocols are unavailable due to limited scan time or to retrospectively harmonise the imaging protocols of two independent studies. Missing image modalities pose a challenge to segmentation frameworks as complementary information contributed by the missing scans is then lost. In this paper, we propose a novel model, Multi-modal Gaussian Process Prior Variational Autoencoder (MGP-VAE), to impute one or more missing sub-modalities for a patient scan. MGP-VAE can leverage the Gaussian Process (GP) prior on the Variational Autoencoder (VAE) to utilize the subjects/patients and sub-modalities correlations. Instead of designing one network for each possible subset of present sub-modalities or using frameworks to mix feature maps, missing data can be generated from a single model based on all the available samples. We show the applicability of MGP-VAE on brain tumor segmentation where either one, two, or three of four sub-modalities may be missing. Our experiments against competitive segmentation baselines with missing sub-modality on the BraTS'19 dataset indicate the effectiveness of the MGP-VAE model for segmentation tasks.

Zero/Few-Shot | Transfer | Domain Adaptation (2 papers)

【1】 RMA: Rapid Motor Adaptation for Legged Robots

Authors: Ashish Kumar, Zipeng Fu, Deepak Pathak, Jitendra Malik
Affiliations: UC Berkeley, Carnegie Mellon University, Facebook
Note: RSS 2021. Webpage at this https URL
Link: https://arxiv.org/abs/2107.04034
Abstract: Successful real-world deployment of legged robots would require them to adapt in real-time to unseen scenarios like changing terrains, changing payloads, wear and tear. This paper presents Rapid Motor Adaptation (RMA) algorithm to solve this problem of real-time online adaptation in quadruped robots. RMA consists of two components: a base policy and an adaptation module. The combination of these components enables the robot to adapt to novel situations in fractions of a second. RMA is trained completely in simulation without using any domain knowledge like reference trajectories or predefined foot trajectory generators and is deployed on the A1 robot without any fine-tuning. We train RMA on a varied terrain generator using bioenergetics-inspired rewards and deploy it on a variety of difficult terrains including rocky, slippery, deformable surfaces in environments with grass, long vegetation, concrete, pebbles, stairs, sand, etc. RMA shows state-of-the-art performance across diverse real-world as well as simulation experiments. Video results at https://ashish-kmr.github.io/rma-legged-robots/

【2】 Deep Learning Based Image Retrieval in the JPEG Compressed Domain

Authors: Shrikant Temburwar, Bulla Rajesh, Mohammed Javed
Affiliations: Computer Vision and Biometrics Laboratory (CVBL), Department of Information Technology, Indian Institute of Information Technology Allahabad, Prayagraj, U.P., India
Note: Accepted in MISP2021
Link: https://arxiv.org/abs/2107.03648
Abstract: Content-based image retrieval (CBIR) systems on pixel domain use low-level features, such as colour, texture and shape, to retrieve images. In this context, two types of image representations i.e. local and global image features have been studied in the literature. Extracting these features from pixel images and comparing them with images from the database is very time-consuming. Therefore, in recent years, there has been some effort to accomplish image analysis directly in the compressed domain with lesser computations. Furthermore, most of the images in our daily transactions are stored in the JPEG compressed format. Therefore, it would be ideal if we could retrieve features directly from the partially decoded or compressed data and use them for retrieval. Here, we propose a unified model for image retrieval which takes DCT coefficients as input and efficiently extracts global and local features directly in the JPEG compressed domain for accurate image retrieval. The experimental findings indicate that our proposed model performed similarly to the current DELG model, which takes RGB features as an input, with reference to mean average precision, while having a faster training and retrieval speed.
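
The compressed-domain input the model operates on is the JPEG-style 8x8 block DCT; a sketch of extracting such coefficients (keeping only a few low-frequency terms per block, in raster rather than zig-zag order) is below. This illustrates the input representation only, not the proposed DELG-style network.

```python
import numpy as np
from scipy.fft import dctn

def block_dct_features(gray, block=8, keep=10):
    """Sketch: 8x8 block DCT (as in JPEG), keeping the first low-frequency
    coefficients of each block as a crude compressed-domain descriptor."""
    h, w = (np.array(gray.shape) // block) * block
    gray = gray[:h, :w].astype(np.float32)
    feats = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            coeffs = dctn(gray[i:i + block, j:j + block], norm="ortho")
            feats.append(coeffs.flatten()[:keep])   # low frequencies carry most energy
    return np.stack(feats)                          # (num_blocks, keep)

print(block_dct_features(np.random.rand(64, 64)).shape)  # (64, 10)
```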

Semi/Weakly/Unsupervised | Active Learning | Uncertainty (5 papers)

【1】 Uncertainty-Aware Camera Pose Estimation from Points and Lines

Authors: Alexander Vakhitov, Luis Ferraz Colomina, Antonio Agudo, Francesc Moreno-Noguer
Affiliations: SLAMcore, UK; Kognia Sports Intelligence, Spain; Institut de Robotica i Informatica Industrial, CSIC-UPC, Spain
Note: CVPR 2021
Link: https://arxiv.org/abs/2107.03890
Abstract: Perspective-n-Point-and-Line (P$n$PL) algorithms aim at fast, accurate, and robust camera localization with respect to a 3D model from 2D-3D feature correspondences, being a major part of modern robotic and AR/VR systems. Current point-based pose estimation methods use only 2D feature detection uncertainties, and the line-based methods do not take uncertainties into account. In our setup, both 3D coordinates and 2D projections of the features are considered uncertain. We propose PnP(L) solvers based on EPnP and DLS for the uncertainty-aware pose estimation. We also modify motion-only bundle adjustment to take 3D uncertainties into account. We perform exhaustive synthetic and real experiments on two different visual odometry datasets. The new PnP(L) methods outperform the state-of-the-art on real data in isolation, showing an increase in mean translation accuracy by 18% on a representative subset of KITTI, while the new uncertain refinement improves pose accuracy for most of the solvers, e.g. decreasing mean translation error for the EPnP by 16% compared to the standard refinement on the same dataset. The code is available at https://alexandervakhitov.github.io/uncertain-pnp/.

【2】 NccFlow: Unsupervised Learning of Optical Flow With Non-occlusion from Geometry

Authors: Guangming Wang, Shuaiqi Ren, Hesheng Wang
Affiliations: Key Laboratory of System Control and Information Processing of Ministry of Education, Department of Automation, Institute of Medical Robotics, Shanghai Jiao Tong University
Note: 10 pages, 7 figures, under review
Link: https://arxiv.org/abs/2107.03610
Abstract: Optical flow estimation is a fundamental problem of computer vision and has many applications in the fields of robot learning and autonomous driving. This paper reveals novel geometric laws of optical flow based on the insight and detailed definition of non-occlusion. Then, two novel loss functions are proposed for the unsupervised learning of optical flow based on the geometric laws of non-occlusion. Specifically, after the occlusion part of the images are masked, the flowing process of pixels is carefully considered and geometric constraints are conducted based on the geometric laws of optical flow. First, neighboring pixels in the first frame will not intersect during the pixel displacement to the second frame. Secondly, when the cluster containing adjacent four pixels in the first frame moves to the second frame, no other pixels will flow into the quadrilateral formed by them. According to the two geometrical constraints, the optical flow non-intersection loss and the optical flow non-blocking loss in the non-occlusion regions are proposed. Two loss functions punish the irregular and inexact optical flows in the non-occlusion regions. The experiments on datasets demonstrated that the proposed unsupervised losses of optical flow based on the geometric laws in non-occlusion regions make the estimated optical flow more refined in detail, and improve the performance of unsupervised learning of optical flow. In addition, the experiments training on synthetic data and evaluating on real data show that the generalization ability of optical flow network is improved by our proposed unsupervised approach.

【3】 Video 3D Sampling for Self-supervised Representation Learning

Authors: Wei Li, Dezhao Luo, Bo Fang, Yu Zhou, Weiping Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences
Note: 9 pages, 5 figures, 6 tables
Link: https://arxiv.org/abs/2107.03578
Abstract: Most of the existing video self-supervised methods mainly leverage temporal signals of videos, ignoring that the semantics of moving objects and environmental information are all critical for video-related tasks. In this paper, we propose a novel self-supervised method for video representation learning, referred to as Video 3D Sampling (V3S). In order to sufficiently utilize the information (spatial and temporal) provided in videos, we pre-process a video from three dimensions (width, height, time). As a result, we can leverage the spatial information (the size of objects), temporal information (the direction and magnitude of motions) as our learning target. In our implementation, we combine the sampling of the three dimensions and propose the scale and projection transformations in space and time respectively. The experimental results show that, when applied to action recognition, video retrieval and action similarity labeling, our approach improves the state-of-the-arts with significant margins.

【4】 Uncertainty-aware Human Motion Prediction

Authors: Pengxiang Ding, Jianqin Yin
Link: https://arxiv.org/abs/2107.03575
Abstract: Human motion prediction is essential for tasks such as human motion analysis and human-robot interactions. Many approaches have been proposed to realize motion prediction. However, they ignore an important task: the evaluation of the quality of the predicted result. This is far from sufficient for actual scenarios, because people cannot know how to interact with the machine without an evaluation of the prediction, and unreliable predictions may mislead the machine into harming the human. Hence, we propose an uncertainty-aware framework for human motion prediction (UA-HMP). Concretely, we first design an uncertainty-aware predictor through Gaussian modeling to achieve the value and the uncertainty of predicted motion. Then, an uncertainty-guided learning scheme is proposed to quantitate the uncertainty and reduce the negative effect of the noisy samples during optimization for better performance. Our proposed framework is easily combined with current SOTA baselines to overcome their weakness in uncertainty modeling with slight parameter increment. Extensive experiments also show that they can achieve better performance in both short and long-term predictions on H3.6M and CMU-Mocap.
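
The Gaussian-modeling predictor can be sketched as a head that outputs a mean and a variance per joint coordinate, trained with the Gaussian negative log-likelihood so that noisy samples are down-weighted; the feature and output dimensions are assumptions.

```python
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    """Sketch: predict both a pose estimate and its variance."""
    def __init__(self, feat_dim=256, out_dim=48):       # e.g. 16 joints x 3 coordinates
        super().__init__()
        self.mean = nn.Linear(feat_dim, out_dim)
        self.log_var = nn.Linear(feat_dim, out_dim)

    def forward(self, h):
        return self.mean(h), self.log_var(h).exp()      # predicted motion and its variance

head, nll = UncertaintyHead(), nn.GaussianNLLLoss()
h, target = torch.randn(8, 256), torch.randn(8, 48)
mean, var = head(h)
loss = nll(mean, target, var)                           # high-variance (noisy) samples count less
print(loss)
```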

【5】 Label-set Loss Functions for Partial Supervision: Application to Fetal Brain 3D MRI Parcellation

Authors: Lucas Fidon, Michael Aertsen, Doaa Emam, Nada Mufti, Frederic Guffens, Thomas Deprest, Philippe Demaerel, Anna L. David, Andrew Melbourne, Sebastien Ourselin, Jan Deprest, Tom Vercauteren
Affiliations: School of Biomedical Engineering & Imaging Sciences, King's College London, UK; Department of Radiology, University Hospitals Leuven, Belgium; Institute for Women's Health, University College London, UK
Note: Accepted at MICCAI 2021
Link: https://arxiv.org/abs/2107.03846
Abstract: Deep neural networks have increased the accuracy of automatic segmentation, however, their accuracy depends on the availability of a large number of fully segmented images. Methods to train deep neural networks using images for which some, but not all, regions of interest are segmented are necessary to make better use of partially annotated datasets. In this paper, we propose the first axiomatic definition of label-set loss functions that are the loss functions that can handle partially segmented images. We prove that there is one and only one method to convert a classical loss function for fully segmented images into a proper label-set loss function. Our theory also allows us to define the leaf-Dice loss, a label-set generalization of the Dice loss particularly suited for partial supervision with only missing labels. Using the leaf-Dice loss, we set a new state of the art in partially supervised learning for fetal brain 3D MRI segmentation. We achieve a deep neural network able to segment white matter, ventricles, cerebellum, extra-ventricular CSF, cortical gray matter, deep gray matter, brainstem, and corpus callosum based on fetal brain 3D MRI of anatomically normal fetuses or with open spina bifida. Our implementation of the proposed label-set loss functions is available at https://github.com/LucasFidon/label-set-loss-functions
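
The exact leaf-Dice definition is in the paper and its reference implementation (linked above). As a rough illustration of the underlying idea - scoring only the labels actually annotated in a partially segmented image - consider this hedged sketch:

```python
import torch

def partial_dice_loss(probs, target, present_classes, eps=1e-6):
    """Sketch in the spirit of label-set losses: average soft Dice over only the
    classes annotated in this partially segmented image.
    probs: (B, C, ...) softmax output; target: (B, ...) integer labels;
    present_classes: class indices annotated for this sample."""
    dice = []
    for c in present_classes:
        p = probs[:, c]
        g = (target == c).float()
        inter = (p * g).sum()
        dice.append((2 * inter + eps) / (p.sum() + g.sum() + eps))
    return 1 - torch.stack(dice).mean()

probs = torch.softmax(torch.randn(1, 9, 32, 32, 32), dim=1)      # 9 fetal-brain classes (toy)
target = torch.randint(0, 9, (1, 32, 32, 32))
print(partial_dice_loss(probs, target, present_classes=[1, 3]))  # only two labels available
```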

Temporal | Action Recognition | Pose | Video | Motion Estimation (4 papers)

【1】 Use of Affective Visual Information for Summarization of Human-Centric Videos

Authors: Berkay Köprü, Engin Erzin
Affiliations: Koç University
Note: 12 pages
Link: https://arxiv.org/abs/2107.03783
Abstract: Increasing volume of user-generated human-centric video content and their applications, such as video retrieval and browsing, require compact representations that are addressed by the video summarization literature. Current supervised studies formulate video summarization as a sequence-to-sequence learning problem and the existing solutions often neglect the surge of human-centric view, which inherently contains affective content. In this study, we investigate the affective-information enriched supervised video summarization task for human-centric videos. First, we train a visual input-driven state-of-the-art continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate emotional attributes. Then, we integrate the estimated emotional attributes and the high-level representations from the CER-NET with the visual information to define the proposed affective video summarization architectures (AVSUM). In addition, we investigate the use of attention to improve the AVSUM architectures and propose two new architectures based on temporal attention (TA-AVSUM) and spatial attention (SA-AVSUM). We conduct video summarization experiments on the TvSum database. The proposed AVSUM-GRU architecture with an early fusion of high level GRU embeddings and the temporal attention based TA-AVSUM architecture attain competitive video summarization performances by bringing strong performance improvements for the human-centric videos compared to the state-of-the-art in terms of F-score and self-defined face recall metrics.

【2】 4D Attention: Comprehensive Framework for Spatio-Temporal Gaze Mapping

Authors: Shuji Oishi, Kenji Koide, Masashi Yokozuka, Atsuhiko Banno
Affiliations: National Institute of Advanced Industrial Science and Technology (AIST)
Link: https://arxiv.org/abs/2107.03606
Abstract: This study presents a framework for capturing human attention in the spatio-temporal domain using eye-tracking glasses. Attention mapping is a key technology for human perceptual activity analysis or Human-Robot Interaction (HRI) to support human visual cognition; however, measuring human attention in dynamic environments is challenging owing to the difficulty in localizing the subject and dealing with moving objects. To address this, we present a comprehensive framework, 4D Attention, for unified gaze mapping onto static and dynamic objects. Specifically, we estimate the glasses pose by leveraging a loose coupling of direct visual localization and Inertial Measurement Unit (IMU) values. Further, by installing reconstruction components into our framework, dynamic objects not captured in the 3D environment map are instantiated based on the input images. Finally, a scene rendering component synthesizes a first-person view with identification (ID) textures and performs direct 2D-3D gaze association. Quantitative evaluations showed the effectiveness of our framework. Additionally, we demonstrated the applications of 4D Attention through experiments in real situations.

【3】 Relation-Based Associative Joint Location for Human Pose Estimation in Videos

Authors: Yonghao Dang, Jianqin Yin
Affiliations: Beijing University of Posts and Telecommunications, Beijing, China
Link: https://arxiv.org/abs/2107.03591
Abstract: Video-based human pose estimation (HPE) is a vital yet challenging task. While deep learning methods have made significant progress for the HPE, most approaches to this task detect each joint independently, damaging the pose structural information. In this paper, unlike the prior methods, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) to locate joints associatively. Specifically, we design a lightweight joint relation extractor (JRE) to model the pose structural features and associatively generate heatmaps for joints by modeling the relation between any two joints heuristically instead of building each joint heatmap independently. Actually, the proposed JRE module models the spatial configuration of human poses through the relationship between any two joints. Moreover, considering the temporal semantic continuity of videos, the pose semantic information in the current frame is beneficial for guiding the location of joints in the next frame. Therefore, we use the idea of knowledge reuse to propagate the pose semantic information between consecutive frames. In this way, the proposed RPSTN captures temporal dynamics of poses. On the one hand, the JRE module can infer invisible joints according to the relationship between the invisible joints and other visible joints in space. On the other hand, in time, the proposed model can transfer the pose semantic features from the non-occluded frame to the occluded frame to locate occluded joints. Therefore, our method is robust to the occlusion and achieves state-of-the-art results on the two challenging datasets, which demonstrates its effectiveness for video-based human pose estimation. We will release the code and models publicly.

【4】 Automated Object Behavioral Feature Extraction for Potential Risk Analysis based on Video Sensor 标题:基于视频传感器的潜在风险分析目标行为特征自动提取

作者:Byeongjoon Noh,Wonjun Noh,David Lee,Hwasoo Yeo 机构:Applied Science Research Institute, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, Department of Civil and Environmental Engineering 备注:6 pages, 9 figures 链接:https://arxiv.org/abs/2107.03554 摘要:由于各种原因,行人在道路上,特别是无信号人行横道上,会面临死亡或重伤的风险。到目前为止,基于视觉的交通安全系统已经得到了广泛的研究。然而,许多研究需要人工审看大量交通视频,才能可靠地获取交通相关对象的行为因素。在本文中,我们提出了一个更自动化、更简单的系统,从部署在道路上的视频传感器中有效提取目标行为特征。我们对这些特征进行了基本的统计分析,并展示了它们对监控道路交通行为的作用。通过将原型系统应用于韩国乌山市两个无信号人行横道,验证了该系统的可行性。最后,我们通过简单的统计分析比较了这两个地区的车辆和行人的行为。这项研究展示了互联视频传感器网络为智能城市提供可操作数据的潜力,以改善危险道路环境中的行人安全。 摘要:Pedestrians are exposed to risk of death or serious injuries on roads, especially unsignalized crosswalks, for a variety of reasons. To date, an extensive variety of studies have reported on vision-based traffic safety systems. However, many studies required manual inspection of the volumes of traffic video to reliably obtain the behavioral factors of traffic-related objects. In this paper, we propose an automated and simpler system for effectively extracting object behavioral features from video sensors deployed on the road. We conduct basic statistical analysis on these features, and show how they can be useful for monitoring the traffic behavior on the road. We confirm the feasibility of the proposed system by applying our prototype to two unsignalized crosswalks in Osan city, South Korea. To conclude, we compare behaviors of vehicles and pedestrians in those two areas by simple statistical analysis. This study demonstrates the potential for a network of connected video sensors to provide actionable data for smart cities to improve pedestrian safety in dangerous road environments.

医学相关(1篇)

【1】 Task Fingerprinting for Meta Learning in Biomedical Image Analysis 标题:生物医学图像分析中元学习的任务指纹

作者:Patrick Godau,Lena Maier-Hein 机构:German Cancer Research Center, Heidelberg, Germany, Ruprecht-Karls University of Heidelberg, Heidelberg, Germany 备注:Medical Image Computing and Computer Assisted Interventions (MICCAI) 2021 链接:https://arxiv.org/abs/2107.03949 摘要:生物医学图像分析中标注数据的不足是最大的瓶颈之一。元学习研究学习系统如何通过经验来提高效率,从而成为克服数据稀疏性的一个重要概念。然而,基于元学习的方法的核心能力是在给定新任务的情况下识别类似的先前任务,这在生物医学成像领域是一个很大程度上未被探索的挑战。在本文中,我们提出了一个概念,我们称之为任务指纹量化任务相似性的问题。这个概念涉及到将给定的任务(由成像数据和相应的标签表示)转换为固定长度的向量表示。在指纹空间中,可以直接比较不同的任务,而不考虑它们的数据集大小、标签类型或特定分辨率。在外科数据科学(SDS)领域对来自不同医学和非医学领域的26项分类任务进行的初步可行性研究表明,任务指纹技术可用于(1)为预训练选择合适的数据集和(2)为新任务选择合适的体系结构。因此,任务指纹可以成为SDS和其他生物医学图像分析领域元学习的重要工具。 摘要:Shortage of annotated data is one of the greatest bottlenecks in biomedical image analysis. Meta learning studies how learning systems can increase in efficiency through experience and could thus evolve as an important concept to overcome data sparsity. However, the core capability of meta learning-based approaches is the identification of similar previous tasks given a new task - a challenge largely unexplored in the biomedical imaging domain. In this paper, we address the problem of quantifying task similarity with a concept that we refer to as task fingerprinting. The concept involves converting a given task, represented by imaging data and corresponding labels, to a fixed-length vector representation. In fingerprint space, different tasks can be directly compared irrespective of their data set sizes, types of labels or specific resolutions. An initial feasibility study in the field of surgical data science (SDS) with 26 classification tasks from various medical and non-medical domains suggests that task fingerprinting could be leveraged for both (1) selecting appropriate data sets for pretraining and (2) selecting appropriate architectures for a new task. Task fingerprinting could thus become an important tool for meta learning in SDS and other fields of biomedical image analysis.

GAN|对抗|攻击|生成相关(2篇)

【1】 TGHop: An Explainable, Efficient and Lightweight Method for Texture Generation 标题:TGHOP:一种可解释、高效、轻量级的纹理生成方法

作者:Xuejing Lei,Ganning Zhao,Kaitai Zhang,C.-C. Jay Kuo 机构:Media Communications Lab, University of Southern California, Los Angeles, CA, USA 备注:arXiv admin note: substantial text overlap with arXiv:2009.01376 链接:https://arxiv.org/abs/2107.04020 摘要:提出了一种可解释的、高效的、轻量级的纹理生成方法TGHop。虽然深度神经网络可以实现视觉上令人愉悦的纹理合成,但是相关的模型规模大,理论上难以解释,训练上计算量大。相比之下,TGHop的模型尺寸小,数学透明,训练和推理效率高,能够生成高质量的纹理。给定一个示例性纹理,TGHop首先从中裁剪出许多样本块,形成一个称为源的样本块集合。然后,利用PixelHop++框架,对源样本进行像素统计分析,得到由细到粗的子空间序列。为了用TGHop生成纹理块,我们从最粗糙的子空间(称为核)开始,通过跟踪真实样本的分布,在每个子空间中生成样本。最后缝合纹理片,形成大尺寸的纹理图像。实验结果表明,TGHop能以较小的模型尺寸和较快的速度生成高质量的纹理图像。 摘要:An explainable, efficient and lightweight method for texture generation, called TGHop (an acronym of Texture Generation PixelHop), is proposed in this work. Although synthesis of visually pleasant texture can be achieved by deep neural networks, the associated models are large in size, difficult to explain in theory, and computationally expensive in training. In contrast, TGHop is small in its model size, mathematically transparent, efficient in training and inference, and able to generate high quality texture. Given an exemplary texture, TGHop first crops many sample patches out of it to form a collection of sample patches called the source. Then, it analyzes pixel statistics of samples from the source and obtains a sequence of fine-to-coarse subspaces for these patches by using the PixelHop++ framework. To generate texture patches with TGHop, we begin with the coarsest subspace, which is called the core, and attempt to generate samples in each subspace by following the distribution of real samples. Finally, texture patches are stitched to form texture images of a large size. It is demonstrated by experimental results that TGHop can generate texture images of superior quality with a small model size and at a fast speed.

【2】 Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation 标题:网格划分注意力:用于高分辨率细节生成的带归纳偏置的高效Transformer近似

作者:Nikolay Jetchev,Gökhan Yildirim,Christian Bracher,Roland Vollgraf 机构:Zalando Research, Zalando SE, Berlin, Germany 备注:code available at this https URL 链接:https://arxiv.org/abs/2107.03742 摘要:注意力是一种能够灵活处理图像信息的通用推理机制,但由于其内存需求,迄今难以用于高分辨率图像生成。我们提出了网格划分注意力(GPA),一种新的近似注意力算法,它利用稀疏归纳偏置来提高图像域的计算和存储效率:查询只关注少数键,而空间上邻近的查询由于相关性而关注邻近的键。本文介绍了这一新的注意力层,分析了它的复杂度,以及如何通过超参数来调节内存使用和模型能力之间的平衡。我们将展示这种注意力如何支持带有复制模块的新型深度学习体系结构,这类模块特别适用于像姿态变形这样的条件图像生成任务。我们的贡献是(i)新的GPA层的算法和代码,(ii)一种新的深度注意力复制体系结构,以及(iii)在人体姿态变形生成基准上的最新实验结果。 摘要:Attention is a general reasoning mechanism that can flexibly deal with image information, but its memory requirements had made it so far impractical for high resolution image generation. We present Grid Partitioned Attention (GPA), a new approximate attention algorithm that leverages a sparse inductive bias for higher computational and memory efficiency in image domains: queries attend only to few keys, spatially close queries attend to close keys due to correlations. Our paper introduces the new attention layer, analyzes its complexity and how the trade-off between memory usage and model power can be tuned by the hyper-parameters. We will show how such attention enables novel deep learning architectures with copying modules that are especially useful for conditional image generation tasks like pose morphing. Our contributions are (i) algorithm and code of the novel GPA layer, (ii) a novel deep attention-copying architecture, and (iii) new state-of-the art experimental results in human pose morphing generation benchmarks.
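下面给出一个极简的PyTorch示意,演示GPA所依赖的稀疏归纳偏置:每个查询只关注自己所在网格内的键。窗口大小 ws 等均为示例假设,并非论文GPA层的官方实现:

```python
# 示意代码:按网格分块的局部注意力(非官方实现,仅展示"查询只关注空间邻近键"的思想)
import torch

def grid_partitioned_attention(x, ws=8):
    """x: (B, C, H, W),H、W 需能被窗口大小 ws 整除。"""
    B, C, H, W = x.shape
    # 将特征图划分为互不重叠的 ws x ws 网格,注意力只在每个网格内部计算
    x = x.reshape(B, C, H // ws, ws, W // ws, ws)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)  # (B*nH*nW, ws*ws, C)
    attn = torch.softmax(x @ x.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ x                                            # 每个查询只聚合本网格内的键/值
    out = out.reshape(B, H // ws, W // ws, ws, ws, C)
    return out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

y = grid_partitioned_attention(torch.randn(2, 32, 64, 64))
print(y.shape)  # torch.Size([2, 32, 64, 64])
```

把注意力限制在每个 ws×ws 网格内,注意力矩阵的规模就从 O((HW)^2) 降到 O(HW·ws^2),这正是此类稀疏注意力内存效率的来源。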

NAS模型搜索(1篇)

【1】 CHASE: Robust Visual Tracking via Cell-Level Differentiable Neural Architecture Search 标题:CHASE:基于单元级可微神经结构搜索的鲁棒视觉跟踪

作者:Seyed Mojtaba Marvasti-Zadeh,Javad Khaghani,Li Cheng,Hossein Ghanei-Yakhdan,Shohreh Kasaei 机构: Vision and Learning Lab, University of Alberta, Edmonton, Canada, Digital Image & Video Processing Lab, Yazd University, Yazd, Iran, Image Processing Lab, Sharif University of Technology, Tehran, Iran 备注:The first two authors contributed equally to this work 链接:https://arxiv.org/abs/2107.03463 摘要:如今,一个强大的视觉目标跟踪器依赖于其精心设计的模块,这些模块通常由手动设计的网络体系结构组成,以提供高质量的跟踪结果。不足为奇,手工设计过程成为一个特别具有挑战性的障碍,因为它需要足够的先验经验、巨大的努力、直觉,也许还有一些好运。同时,神经结构搜索作为自动搜索可行网络结构的一种有前途的方法,正在图像分割等实际应用中逐渐获得认可。在这项工作中,我们提出了一种新的单元级可微结构搜索机制来自动化跟踪模块的网络设计,目的是在离线训练期间使主干特征适应跟踪网络的目标。该方法简单、高效,无需堆叠一系列模块来构建网络。我们的方法很容易被合并到现有的跟踪器中,并使用不同的基于可微结构搜索的方法和跟踪目标进行了实证验证。广泛的实验评估表明,我们的方法在五个常用基准上具有优越的性能。同时,在TrackingNet数据集上,二阶(一阶)DARTS方法的自动搜索过程需要41(18)小时。 摘要:A strong visual object tracker nowadays relies on its well-crafted modules, which typically consist of manually-designed network architectures to deliver high-quality tracking results. Not surprisingly, the manual design process becomes a particularly challenging barrier, as it demands sufficient prior experience, enormous effort, intuition and perhaps some good luck. Meanwhile, neural architecture search has been gaining ground in practical applications such as image segmentation, as a promising method for tackling the issue of automated search of feasible network structures. In this work, we propose a novel cell-level differentiable architecture search mechanism to automate the network design of the tracking module, aiming to adapt backbone features to the objective of a tracking network during offline training. The proposed approach is simple, efficient, and with no need to stack a series of modules to construct a network. Our approach is easy to be incorporated into existing trackers, which is empirically validated using different differentiable architecture search-based methods and tracking objectives. Extensive experimental evaluations demonstrate the superior performance of our approach over five commonly-used benchmarks. Meanwhile, our automated searching process takes 41 (18) hours for the second (first) order DARTS method on the TrackingNet dataset.
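作为背景,下面用几行PyTorch勾勒可微结构搜索(DARTS风格)的核心"混合操作":每条边上的候选操作按架构参数 alpha 的 softmax 加权求和,使结构选择可以随权重一起用梯度优化。候选操作集合为示例取值,并非CHASE的官方实现:

```python
# 示意代码:DARTS 风格的可微"混合操作",仅说明单元级可微搜索的机制
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.Conv2d(ch, ch, 5, padding=2),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # 可微的架构参数

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        # 搜索结束后通常只保留权重最大的那个操作
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

y = MixedOp(16)(torch.randn(2, 16, 32, 32))
print(y.shape)  # torch.Size([2, 16, 32, 32])
```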

人脸|人群计数(2篇)

【1】 Causal affect prediction model using a facial image sequence 标题:使用面部图像序列的因果情感预测模型

作者:Geesung Oh,Euiseok Jeong,Sejoon Lim 机构:Graduate School of Automotive Engineering, Kookmin University, Seoul, Korea, Department of Automobile and IT Convergence 链接:https://arxiv.org/abs/2107.03886 摘要:在人类情感行为的研究中,随着深度学习的发展,人脸表情识别的性能不断提高。然而,为了提高性能,除了当前的人脸图像之外,不仅需要过去的图像,还需要未来的图像,这妨碍了该技术在实时环境中的应用。在本文中,我们提出了因果情感预测网络(CAPNet),它只使用过去的面部图像来预测当前的情感效价和唤醒。我们利用Aff-Wild2数据集,将过去的图像序列与当前标签配对进行监督学习,训练CAPNet学习过去图像与相应情感效价和唤醒之间的因果推理。实验表明,训练好的CAPNet仅凭提前三分之一秒的过去面部图像预测情感效价和唤醒,就优于第二届野外情感行为分析(ABAW2)竞赛的基线。因此,在实时应用中,CAPNet仅利用过去的数据就可以可靠地预测情感效价和唤醒。 摘要:Among human affective behavior research, facial expression recognition research is improving in performance along with the development of deep learning. However, for improved performance, not only past images but also future images should be used along with corresponding facial images, but there are obstacles to the application of this technique to real-time environments. In this paper, we propose the causal affect prediction network (CAPNet), which uses only past facial images to predict corresponding affective valence and arousal. We train CAPNet to learn causal inference between past images and corresponding affective valence and arousal through supervised learning by pairing the sequence of past images with the current label using the Aff-Wild2 dataset. We show through experiments that the well-trained CAPNet outperforms the baseline of the second challenge of the Affective Behavior Analysis in-the-wild (ABAW2) Competition by predicting affective valence and arousal only with past facial images one-third of a second earlier. Therefore, in real-time application, CAPNet can reliably predict affective valence and arousal only with past data.

【2】 Crowd Counting via Perspective-Guided Fractional-Dilation Convolution 标题:基于透视引导分数膨胀卷积法的人群计数

作者:Zhaoyi Yan,Ruimao Zhang,Hongzhi Zhang,Qingfu Zhang,Wangmeng Zuo 机构: Harbin Institute of Technology 备注:None 链接:https://arxiv.org/abs/2107.03665 摘要:在众多视频监控场景中,人群计数至关重要。如何处理由透视效应引起的行人尺度的剧烈变化是本课题的主要研究内容之一。针对这一问题,本文提出了一种基于卷积神经网络的人群计数方法,称为透视引导分数膨胀网络(PFDNet)。通过对连续尺度变化的建模,提出的PFDNet能够选择合适的分数膨胀核以适应不同的空间位置。相比只考虑离散代表性尺度的现有最先进方法,它显著提高了灵活性。此外,通过避免其他方法中使用的多尺度或多列结构,它在计算上更加高效。在实际应用中,提出的PFDNet是通过在VGG16-BN主干上叠加多个透视引导的分数膨胀卷积(PFC)来构造的。通过引入一种新的广义膨胀卷积运算,PFC可以在透视标注的指导下处理空间域中的分数膨胀率,实现行人的连续尺度建模。为了解决某些情况下透视信息不可用的问题,我们进一步在所提出的PFDNet中引入了一个有效的透视估计分支,该分支一旦经过预训练,即可在有监督或弱监督设置下训练。大量实验表明,在ShanghaiTech A、ShanghaiTech B、WorldExpo'10、UCF-QNRF、UCF_CC_50和TRANCOS数据集上,所提出的PFDNet的性能优于现有的方法,分别达到了MAE 53.8、6.5、6.8、84.3、205.8和3.06。 摘要:Crowd counting is critical for numerous video surveillance scenarios. One of the main issues in this task is how to handle the dramatic scale variations of pedestrians caused by the perspective effect. To address this issue, this paper proposes a novel convolution neural network-based crowd counting method, termed Perspective-guided Fractional-Dilation Network (PFDNet). By modeling the continuous scale variations, the proposed PFDNet is able to select the proper fractional dilation kernels for adapting to different spatial locations. It significantly improves the flexibility of the state-of-the-arts that only consider the discrete representative scales. In addition, by avoiding the multi-scale or multi-column architecture that used in other methods, it is computationally more efficient. In practice, the proposed PFDNet is constructed by stacking multiple Perspective-guided Fractional-Dilation Convolutions (PFC) on a VGG16-BN backbone. By introducing a novel generalized dilation convolution operation, the PFC can handle fractional dilation ratios in the spatial domain under the guidance of perspective annotations, achieving continuous scales modeling of pedestrians. To deal with the problem of unavailable perspective information in some cases, we further introduce an effective perspective estimation branch to the proposed PFDNet, which can be trained in either supervised or weakly-supervised setting once the branch has been pre-trained. Extensive experiments show that the proposed PFDNet outperforms state-of-the-art methods on ShanghaiTech A, ShanghaiTech B, WorldExpo'10, UCF-QNRF, UCF_CC_50 and TRANCOS dataset, achieving MAE 53.8, 6.5, 6.8, 84.3, 205.8, and 3.06 respectively.
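论文的核心是让膨胀率可以取分数值。下面的示意代码用"向下/向上取整两个整数膨胀率卷积的线性插值"来近似分数膨胀率,仅用于说明连续尺度建模的直觉,并非PFC模块的官方实现:

```python
# 示意代码:分数膨胀率卷积的一种朴素近似(按小数部分对 floor/ceil 两个结果加权)
import torch
import torch.nn.functional as F

def fractional_dilated_conv(x, weight, rate):
    """weight 为 3x3 卷积核;rate 为分数膨胀率,例如 1.3。"""
    lo, hi = int(rate), int(rate) + 1
    frac = rate - lo
    y_lo = F.conv2d(x, weight, padding=lo, dilation=lo)  # padding=dilation 保持空间尺寸
    y_hi = F.conv2d(x, weight, padding=hi, dilation=hi)
    return (1 - frac) * y_lo + frac * y_hi

x = torch.randn(1, 8, 32, 32)
w = torch.randn(16, 8, 3, 3)
print(fractional_dilated_conv(x, w, rate=1.3).shape)  # torch.Size([1, 16, 32, 32])
```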

图像视频检索|Re-id相关(1篇)

【1】 Case-based similar image retrieval for weakly annotated large histopathological images of malignant lymphoma using deep metric learning 标题:基于深度度量学习的弱标注恶性淋巴瘤大尺寸组织病理图像的病例级相似图像检索

作者:Noriaki Hashimoto,Yusuke Takagi,Hiroki Masuda,Hiroaki Miyoshi,Kei Kohno,Miharu Nagaishi,Kensaku Sato,Mai Takeuchi,Takuya Furuta,Keisuke Kawamoto,Kyohei Yamada,Mayuko Moritsubo,Kanako Inoue,Yasumasa Shimasaki,Yusuke Ogura,Teppei Imamoto,Tatsuzo Mishina,Koichi Ohshima,Hidekata Hontani,Ichiro Takeuchi 机构:RIKEN Center for Advanced Intelligence Project, Department of Computer Science, Nagoya Institute of Technology, Department of Pathology, Kurume University School of Medicine 链接:https://arxiv.org/abs/2107.03602 摘要:在本研究中,我们提出了一种新的基于病例的相似图像检索(SIR)方法,用于苏木精-伊红(H&E)染色的恶性淋巴瘤组织病理学图像。当使用全切片图像(WSI)作为输入查询时,希望能够通过聚焦于病理重要区域(例如肿瘤细胞)的图像块来检索相似的病例。为了解决这个问题,我们采用了基于注意力的多实例学习,这使得我们能够在计算病例之间的相似度时将注意力集中在肿瘤的特定区域。此外,我们采用对比距离度量学习,并结合免疫组织化学(IHC)染色模式作为有用的监督信息,以定义异质性恶性淋巴瘤病例之间的适当相似性。在249例恶性淋巴瘤患者的实验中,我们证实了该方法比基线的基于病例的SIR方法具有更高的评价指标。此外,病理学家的主观评估显示,我们使用IHC染色模式的相似性度量适合表示恶性淋巴瘤H&E染色组织图像的相似性。 摘要:In the present study, we propose a novel case-based similar image retrieval (SIR) method for hematoxylin and eosin (H&E)-stained histopathological images of malignant lymphoma. When a whole slide image (WSI) is used as an input query, it is desirable to be able to retrieve similar cases by focusing on image patches in pathologically important regions such as tumor cells. To address this problem, we employ attention-based multiple instance learning, which enables us to focus on tumor-specific regions when the similarity between cases is computed. Moreover, we employ contrastive distance metric learning to incorporate immunohistochemical (IHC) staining patterns as useful supervised information for defining appropriate similarity between heterogeneous malignant lymphoma cases. In the experiment with 249 malignant lymphoma patients, we confirmed that the proposed method exhibited higher evaluation measures than the baseline case-based SIR methods. Furthermore, the subjective evaluation by pathologists revealed that our similarity measure using IHC staining patterns is appropriate for representing the similarity of H&E-stained tissue images for malignant lymphoma.
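文中使用的基于注意力的多实例学习(MIL)池化可以用几行PyTorch勾勒:每个图块特征得到一个注意力权重,病例级表示是图块特征的加权和(一般形式可参见 Ilse et al., 2018)。网络宽度等为示例假设:

```python
# 示意代码:注意力 MIL 池化,使病例级表示聚焦于高权重图块
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, patches):                         # patches: (N, dim),一个病例的 N 个图块特征
        a = torch.softmax(self.score(patches), dim=0)   # (N, 1) 每个图块的注意力权重
        return (a * patches).sum(dim=0), a              # 病例级表示 + 可用于可视化的权重

case_repr, weights = AttentionMILPooling()(torch.randn(200, 512))
print(case_repr.shape, weights.shape)  # torch.Size([512]) torch.Size([200, 1])
```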

裁剪|量化|加速|压缩相关(2篇)

【1】 Weight Reparametrization for Budget-Aware Network Pruning 标题:预算感知网络剪枝的权重重参数化

作者:Robin Dupont,Hichem Sahbi,Guillaume Michel 备注:Accepted at International Conference on Image Processing (ICIP 2021) 链接:https://arxiv.org/abs/2107.03909 摘要:剪枝(pruning)试图通过去除过参数化网络中的冗余权重来设计轻量级体系结构。大多数现有技术首先去除结构化子网络(滤波器、通道等),然后对产生的网络进行微调以保持高精度。然而,去除一个整体结构是一个很强的拓扑先验,而通过微调恢复精度非常麻烦。在本文中,我们介绍了一种"端到端"的轻量级网络设计,它可以在不进行微调的情况下同时实现训练和剪枝。该方法的设计原则依赖于重参数化,即不仅学习权值,而且学习轻量级子网络的拓扑结构。这种重参数化充当先验(或正则化器),它从底层网络的权重隐式地定义剪枝掩码,而不增加训练参数的数目。稀疏性由提供精确剪枝的预算损失来诱导。使用标准体系结构(即Conv4、VGG19和ResNet18)在CIFAR10和TinyImageNet数据集上进行的大量实验显示了令人信服的结果,而无需进行微调。 摘要:Pruning seeks to design lightweight architectures by removing redundant weights in overparameterized networks. Most of the existing techniques first remove structured sub-networks (filters, channels,...) and then fine-tune the resulting networks to maintain a high accuracy. However, removing a whole structure is a strong topological prior and recovering the accuracy, with fine-tuning, is highly cumbersome. In this paper, we introduce an "end-to-end" lightweight network design that achieves training and pruning simultaneously without fine-tuning. The design principle of our method relies on reparametrization that learns not only the weights but also the topological structure of the lightweight sub-network. This reparametrization acts as a prior (or regularizer) that defines pruning masks implicitly from the weights of the underlying network, without increasing the number of training parameters. Sparsity is induced with a budget loss that provides an accurate pruning. Extensive experiments conducted on the CIFAR10 and the TinyImageNet datasets, using standard architectures (namely Conv4, VGG19 and ResNet18), show compelling results without fine-tuning.
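下面用一个玩具例子示意"掩码由权重隐式定义 + 预算损失"的配合方式;掩码函数(幅值的平滑阈值)与预算定义均为本文假设,并非论文的具体构造:

```python
# 示意代码:从权重隐式导出软剪枝掩码,并用预算损失控制保留比例
import torch

def implicit_mask(w, t=0.05):
    # 掩码由权重本身参数化(幅值的平滑阈值),不引入额外可训练参数
    return torch.sigmoid((w.abs() - t) / 0.01)

def budget_loss(w, budget=0.3):
    # 约束"保留权重比例"逼近给定预算
    return (implicit_mask(w).mean() - budget) ** 2

w = torch.randn(64, 64, requires_grad=True)
loss = budget_loss(w)
loss.backward()          # 预算损失的梯度直接作用在权重上:训练与剪枝同时进行
print(float(loss))
```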

【2】 S^3: Sign-Sparse-Shift Reparametrization for Effective Training of Low-bit Shift Networks 标题:S^3:用于低比特移位网络有效训练的符号-稀疏-移位重参数化

作者:Xinlin Li,Bang Liu,Yaoliang Yu,Wulong Liu,Chunjing Xu,Vahid Partovi Nia 机构:Noah's Ark Lab, Huawei Technologies., Department of Computer Science and Operations Research (DIRO), University of Montreal., Cheriton School of Computer Science, University of Waterloo. 链接:https://arxiv.org/abs/2107.03453 摘要:移位神经网络通过去除昂贵的乘法运算并将连续权值量化为低比特离散值来降低计算复杂度,与传统神经网络相比速度快且节能。然而,现有的移位网络对权值初始化非常敏感,并且由于梯度消失和权值符号冻结问题导致性能下降。为了解决这些问题,我们提出了S^3重参数化,一种训练低比特移位网络的新技术。我们的方法以符号-稀疏-移位的三重方式分解离散参数。这样,它可以有效地学习一个低比特网络,其权值动态特性类似于全精度网络,且对权值初始化不敏感。我们提出的训练方法突破了移位神经网络的界限,表明3比特移位网络在ImageNet的top-1精度上优于其全精度对应网络。 摘要:Shift neural networks reduce computation complexity by removing expensive multiplication operations and quantizing continuous weights into low-bit discrete values, which are fast and energy efficient compared to conventional neural networks. However, existing shift networks are sensitive to the weight initialization, and also yield a degraded performance caused by vanishing gradient and weight sign freezing problem. To address these issues, we propose S^3 re-parameterization, a novel technique for training low-bit shift networks. Our method decomposes a discrete parameter in a sign-sparse-shift 3-fold manner. In this way, it efficiently learns a low-bit network with a weight dynamics similar to full-precision networks and insensitive to weight initialization. Our proposed training method pushes the boundaries of shift neural networks and shows that 3-bit shift networks outperform their full-precision counterparts in terms of top-1 accuracy on ImageNet.
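下面的小例子演示"符号×稀疏×移位"三因子分解的数值含义(这里假设权重取 sign·sparse·2^(-shift) 的形式,具体形式与训练时对三个因子的连续重参数化细节请参见原文):

```python
# 示意代码:把一个离散移位权重分解为 符号、稀疏、2 的幂 三个因子再重组
import torch

def compose_shift_weight(sign, sparse, shift):
    """sign 取 {-1,+1},sparse 取 {0,1},shift 为非负整数指数。"""
    return sign * sparse * torch.pow(2.0, -shift)

sign   = torch.tensor([ 1., -1.,  1., -1.])
sparse = torch.tensor([ 1.,  1.,  0.,  1.])
shift  = torch.tensor([ 0.,  1.,  2.,  3.])
print(compose_shift_weight(sign, sparse, shift))  # tensor([ 1.0000, -0.5000,  0.0000, -0.1250])
```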

超分辨率|去噪|去模糊|去雾(1篇)

【1】 Regional Differential Information Entropy for Super-Resolution Image Quality Assessment 标题:区域差分信息熵在超分辨率图像质量评价中的应用

作者:Ningyuan Xu,Jiayan Zhuang,Jiangjian Xiao,Chengbin Peng 机构:University of Chinese Academy of Sciences, Beijing, China, Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, Ningbo, China, College of Information Science and Engineering, Ningbo University, Ningbo, China 备注:8 pages, 9 figures, 4 tables 链接:https://arxiv.org/abs/2107.03642 摘要:PSNR和SSIM是超分辨率问题中应用最广泛的指标,因为它们易于使用,并且可以评估生成图像和参考图像之间的相似性。然而,单幅图像的超分辨率是一个不适定问题,同一幅低分辨率图像对应多幅高分辨率图像。相似性不能完全反映修复效果。生成的图像的感知质量也很重要,但是PSNR和SSIM不能很好地反映感知质量。为了解决这个问题,我们提出了一种区域差分信息熵的方法来度量相似度和感知质量。针对传统的图像信息熵不能反映图像的结构信息的问题,提出了用滑动窗口来度量各区域的信息熵。考虑到人类视觉系统对低亮度下的亮度差异更为敏感,我们采用$\gamma$量化而不是线性量化。为了加快计算速度,我们用神经网络重新组织了信息熵的计算过程。通过在我们的IQA数据集和PIPAL上的实验,证明了RDIE能够更好地量化图像尤其是基于GAN的图像的感知质量。 摘要:PSNR and SSIM are the most widely used metrics in super-resolution problems, because they are easy to use and can evaluate the similarities between generated images and reference images. However, single image super-resolution is an ill-posed problem, there are multiple corresponding high-resolution images for the same low-resolution image. The similarities can't totally reflect the restoration effect. The perceptual quality of generated images is also important, but PSNR and SSIM do not reflect perceptual quality well. To solve the problem, we proposed a method called regional differential information entropy to measure both of the similarities and perceptual quality. To overcome the problem that traditional image information entropy can't reflect the structure information, we proposed to measure every region's information entropy with sliding window. Considering that the human visual system is more sensitive to the brightness difference at low brightness, we take $\gamma$ quantization rather than linear quantization. To accelerate the method, we reorganized the calculation procedure of information entropy with a neural network. Through experiments on our IQA dataset and PIPAL, this paper proves that RDIE can better quantify perceptual quality of images especially GAN-based images.
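区域信息熵加 $\gamma$ 量化的思路可以用NumPy直接写出。下面是一个与论文思想一致、但实现细节(窗口大小、量化级数、$\gamma$ 取值等)为示例假设的示意:

```python
# 示意代码:带 gamma 量化的滑动窗口信息熵(RDIE 思想的玩具版)
import numpy as np

def regional_entropy(img, win=16, levels=32, gamma=2.2):
    """img: 取值 [0,1] 的灰度图;先做 gamma 量化,再逐窗口统计香农熵。"""
    q = np.clip((img ** (1.0 / gamma) * levels).astype(int), 0, levels - 1)  # 低亮度处量化更细
    H, W = q.shape
    ents = []
    for i in range(0, H - win + 1, win):
        for j in range(0, W - win + 1, win):
            hist = np.bincount(q[i:i + win, j:j + win].ravel(), minlength=levels)
            p = hist / hist.sum()
            p = p[p > 0]
            ents.append(-(p * np.log2(p)).sum())  # 该区域的信息熵
    return np.array(ents)

print(regional_entropy(np.random.rand(64, 64)).mean())
```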

3D|3D重建等相关(2篇)

【1】 3D Neural Scene Representations for Visuomotor Control 标题:视觉运动控制中的三维神经场景表示

作者:Yunzhu Li,Shuang Li,Vincent Sitzmann,Pulkit Agrawal,Antonio Torralba 机构:MIT CSAIL 备注:First two authors contributed equally. Project Page: this https URL 链接:https://arxiv.org/abs/2107.04004 摘要:人类对我们周围的3D环境有很强的直觉理解。我们大脑中的物理模型适用于不同材料的物体,使我们能够执行各种各样的操作任务,这些任务远远超出了当前机器人的能力范围。在这项工作中,我们希望纯粹从二维视觉观察中学习动态三维场景的模型。该模型将神经辐射场(NeRF)和时间对比学习与自动编码框架相结合,学习视点不变的三维感知场景表示。我们证明,建立在所学表示空间之上的动力学模型,能够为涉及刚体和流体的具有挑战性的操纵任务实现视觉运动控制,其中目标是在不同于机器人操作视角的视点下指定的。当与自动解码框架相结合时,它甚至可以支持来自训练分布之外的摄像机视点的目标指定。通过未来预测和新视角合成,进一步证明了所学三维动力学模型的丰富性。最后,我们提供了关于不同系统设计的详细消融研究和对所学表示的定性分析。 摘要:Humans have a strong intuitive understanding of the 3D environment around us. The mental model of the physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we desire to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns viewpoint-invariant 3D-aware scene representations. We show that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on. When coupled with an auto-decoding framework, it can even support goal specification from camera viewpoints that are outside the training distribution. We further demonstrate the richness of the learned 3D dynamics model by performing future prediction and novel view synthesis. Finally, we provide detailed ablation studies regarding different system designs and qualitative analysis of the learned representations.

【2】 LanguageRefer: Spatial-Language Model for 3D Visual Grounding 标题:LanguageRefer:面向三维视觉定位的空间-语言模型

作者:Junha Roh,Karthik Desingh,Ali Farhadi,Dieter Fox 机构:Paul G. Allen School, University of Washington, United States 备注:11 pages, 3 figures 链接:https://arxiv.org/abs/2107.03438 摘要:为了在不久的将来实现能够理解人类指令并执行有意义任务的机器人,开发能够理解指代语言、从而识别真实世界三维场景中常见对象的学习模型非常重要。在本文中,我们针对三维视觉定位问题开发了一个空间-语言模型。具体地说,给定一个以点云形式重建的三维场景(其中包含潜在候选对象的三维边界框),以及一条指代场景中目标对象的语言语句,我们的模型从一组潜在候选对象中识别出目标对象。我们的空间-语言模型使用基于Transformer的结构,将边界框的空间嵌入与微调后的DistilBert语言嵌入相结合,并在3D场景中的对象之间进行推理以找到目标对象。我们证明了我们的模型在ReferIt3D提出的视觉-语言数据集上具有竞争力。我们还进一步分析了与感知噪声解耦的空间推理任务性能、视点相关话语对准确率的影响,以及面向潜在机器人应用的视点标注。 摘要:To realize robots that can understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that can understand referential language to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from a set of potential candidates. Our spatial-language model uses a transformer-based architecture that combines spatial embedding from bounding-box with a finetuned language embedding from DistilBert and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D. We provide additional analysis of performance in spatial reasoning tasks decoupled from perception noise, the effect of view-dependent utterances in terms of accuracy, and view-point annotations for potential robotics applications.

其他神经网络|深度学习|模型|建模(6篇)

【1】 CamTuner: Reinforcement-Learning based System for Camera Parameter Tuning to enhance Analytics 标题:CamTuner:基于强化学习的摄像机参数调整增强分析系统

作者:Sibendu Paul,Kunal Rao,Giuseppe Coviello,Murugan Sankaradas,Oliver Po,Y. Charlie Hu,Srimat T. Chakradhar 机构:Purdue University, West Lafayette, USA, NEC Laboratories America, New Jersey, USA 链接:https://arxiv.org/abs/2107.03964 摘要:像摄像机这样的复杂传感器包括几十个可配置的参数,最终用户可以设置这些参数来定制传感器以适应特定的应用场景。尽管参数设置会显著影响传感器输出的质量和从传感器数据得出的见解的准确性,但大多数最终用户使用固定的参数设置,因为他们缺乏适当配置这些参数的技能或理解。我们提出了CamTuner,这是一个能够自动、动态地使复杂传感器适应不断变化环境的系统。CamTuner包括两个关键部件。首先,一个定制的分析质量估计器,它是一个深度学习模型,可以随着传感器周围环境的变化,自动地、连续地估计分析单元的洞察质量。第二,强化学习(RL)模块,它对质量的变化作出反应,并自动调整相机参数以提高洞察的准确性。我们通过设计虚拟模型来模拟摄像机的基本行为,从而将RL模块的训练时间缩短了一个数量级:我们设计了可以设置为不同值的虚拟旋钮,来模拟为摄像机的可配置参数分配不同值的效果;我们还设计了一个虚拟摄像机模型,模拟一天中不同时间摄像机的输出。这些虚拟模型大大加快了训练速度,因为(a)真实相机的帧速率限制在25-30fps,而虚拟模型的处理速度为300fps,(b)我们不必等到真实相机看到不同的环境,这可能需要数周或数月,(c)虚拟旋钮可以立即更新,而更改相机参数设置可能需要200-500毫秒。我们的动态调整方法可使多个视频分析任务的洞察精度最多提高12%。 摘要:Complex sensors like video cameras include tens of configurable parameters, which can be set by end-users to customize the sensors to specific application scenarios. Although parameter settings significantly affect the quality of the sensor output and the accuracy of insights derived from sensor data, most end-users use a fixed parameter setting because they lack the skill or understanding to appropriately configure these parameters. We propose CamTuner, which is a system to automatically, and dynamically adapt the complex sensor to changing environments. CamTuner includes two key components. First, a bespoke analytics quality estimator, which is a deep-learning model to automatically and continuously estimate the quality of insights from an analytics unit as the environment around a sensor change. Second, a reinforcement learning (RL) module, which reacts to the changes in quality, and automatically adjusts the camera parameters to enhance the accuracy of insights. We improve the training time of the RL module by an order of magnitude by designing virtual models to mimic essential behavior of the camera: we design virtual knobs that can be set to different values to mimic the effects of assigning different values to the camera's configurable parameters, and we design a virtual camera model that mimics the output from a video camera at different times of the day. These virtual models significantly accelerate training because (a) frame rates from a real camera are limited to 25-30 fps while the virtual models enable processing at 300 fps, (b) we do not have to wait until the real camera sees different environments, which could take weeks or months, and (c) virtual knobs can be updated instantly, while it can take 200-500 ms to change the camera parameter settings. Our dynamic tuning approach results in up to 12% improvement in the accuracy of insights from several video analytics tasks.

【2】 Prior Aided Streaming Network for Multi-task Affective Recognition at the 2nd ABAW2 Competition 标题:第二届ABAW2竞赛中多任务情感识别的先验辅助流式网络

作者:Wei Zhang,Zunhu Guo,Keyu Chen,Lincheng Li,Zhimeng Zhang,Yu Ding 机构:Virtual Human Group, Netease Fuxi AI Lab, Southwest University, China 链接:https://arxiv.org/abs/2107.03708 摘要:自动情感识别一直是人机交互(HCI)领域的一个重要研究课题。近年来,随着深度学习技术的发展和大规模野外标注数据集的出现,人脸情感分析开始面向真实环境中的挑战。在本文中,我们介绍了我们提交给第二届野外情感行为分析(ABAW2)竞赛的方案。在处理不同的情绪表征,包括类别情绪(CE)、动作单元(AU)和效价-唤醒(VA)时,基于这三种表征内在关联的启发,我们提出了一个多任务流式网络。此外,我们利用一种先进的人脸表情嵌入作为先验知识,它能够在保留表情相似度的同时捕捉具有身份不变性的表情特征,以辅助下游识别任务。在Aff-Wild2数据集上的大量定量评估以及消融研究证明了我们提出的先验辅助流式网络方法的有效性。 摘要:Automatic affective recognition has been an important research topic in human computer interaction (HCI) area. With recent development of deep learning techniques and large-scale in-the-wild annotated datasets, facial emotion analysis is now aimed at challenges in real-world settings. In this paper, we introduce our submission to the 2nd Affective Behavior Analysis in-the-wild (ABAW2) Competition. In dealing with different emotion representations, including Categorical Emotions (CE), Action Units (AU), and Valence Arousal (VA), we propose a multi-task streaming network by a heuristic that the three representations are intrinsically associated with each other. Besides, we leverage an advanced facial expression embedding as prior knowledge, which is capable of capturing identity-invariant expression features while preserving the expression similarities, to aid the downstream recognition tasks. The extensive quantitative evaluations as well as ablation studies on the Aff-Wild2 dataset prove the effectiveness of our proposed prior aided streaming network approach.

【3】 Feature Pyramid Network for Multi-task Affective Analysis 标题:用于多任务情感分析的特征金字塔网络

作者:Ruian He,Zhen Xing,Bo Yan 机构:School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University 备注:5 pages, 3 figures 链接:https://arxiv.org/abs/2107.03670 摘要:情感分析不是一个单一的任务,效价-唤醒值、表情类别和动作单元可以同时预测。以往的研究要么没有将这三种面部属性作为整体任务来处理,要么忽视了三者之间的纠缠与层次关系。我们提出了一种名为特征金字塔网络的新型多任务情感分析模型。通过提取层次化特征来预测三种标签,并采用师生训练策略向预先训练好的单任务模型学习。大量的实验结果表明,该模型优于其他模型。有关代码和模型的研究用途,请访问$\href{https://github.com/ryanhe312/ABAW2-FPNMAA}{\text{此链接}}$。 摘要:Affective Analysis is not a single task, and the valence-arousal value, expression class and action unit can be predicted at the same time. Previous research either failed to take the three as a whole task or ignored the entanglement and hierarchical relation of these three facial attributes. We propose a novel model named feature pyramid networks for multi-task affect analysis. The hierarchical features are extracted to predict three labels and we apply a teacher-student training strategy to learn from pretrained single-task models. Extensive experiment results demonstrate that the proposed model outperforms other models. The code and model are available for research purposes at $\href{https://github.com/ryanhe312/ABAW2-FPNMAA}{\text{this link}}$.

【4】 Staying in Shape: Learning Invariant Shape Representations using Contrastive Learning 标题:保持形状:使用对比学习学习不变形状表示

作者:Jeffrey Gu,Serena Yeung 机构:Institute for Computational & Mathematical Eng., Stanford University, Stanford, California, USA, Depts. of Biomedical Data Science and Computer Science, Stanford University, Stanford, California, USA 链接:https://arxiv.org/abs/2107.03552 摘要:创建对等距或近似等距变换保持不变的形状表示一直是形状分析中的一个重要方向,因为强制不变性有助于学习更有效、更健壮的形状表示。大多数现有的不变形状表示都是手工设计的,而以往关于学习形状表示的工作并不关注产生不变表示。为了解决无监督不变形状表示的学习问题,我们采用对比学习,通过学习对用户指定数据增强的不变性来产生判别性表示。为了产生特别针对等距与近似等距不变的表示,我们提出了随机采样这些变换的新数据增强。实验表明,该方法在有效性和鲁棒性方面均优于以往的无监督学习方法。 摘要:Creating representations of shapes that are invariant to isometric or almost-isometric transformations has long been an area of interest in shape analysis, since enforcing invariance allows the learning of more effective and robust shape representations. Most existing invariant shape representations are handcrafted, and previous work on learning shape representations do not focus on producing invariant representations. To solve the problem of learning unsupervised invariant shape representations, we use contrastive learning, which produces discriminative representations through learning invariance to user-specified data augmentations. To produce representations that are specifically isometry and almost-isometry invariant, we propose new data augmentations that randomly sample these transformations. We show experimentally that our method outperforms previous unsupervised learning approaches in both effectiveness and robustness.
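文中的对比学习目标可以用标准的InfoNCE形式来理解:同一形状在两组随机(近)等距变换下的嵌入互为正样本。下面是一个极简示意,温度 tau 为假设取值,数据增强与编码器不在此展示:

```python
# 示意代码:InfoNCE 对比损失的极简写法(仅跨视图对比)
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """z1, z2: (N, d),同一批形状在两组随机等距变换下的嵌入。"""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # (N, N) 相似度矩阵,对角线为正样本对
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

print(float(info_nce(torch.randn(8, 64), torch.randn(8, 64))))
```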

【5】 Tensor Methods in Computer Vision and Deep Learning 标题:计算机视觉与深度学习中的张量方法

作者:Yannis Panagakis,Jean Kossaifi,Grigorios G. Chrysos,James Oldfield,Mihalis A. Nicolaou,Anima Anandkumar,Stefanos Zafeiriou 机构:Grigorios G. Chrysos is with the Department of Electrical Engineering; James Oldfield is with Queen Mary University of London; Mihalis A. Nicolaou is with the Computation-based Science and Technology Research Center at The Cyprus Institute 备注:Proceedings of the IEEE (2021) 链接:https://arxiv.org/abs/2107.03436 摘要:张量或多维数组是一种数据结构,可以自然地表示多维的可视数据。张量能够有效地捕捉结构化的、潜在的语义空间和高阶的交互作用,在计算机视觉领域有着悠久的应用历史。随着计算机视觉深度学习范式转换的到来,张量变得更加重要。事实上,现代深度学习体系结构中的基本成分,如卷积和注意机制,可以很容易地看作张量映射。实际上,张量方法在深度学习中的应用越来越广泛,包括设计内存和计算效率高的网络体系结构,提高对随机噪声和对抗性攻击的鲁棒性,以及帮助对深度网络的理论理解。本文在表征学习和深度学习的背景下对张量和张量方法进行了深入而实用的回顾,特别侧重于视觉数据分析和计算机视觉应用。具体地说,除了基于张量的可视化数据分析方法的基础工作之外,我们还关注最近的发展,这些发展导致了张量方法的逐渐增加,特别是在深度学习体系结构中,以及它们在计算机视觉应用中的含义。为了进一步使新手能够快速掌握这些概念,我们提供了Python笔记本,涵盖了本文的关键方面,并用TensorLy一步一步地实现它们。 摘要:Tensors, or multidimensional arrays, are data structures that can naturally represent visual data of multiple dimensions. Inherently able to efficiently capture structured, latent semantic spaces and high-order interactions, tensors have a long history of applications in a wide span of computer vision problems. With the advent of the deep learning paradigm shift in computer vision, tensors have become even more fundamental. Indeed, essential ingredients in modern deep learning architectures, such as convolutions and attention mechanisms, can readily be considered as tensor mappings. In effect, tensor methods are increasingly finding significant applications in deep learning, including the design of memory and compute efficient network architectures, improving robustness to random noise and adversarial attacks, and aiding the theoretical understanding of deep networks. This article provides an in-depth and practical review of tensors and tensor methods in the context of representation learning and deep learning, with a particular focus on visual data analysis and computer vision applications. Concretely, besides fundamental work in tensor-based visual data analysis methods, we focus on recent developments that have brought on a gradual increase of tensor methods, especially in deep learning architectures, and their implications in computer vision applications. To further enable the newcomer to grasp such concepts quickly, we provide companion Python notebooks, covering key aspects of the paper and implementing them, step-by-step with TensorLy.
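作为文中配套Python笔记本所涉及内容的一个最小示例,下面用TensorLy对一个三阶张量做CP(PARAFAC)分解并计算重构误差;秩 rank=8 为示例取值:

```python
# 示意代码:TensorLy 的 CP 分解,由因子矩阵重构张量并评估误差
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

X = tl.tensor(np.random.rand(10, 20, 30))
weights, factors = parafac(X, rank=8)          # 返回权重与三个因子矩阵
X_hat = tl.cp_to_tensor((weights, factors))    # 由因子重构张量
print([f.shape for f in factors])              # [(10, 8), (20, 8), (30, 8)]
print(float(tl.norm(X - X_hat) / tl.norm(X)))  # 相对重构误差
```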

【6】 Elastic deformation of optical coherence tomography images of diabetic macular edema for deep-learning models training: how far to go? 标题:深度学习模型训练中糖尿病黄斑水肿光学相干断层扫描图像的弹性变形:还有多远?

作者:Daniel Bar-David,Laura Bar-David,Yinon Shapira,Rina Leibu,Dalia Dori,Ronit Schneor,Anath Fischer,Shiri Soudry 机构:Technion Israel Institute of Technology, Haifa, Israel, Department of Ophthalmology, Rambam Health Care Campus, Haifa, Israel, Discipline of Ophthalmology and Visual Science, University of Adelaide, Adelaide, South Australia 链接:https://arxiv.org/abs/2107.03651 摘要:探讨光学相干断层扫描(OCT)图像弹性形变在糖尿病性黄斑水肿(DME)深度学习模型建立中的应用价值。 摘要:To explore the clinical validity of elastic deformation of optical coherence tomography (OCT) images for data augmentation in the development of deep-learning model for detection of diabetic macular edema (DME).

其他(7篇)

【1】 Adiabatic Quantum Graph Matching with Permutation Matrix Constraints 标题:具有置换矩阵约束的绝热量子图匹配

作者:Marcel Seelbach Benkner,Vladislav Golyanik,Christian Theobalt,Michael Moeller 机构:University of Siegen, Max Planck Institute for Informatics, SIC 备注:None 链接:https://arxiv.org/abs/2107.04032 摘要:三维形状和图像的匹配问题颇具挑战性,因为它们通常被描述为具有置换矩阵约束的组合二次指派问题(QAP),而这是NP困难的。在这项工作中,我们用新兴的量子计算技术来解决这类问题,并将QAP重新表述为几种适合在量子硬件上高效执行的无约束问题。我们研究了在可映射到量子硬件的二次无约束二元优化问题中注入置换矩阵约束的几种方法。我们专注于获得足够的谱间隙,这进一步增加了在一次运行中测量到最优解和合法置换矩阵的概率。我们在量子计算机D-Wave 2000Q(2^11量子比特,绝热)上进行了实验。尽管在模拟绝热量子计算和在实际量子硬件上执行之间观察到了差异,相比我们实验中的其他惩罚方法,我们对置换矩阵约束的重新表述提高了数值计算的鲁棒性。该算法在未来的量子计算体系结构中具有向更高维度扩展的潜力,为解决三维计算机视觉和图形学中的匹配问题开辟了多个新的方向。 摘要:Matching problems on 3D shapes and images are challenging as they are frequently formulated as combinatorial quadratic assignment problems (QAPs) with permutation matrix constraints, which are NP-hard. In this work, we address such problems with emerging quantum computing technology and propose several reformulations of QAPs as unconstrained problems suitable for efficient execution on quantum hardware. We investigate several ways to inject permutation matrix constraints in a quadratic unconstrained binary optimization problem which can be mapped to quantum hardware. We focus on obtaining a sufficient spectral gap, which further increases the probability to measure optimal solutions and valid permutation matrices in a single run. We perform our experiments on the quantum computer D-Wave 2000Q (2^11 qubits, adiabatic). Despite the observed discrepancy between simulated adiabatic quantum computing and execution on real quantum hardware, our reformulation of permutation matrix constraints increases the robustness of the numerical computations over other penalty approaches in our experiments. The proposed algorithm has the potential to scale to higher dimensions on future quantum computing architectures, which opens up multiple new directions for solving matching problems in 3D computer vision and graphics.
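置换矩阵约束注入QUBO的做法可以在小规模上直接验证:对每行、每列加上罚项 lam·(Σx−1)^2 并展开成二次型。下面的NumPy示意中,成本矩阵 W 与罚系数 lam 均为示例假设(lam 需足够大才能压制成本项),小规模下用穷举检查最优解确实是合法置换矩阵:

```python
# 示意代码:向 QUBO 注入"每行/每列恰好一个 1"的罚项,并穷举验证
import numpy as np
from itertools import product

n, lam = 3, 10.0
rng = np.random.default_rng(0)
W = rng.random((n * n, n * n))            # 某个示例 QAP 的二次成本(此处随机生成)
Q = W.copy()
idx = lambda i, j: i * n + j
for k in range(n):                        # 对第 k 行和第 k 列分别施加 (Σx-1)^2 罚项
    for line in ([idx(k, j) for j in range(n)], [idx(i, k) for i in range(n)]):
        for a in line:
            Q[a, a] -= lam                # 展开后二元变量满足 x^2 = x 的线性部分
        for a, b in product(line, line):
            if a != b:
                Q[a, b] += lam            # 二次交叉项(每个无序对计两次)

best = min(product([0, 1], repeat=n * n),
           key=lambda x: np.array(x) @ Q @ np.array(x))
print(np.array(best).reshape(n, n))       # 小规模下最优解是一个置换矩阵
```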

【2】 Active Safety Envelopes using Light Curtains with Probabilistic Guarantees 标题:基于光幕的具有概率保证的主动安全包络

作者:Siddharth Ancha,Gaurav Pathak,Srinivasa G. Narasimhan,David Held 机构:Carnegie Mellon University, Pittsburgh PA , USA 备注:18 pages, Published at Robotics: Science and Systems (RSS) 2021 链接:https://arxiv.org/abs/2107.04000 摘要:为了在未知环境中安全导航,机器人必须准确地感知动态障碍物。我们不使用激光雷达传感器直接测量场景深度,而是探索使用更便宜、分辨率更高的传感器:可编程光幕。光幕是一种可控的深度传感器,只沿用户选择的表面进行感知。我们使用光幕来估计场景的安全包络:一个将机器人与所有障碍物隔开的假想表面。我们表明,生成在(来自特定分布的)随机位置进行感知的光幕,可以快速发现含有未知物体的场景的安全包络。重要的是,我们为使用随机光幕检测到障碍物的概率给出了理论上的安全保证。我们将随机光幕与基于机器学习的模型相结合,高效地预测和跟踪安全包络的运动。我们的方法在提供概率安全保证的同时准确估计安全包络,可用于验证机器人感知系统检测和规避动态障碍物的有效性。我们在模拟城市驾驶环境和带有移动行人的真实环境中使用光幕装置对我们的方法进行了评估,结果表明我们可以高效且有效地估计安全包络。项目网站:https://siddancha.github.io/projects/active-safety-envelopes-with-guarantees 摘要:To safely navigate unknown environments, robots must accurately perceive dynamic obstacles. Instead of directly measuring the scene depth with a LiDAR sensor, we explore the use of a much cheaper and higher resolution sensor: programmable light curtains. Light curtains are controllable depth sensors that sense only along a surface that a user selects. We use light curtains to estimate the safety envelope of a scene: a hypothetical surface that separates the robot from all obstacles. We show that generating light curtains that sense random locations (from a particular distribution) can quickly discover the safety envelope for scenes with unknown objects. Importantly, we produce theoretical safety guarantees on the probability of detecting an obstacle using random curtains. We combine random curtains with a machine learning based model that forecasts and tracks the motion of the safety envelope efficiently. Our method accurately estimates safety envelopes while providing probabilistic safety guarantees that can be used to certify the efficacy of a robot perception system to detect and avoid dynamic obstacles. We evaluate our approach in a simulated urban driving environment and a real-world environment with moving pedestrians using a light curtain device and show that we can estimate safety envelopes efficiently and effectively. Project website: https://siddancha.github.io/projects/active-safety-envelopes-with-guarantees

【3】 Technical Report for Valence-Arousal Estimation in ABAW2 Challenge 标题:ABAW2挑战赛中效价-唤醒估计的技术报告

作者:Hong-Xia Xie,I-Hsuan Li,Ling Lo,Hong-Han Shuai,Wen-Huang Cheng 备注:arXiv admin note: substantial text overlap with arXiv:2105.01502 链接:https://arxiv.org/abs/2107.03891 摘要:在这项工作中,我们描述了我们在ABAW2 ICCV-2021竞赛中解决效价-唤醒估计挑战的方法。竞赛组织者提供了野外环境下的Aff-Wild2数据集,供参与者分析真实生活场景中的情感行为。我们使用双流模型分别从外观和动作中学习情感特征。为了解决数据不平衡问题,我们采用标签分布平滑(LDS)对标签重新加权。在Aff-wild2数据集的验证集上,我们提出的方法在效价和唤醒上分别获得了0.591和0.617的一致性相关系数(CCC)。 摘要:In this work, we describe our method for tackling the valence-arousal estimation challenge from ABAW2 ICCV-2021 Competition. The competition organizers provide an in-the-wild Aff-Wild2 dataset for participants to analyze affective behavior in real-life settings. We use a two stream model to learn emotion features from appearance and action respectively. To solve data imbalanced problem, we apply label distribution smoothing (LDS) to re-weight labels. Our proposed method achieves Concordance Correlation Coefficient (CCC) of 0.591 and 0.617 for valence and arousal on the validation set of Aff-wild2 dataset.
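文中用到的标签分布平滑(LDS,可参见 Yang et al., 2021)可以这样示意:用高斯核平滑经验标签直方图得到"有效密度",再按其倒数为样本加权;标签分布、分箱数与核宽等均为示例假设:

```python
# 示意代码:对连续标签做 LDS 重加权,稀有标签获得更大的损失权重
import numpy as np
from scipy.ndimage import gaussian_filter1d

labels = np.random.beta(2, 5, size=2000)                   # 假设的连续效价标签,取值 [0,1]
bins = np.linspace(0, 1, 51)
hist, _ = np.histogram(labels, bins=bins)
smoothed = gaussian_filter1d(hist.astype(float), sigma=2)  # LDS:高斯核平滑后的有效密度
density = smoothed[np.clip(np.digitize(labels, bins) - 1, 0, len(hist) - 1)]
weights = 1.0 / np.maximum(density, 1e-6)
weights *= len(weights) / weights.sum()                    # 归一化,均值为 1
print(weights.min(), weights.max())
```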

【4】 Instance-Level Relative Saliency Ranking with Graph Reasoning 标题:基于图推理的实例级相对显著性排序

作者:Nian Liu,Long Li,Wangbo Zhao,Junwei Han,Ling Shao 备注:TPAMI under review 链接:https://arxiv.org/abs/2107.03824 摘要:传统的显著目标检测模型不能区分不同显著目标的重要性。最近有两项工作提出通过给不同对象分配不同的显著度来检测显著性排序。然而,其中一个模型不能区分对象实例,而另一个模型更侧重于推断序列化的注意力转移顺序。在本文中,我们研究了一个实际的问题设置:需要同时分割显著实例并推断其相对显著性排序。我们提出了一个新的统一模型作为第一个端到端的解决方案,其中首先使用改进的Mask R-CNN来分割显著实例,然后添加显著性排序分支来推断相对显著性。对于相对显著性排序,我们通过组合四个图来构建一个新的图推理模块,分别引入实例交互关系、局部对比度、全局对比度和高级语义先验。我们还提出了一种新的损失函数来有效地训练显著性排序分支。此外,本文针对该任务提出了一个新的数据集和评价指标,旨在推动这一领域的研究。最后,实验结果表明,本文提出的模型比以往的方法更有效。文中还举例说明了它在自适应图像重定向中的实际应用。 摘要:Conventional salient object detection models cannot differentiate the importance of different salient objects. Recently, two works have been proposed to detect saliency ranking by assigning different degrees of saliency to different objects. However, one of these models cannot differentiate object instances and the other focuses more on sequential attention shift order inference. In this paper, we investigate a practical problem setting that requires simultaneously segment salient instances and infer their relative saliency rank order. We present a novel unified model as the first end-to-end solution, where an improved Mask R-CNN is first used to segment salient instances and a saliency ranking branch is then added to infer the relative saliency. For relative saliency ranking, we build a new graph reasoning module by combining four graphs to incorporate the instance interaction relation, local contrast, global contrast, and a high-level semantic prior, respectively. A novel loss function is also proposed to effectively train the saliency ranking branch. Besides, a new dataset and an evaluation metric are proposed for this task, aiming at pushing forward this field of research. Finally, experimental results demonstrate that our proposed model is more effective than previous methods. We also show an example of its practical usage on adaptive image retargeting.

【5】 Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs 标题:专家协作:以100M FLOPs在ImageNet上实现80% Top-1准确率

作者:Yikang Zhang,Zhuo Chen,Zhao Zhong 机构:Huawei 链接:https://arxiv.org/abs/2107.03815 摘要:在本文中,我们提出了一个专家协作(CoE)框架,汇集多个网络的专业知识以实现共同目标。每个专家都是一个独立的网络,在数据集的独特部分拥有专长,这增强了集体能力。给定一个样本,由委托器(delegator)选择一个专家,同时输出一个粗略预测以支持提前终止。为了实现这个框架,我们提出了三个模块来促使每个模型发挥作用,即权重生成模块(WGM)、标签生成模块(LGM)和方差计算模块(VCM)。我们的方法在ImageNet上达到了最先进的性能,以194M FLOPs实现80.7%的top-1精度。结合PWLU激活函数和CondConv,CoE首次以仅100M FLOPs达到80.0%的精度。更重要的是,我们的方法是硬件友好的,与一些现有的条件计算方法相比实现了3-6倍的加速。 摘要:In this paper, we propose a Collaboration of Experts (CoE) framework to pool together the expertise of multiple networks towards a common aim. Each expert is an individual network with expertise on a unique portion of the dataset, which enhances the collective capacity. Given a sample, an expert is selected by the delegator, which simultaneously outputs a rough prediction to support early termination. To fulfill this framework, we propose three modules to impel each model to play its role, namely weight generation module (WGM), label generation module (LGM) and variance calculation module (VCM). Our method achieves the state-of-the-art performance on ImageNet, 80.7% top-1 accuracy with 194M FLOPs. Combined with PWLU activation function and CondConv, CoE further achieves the accuracy of 80.0% with only 100M FLOPs for the first time. More importantly, our method is hardware friendly and achieves a 3-6x speedup compared with some existing conditional computation approaches.

【6】 Complete Scanning Application Using OpenCv 标题:基于OpenCv的完整扫描应用程序

作者:Ayushe Gangal,Peeyush Kumar,Sunita Kumari 机构:Dept. of CSE, GB Pant Govt. Engineering College, New Delhi-, India 备注:10 pages, 14 figures 链接:https://arxiv.org/abs/2107.03700 摘要:在本文中,我们结合了NumPy库和OpenCv(一个面向计算机视觉应用的开源库)提供的各种基本功能,例如彩色图像到灰度的转换、计算阈值、寻找轮廓,并利用这些轮廓点,使用Python 3.7对用户输入的图像进行透视变换。其他功能还包括裁剪、旋转和保存。所有这些功能和特性一步一步地实现,就构成一个完整的扫描应用程序。应用流程包括以下步骤:寻找轮廓,应用透视变换并提亮图像,自适应阈值与滤波去噪,以及结合旋转功能与透视变换的特殊裁剪算法。所描述的技术在各种样本上进行了实现。 摘要:In the following paper, we have combined the various basic functionalities provided by the NumPy library and OpenCv library, which is an open-source library for Computer Vision applications, like conversion of colored images to grayscale, calculating threshold, finding contours and using those contour points to take perspective transform of the image inputted by the user, using Python version 3.7. Additional features include cropping, rotating and saving as well. All these functions and features, when implemented step by step, result in a complete scanning application. The applied procedure involves the following steps: Finding contours, applying Perspective transform and brightening the image, Adaptive Thresholding and applying filters for noise cancellation, and Rotation features and perspective transform for a special cropping algorithm. The described technique is implemented on various samples.
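扫描流程的核心两步,即找最大四边形轮廓、再做透视变换,可以用OpenCv简洁实现。下面是一个示意版本(阈值等为经验取值,'doc.jpg' 为假设的输入文件,非原应用的完整实现):

```python
# 示意代码:文档扫描的轮廓检测 + 透视矫正
import cv2
import numpy as np

def scan(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(cv2.GaussianBlur(gray, (5, 5), 0), 75, 200)
    cnts, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    for c in sorted(cnts, key=cv2.contourArea, reverse=True):
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(approx) == 4:                       # 取面积最大的四边形作为文档边界
            pts = approx.reshape(4, 2).astype(np.float32)
            s, d = pts.sum(1), np.diff(pts, axis=1).ravel()
            src = np.float32([pts[s.argmin()], pts[d.argmin()],   # 左上、右上
                              pts[s.argmax()], pts[d.argmax()]])  # 右下、左下
            w, h = 600, 800
            dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
            M = cv2.getPerspectiveTransform(src, dst)
            return cv2.warpPerspective(image, M, (w, h))
    return image

# 用法(假设本地存在输入图像):warped = scan(cv2.imread('doc.jpg'))
```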

【7】 A Dataset and Method for Hallux Valgus Angle Estimation Based on Deep Learning 标题:一种基于深度学习的拇趾外翻角估计数据集和方法

作者:Ningyuan Xu,Jiayan Zhuang,Yaojun Wu,Jiangjian Xiao 机构:University of Chinese Academy of Sciences, No., Yuquan Road, Shijingshan District, Beijing, China, Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, No., Zhongguan West Road, Zhenhai District, Ningbo City, Zhejiang 备注:7pages, 12 figures 链接:https://arxiv.org/abs/2107.03640 摘要:要对拇趾外翻(HV)这种常见的前足畸形做出合理治疗,角度测量必不可少。然而,它仍然依赖于人工标注和测量,既费时,有时也不可靠。将这一过程自动化是一个值得关注的问题。然而,该领域缺乏数据集,而在姿态估计中取得巨大成功的基于关键点的方法并不能直接适用。为了解决这些问题,我们制作了一个数据集,并开发了一种基于深度学习和线性回归的算法。该方法对真实标注表现出很强的拟合能力。 摘要:Angular measurement is essential for reasonable treatment of Hallux valgus (HV), a common forefoot deformity. However, it still depends on manual labeling and measurement, which is time-consuming and sometimes unreliable. Automating this process is a thing of concern. However, this field lacks datasets, and the keypoint-based methods that have achieved great success in pose estimation are not directly applicable. To solve these problems, we built a dataset and developed an algorithm based on deep learning and linear regression. It shows great fitting ability to the ground truth.
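拿到关键点之后,角度本身只是两条骨轴向量的夹角。下面的示意假设已有第一跖骨轴与近节趾骨轴各两个关键点(论文中这些点由深度网络与线性回归给出):

```python
# 示意代码:由四个关键点计算拇趾外翻角(HVA),输入点为假设示例
import numpy as np

def hallux_valgus_angle(m1, m2, p1, p2):
    """m1->m2: 第一跖骨轴;p1->p2: 近节趾骨轴;返回两轴夹角(度)。"""
    u = np.asarray(m2, float) - np.asarray(m1, float)
    v = np.asarray(p2, float) - np.asarray(p1, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

print(hallux_valgus_angle((0, 0), (0, 10), (0, 10), (3, 20)))  # 约 16.7 度
```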
