
Statistics arXiv Digest [Jul 5]

Official account: arXiv Daily Academic Digest
Published 2021-07-27 10:21:43
This article is included in the column: arXiv Daily Academic Digest

Visit www.arxivdaily.com for digests with abstracts covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, bookmarking, and posting features.

stat (Statistics): 31 papers in total

【1】 Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE

Authors: Junya Chen, Zhe Gan, Xuan Li, Qing Guo, Liqun Chen, Shuyang Gao, Tagyoung Chung, Yi Xu, Belinda Zeng, Wenlian Lu, Fan Li, Lawrence Carin, Chenyang Tao
Affiliations: Duke University; Microsoft; Virginia Tech; Amazon; Fudan University; KAUST
Link: https://arxiv.org/abs/2107.01152
Abstract: InfoNCE-based contrastive representation learners, such as SimCLR, have been tremendously successful in recent years. However, these contrastive schemes are notoriously resource demanding, as their effectiveness breaks down with small-batch training (i.e., the log-K curse, where K is the batch size). In this work, we reveal mathematically why contrastive learners fail in the small-batch-size regime, and present a novel, simple, non-trivial contrastive objective named FlatNCE, which fixes this issue. Unlike InfoNCE, FlatNCE no longer explicitly appeals to a discriminative classification goal for contrastive learning. Theoretically, we show FlatNCE is the mathematical dual formulation of InfoNCE, thus bridging the classical literature on energy modeling; empirically, we demonstrate that, with minimal modification of code, FlatNCE enables an immediate performance boost independent of subject-matter engineering efforts. The significance of this work is furthered by the powerful generalization of contrastive learning techniques and the introduction of new tools to monitor and diagnose contrastive training. We substantiate our claims with empirical evidence on CIFAR10, ImageNet, and other datasets, where FlatNCE consistently outperforms InfoNCE.
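The log-K ceiling the abstract refers to is easy to see numerically: the InfoNCE mutual-information estimate can never exceed log K, no matter how discriminative the critic is. A minimal NumPy sketch (the scoring setup and function name are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def infonce_mi_estimate(scores):
    # scores: (n, K) critic scores per sample; column 0 is the positive pair.
    # InfoNCE bound: E[log(K * softmax weight of the positive)]
    n, K = scores.shape
    log_softmax = scores[:, 0] - np.log(np.exp(scores).sum(axis=1))
    return np.log(K) + log_softmax.mean()

# Even with a near-perfect critic (positive score far above negatives),
# the estimate approaches but never exceeds log K -- the "log-K curse".
n, K = 512, 16
scores = rng.normal(size=(n, K))
scores[:, 0] += 20.0                  # near-perfect discrimination
est = infonce_mi_estimate(scores)
print(est, np.log(K))
```

This is why small batches (small K) cap the achievable mutual-information estimate, motivating objectives such as FlatNCE that sidestep the classification formulation.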

【2】 Depth-based Outlier Detection for Grouped Smart Meters: a Functional Data Analysis Toolbox

Authors: A. Elías, J. M. Morales, S. Pineda
Link: https://arxiv.org/abs/2107.01144
Abstract: Smart metering infrastructures collect data almost continuously in the form of fine-grained long time series. These massive time series often have common daily patterns that are repeated between similar days or seasons and shared between grouped meters. Within this context, we propose a method to highlight individuals with abnormal daily dependency patterns, which we term evolution outliers. To this end, we approach the problem from the standpoint of Functional Data Analysis (FDA), by treating each daily record as a function or curve. We then focus on the morphological aspects of the observed curves, such as daily magnitude, daily shape, derivatives, and inter-day evolution. The proposed method for evolution outliers relies on the concept of functional depth, which has been a cornerstone in the FDA literature for building shape and magnitude outlier detection methods. In conjunction with our evolution outlier proposal, these methods provide an outlier detection toolbox for smart meter data that covers a wide palette of functional outlier classes. We illustrate the outlier identification ability of this toolbox using actual smart metering data corresponding to photovoltaic energy generation and circuit voltage records.
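The functional-depth idea underlying this toolbox can be sketched with the modified band depth, a standard depth notion from the FDA literature (a generic illustration, not the authors' implementation): curves lying inside the bands spanned by many pairs of other curves receive high depth, while magnitude outliers receive low depth.

```python
import numpy as np

rng = np.random.default_rng(1)

def modified_band_depth(X):
    """Modified band depth for n curves on a common grid.
    X: (n, T). Returns one depth per curve (higher = more central)."""
    n, T = X.shape
    depths = np.zeros(n)
    for i in range(n):
        inside = 0.0
        for j in range(n):
            for k in range(j + 1, n):
                lo = np.minimum(X[j], X[k])
                hi = np.maximum(X[j], X[k])
                # fraction of the grid where curve i lies inside band (j, k)
                inside += np.mean((lo <= X[i]) & (X[i] <= hi))
        depths[i] = inside / (n * (n - 1) / 2)
    return depths

# 20 noisy sinusoidal "daily curves"; curve 0 is shifted upward.
t = np.linspace(0, 1, 50)
X = np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=(20, 50))
X[0] += 2.0                      # magnitude outlier
d = modified_band_depth(X)
print(np.argmin(d))              # the shifted curve has the lowest depth
```

Depth-based cutoffs on such scores are what turn this into an outlier detection rule for grouped meters.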

【3】 Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization

Authors: Qing Guo, Junya Chen, Dong Wang, Yuewei Yang, Xinwei Deng, Lawrence Carin, Fan Li, Chenyang Tao
Affiliations: Duke University; Virginia Tech; KAUST
Link: https://arxiv.org/abs/2107.01131
Abstract: Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization. Our investigation not only yields a new unified theoretical framework encompassing popular variational MI bounds but also leads to a novel, simple, and powerful contrastive MI estimator named FLO. Theoretically, we show that the FLO estimator is tight, and it provably converges under stochastic gradient descent. Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.

【4】 Asymptotic Analysis of Statistical Estimators related to MultiGraphex Processes under Misspecification

Authors: Zacharie Naulet, Judith Rousseau, François Caron
Affiliations: Université Paris-Saclay, Laboratoire de mathématiques d'Orsay, Orsay, France; University of Oxford, Department of Statistics, Oxford, UK
Link: https://arxiv.org/abs/2107.01120
Abstract: This article studies the asymptotic properties of Bayesian or frequentist estimators of a vector of parameters related to structural properties of sequences of graphs. The estimators studied originate from a particular class of graphex model introduced by Caron and Fox. The analysis is, however, performed here under very weak assumptions on the underlying data generating process, which may be different from the model of Caron and Fox or from a graphex model. In particular, we consider generic sparse graph models, with unbounded degree, whose degree distribution satisfies some assumptions. We show that one can relate the limit of the estimator of one of the parameters to the sparsity constant of the true graph generating process. When taking a Bayesian approach, we also show that the posterior distribution is asymptotically normal. We discuss situations where classical random graph models such as configuration models, sparse graphon models, edge exchangeable models, or graphon processes satisfy our assumptions.

【5】 Memory Efficient Meta-Learning with Large Images

Authors: John Bronskill, Daniela Massiceti, Massimiliano Patacchiola, Katja Hofmann, Sebastian Nowozin, Richard E. Turner
Affiliations: University of Cambridge; Microsoft Research
Link: https://arxiv.org/abs/2107.01105
Abstract: Meta-learning approaches to few-shot classification are computationally efficient at test time, requiring just a few optimization steps or a single forward pass to learn a new task, but they remain highly memory-intensive to train. This limitation arises because a task's entire support set, which can contain up to 1000 images, must be processed before an optimization step can be taken. Harnessing the performance gains offered by large images thus requires either parallelizing the meta-learner across multiple GPUs, which may not be available, or making trade-offs between task and image size when memory constraints apply. We improve on both options by proposing LITE, a general and memory-efficient episodic training scheme that enables meta-training on large tasks composed of large images on a single GPU. We achieve this by observing that the gradients for a task can be decomposed into a sum of gradients over the task's training images. This enables us to perform a forward pass on a task's entire training set but realize significant memory savings by back-propagating only a random subset of these images, which we show is an unbiased approximation of the full gradient. We use LITE to train meta-learners and demonstrate new state-of-the-art accuracy on the real-world ORBIT benchmark and 3 of the 4 parts of the challenging VTAB+MD benchmark relative to leading meta-learners. LITE also enables meta-learners to be competitive with transfer learning approaches but at a fraction of the test-time computational cost, thus serving as a counterpoint to the recent narrative that transfer learning is all you need for few-shot classification.
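The key observation in LITE, that a task's gradient is a sum of per-image gradients, so back-propagating a rescaled random subset gives an unbiased estimate, can be checked numerically on a toy least-squares "task" (a generic sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "task": least-squares loss L(w) = sum_i (x_i . w - y_i)^2 / 2.
N, d, m = 100, 5, 10            # task size, parameter dim, backprop subset size
X = rng.normal(size=(N, d))
w = rng.normal(size=d)
y = X @ rng.normal(size=d)

def per_example_grads(w):
    r = X @ w - y               # residuals
    return r[:, None] * X       # (N, d): gradient contribution of each example

G = per_example_grads(w)
full_grad = G.sum(axis=0)

# LITE-style estimator: back-propagate only a random subset of m examples,
# rescaled by N/m. Averaging over many subset draws recovers the full gradient.
draws = np.stack([
    G[rng.choice(N, m, replace=False)].sum(axis=0) * (N / m)
    for _ in range(20000)
])
diff = np.abs(draws.mean(axis=0) - full_grad).max()
print(diff)                     # small relative to the gradient's magnitude
```

In the real method the memory saving comes from only retaining activations for the sampled subset during the backward pass; the rescaling by N/m is what keeps the estimator unbiased.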

【6】 Generalized Multivariate Signs for Nonparametric Hypothesis Testing in High Dimensions

Authors: Subhabrata Majumdar, Snigdhansu Chatterjee
Affiliations: Data Science and AI Research, AT&T Chief Data Office, New York, NY; School of Statistics, University of Minnesota Twin Cities, Minneapolis, MN
Link: https://arxiv.org/abs/2107.01103
Abstract: High-dimensional data, where the dimension of the feature space is much larger than the sample size, arise in a number of statistical applications. In this context, we construct the generalized multivariate sign transformation, defined as a vector divided by its norm. For different choices of the norm function, the resulting transformed vector adapts to certain geometrical features of the data distribution. Building on this idea, we obtain one-sample and two-sample testing procedures for mean vectors of high-dimensional data using these generalized sign vectors. These tests are based on U-statistics using kernel inner products, do not require prohibitive assumptions, and are amenable to a fast randomization-based implementation. Through experiments in a number of data settings, we show that tests using generalized signs display higher power than existing tests, while maintaining nominal type-I error rates. Finally, we provide example applications on the MNIST and Minnesota Twin Studies genomic data.
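The transformation itself is simple: each observation is divided by its norm, and different norm choices place the data on different unit spheres. A small illustrative sketch (the function name is ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def generalized_sign(X, norm=2):
    """Map each row x to x / ||x|| for a chosen norm (illustrative sketch
    of the generalized multivariate sign transformation)."""
    norms = np.linalg.norm(X, ord=norm, axis=1, keepdims=True)
    return X / norms

X = rng.normal(size=(200, 50))
S2 = generalized_sign(X, norm=2)   # points on the unit L2 sphere
S1 = generalized_sign(X, norm=1)   # points on the unit L1 sphere

# In the spirit of a sign test: under a distribution centered at zero,
# the average sign vector should be close to the origin.
print(np.linalg.norm(S2.mean(axis=0)))
```

A mean test can then be built on inner products of these sign vectors (a U-statistic), with significance assessed by randomization.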

【7】 Homogeneity in the instrument-treatment association is not sufficient for the Wald estimand to equal the average causal effect when the exposure is continuous

Authors: Fernando Pires Hartwig, Linbo Wang, George Davey Smith, Neil Martin Davies
Affiliations: Postgraduate Program in Epidemiology, Federal University of Pelotas, Pelotas, Brazil; MRC Integrative Epidemiology Unit, University of Bristol, Bristol, United Kingdom; Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
Link: https://arxiv.org/abs/2107.01070
Abstract: Background: Interpreting results of instrumental variable (IV) analysis as well-defined causal estimands requires further assumptions in addition to the core IV assumptions of relevance, independence, and the exclusion restriction. One such assumption is additive homogeneity in the instrument-exposure association, which is postulated to render the conventional Wald estimand equal to the average causal effect (ACE). Methods: We used theoretical arguments and an illustrative example to assess whether instrument-exposure additive homogeneity identifies the ACE when the exposure is continuous and the instrument is either binary or continuous. Results: Instrument-exposure additive homogeneity is insufficient to identify the ACE when the instrument is binary, the exposure is continuous, and the effect of the exposure on the outcome is non-linear on the additive scale. If the exposure is binary, the exposure-outcome effect is necessarily additive linear, so the homogeneity condition is sufficient. For a continuous instrument, instrument-exposure additive homogeneity is sufficient regardless of whether the exposure-outcome effect is linear or not. Conclusions: For binary instruments, additive homogeneity in the instrument-exposure association identifies the ACE if the exposure is also binary. Otherwise, additional assumptions (such as additive linearity of the exposure-outcome effect) are required.

【8】 A Robust Seemingly Unrelated Regressions For Row-Wise And Cell-Wise Contamination

Authors: Giovanni Saraceno, Fatemah Alqallaf, Claudio Agostinelli
Affiliations: Department of Mathematics, University of Trento, Trento, Italy; Department of Statistics and Operations Research, Kuwait University, Kuwait
Comments: 18 pages, 6 figures, and 3 tables
Link: https://arxiv.org/abs/2107.00975
Abstract: The Seemingly Unrelated Regressions (SUR) model is a widely used estimation procedure in econometrics, insurance, and finance, where, very often, the regression model contains more than one equation. The unknown parameters, regression coefficients, and covariances among the error terms are estimated using algorithms based on Generalized Least Squares or Maximum Likelihood, and the method, as a whole, is very sensitive to outliers. To overcome this problem, M-estimators and S-estimators have been proposed in the literature, together with fast algorithms. However, these procedures are only able to cope with row-wise outliers in the error terms, and their performance becomes very poor in the presence of cell-wise outliers and as the number of equations increases. A new robust approach is proposed which performs well under both contamination types and is fast to compute. Illustrations based on Monte Carlo simulations and a real data example are provided.

【9】 Time series models with infinite-order partial copula dependence

Authors: Martin Bladt, Alexander J. McNeil
Affiliations: University of Lausanne; The York Management School, University of York
Comments: 30 pages, 4 figures
Link: https://arxiv.org/abs/2107.00960
Abstract: Stationary and ergodic time series can be constructed using an s-vine decomposition based on sets of bivariate copula functions. The extension of such processes to infinite copula sequences is considered and shown to yield a rich class of models that generalizes Gaussian ARMA and ARFIMA processes to allow both non-Gaussian marginal behaviour and a non-Gaussian description of the serial partial dependence structure. Extensions of classical causal and invertible representations of linear processes to general s-vine processes are proposed and investigated. A practical and parsimonious method for parameterizing s-vine processes using the Kendall partial autocorrelation function is developed. The potential of the resulting models to give improved statistical fits in many applications is indicated with an example using macroeconomic data.

【10】 A Practical Guide to Counterfactual Estimators for Causal Inference with Time-Series Cross-Sectional Data

Authors: Licheng Liu, Ye Wang, Yiqing Xu
Affiliations: MIT; NYU; Stanford
Link: https://arxiv.org/abs/2107.00856
Abstract: This paper introduces a unified framework of counterfactual estimation for time-series cross-sectional data, which estimates the average treatment effect on the treated by directly imputing treated counterfactuals. Examples include the fixed effects counterfactual estimator, interactive fixed effects counterfactual estimator, and matrix completion estimator. These estimators provide more reliable causal estimates than conventional two-way fixed effects models when treatment effects are heterogeneous or unobserved time-varying confounders exist. Under this framework, we propose a new dynamic treatment effects plot, as well as several diagnostic tests, to help researchers gauge the validity of the identifying assumptions. We illustrate these methods with two political economy examples and develop an open-source package, fect, in both R and Stata to facilitate implementation.
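The fixed-effects counterfactual estimator in this framework can be sketched in a few lines: fit the two-way fixed-effects model on untreated observations only, impute the treated counterfactuals, and average the gaps (a simplified illustration of the idea, not the fect package):

```python
import numpy as np

rng = np.random.default_rng(4)

N, T, tau = 30, 10, 2.0
alpha = rng.normal(size=N)            # unit fixed effects
xi = rng.normal(size=T)               # time fixed effects
D = np.zeros((N, T), dtype=bool)
D[:10, 6:] = True                     # first 10 units treated from period 6 on
Y0 = alpha[:, None] + xi[None, :] + 0.1 * rng.normal(size=(N, T))
Y = Y0 + tau * D                      # observed outcomes with true effect tau

# Fit Y_it = alpha_i + xi_t by least squares on untreated cells only,
# then impute Y_it(0) for treated cells and average the gaps (the ATT).
units = np.repeat(np.arange(N), T)
times = np.tile(np.arange(T), N)
Z = np.zeros((N * T, N + T))          # unit and time dummies
Z[np.arange(N * T), units] = 1.0
Z[np.arange(N * T), N + times] = 1.0
y, d = Y.ravel(), D.ravel()
coef, *_ = np.linalg.lstsq(Z[~d], y[~d], rcond=None)
y0_hat = Z @ coef                     # imputed counterfactuals
att = (y[d] - y0_hat[d]).mean()
print(att)                            # close to the true effect tau = 2
```

Fitting only on untreated cells is what distinguishes this from the conventional two-way fixed-effects regression, which mixes treated and untreated observations when estimating the fixed effects.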

【11】 Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning

Authors: Nan Rosemary Ke, Aniket Didolkar, Sarthak Mittal, Anirudh Goyal, Guillaume Lajoie, Stefan Bauer, Danilo Rezende, Yoshua Bengio, Michael Mozer, Christopher Pal
Affiliations: Max Planck Institute for Intelligent Systems
Link: https://arxiv.org/abs/2107.00848
Abstract: Inducing causal relationships from observations is a classic problem in machine learning. Most work in causality starts from the premise that the causal variables themselves are observed. However, for AI agents such as robots trying to make sense of their environment, the only observables are low-level variables like pixels in images. To generalize well, an agent must induce high-level variables, particularly those which are causal or are affected by causal variables. A central goal for AI and causality is thus the joint discovery of abstract representations and causal structure. However, we note that existing environments for studying causal induction are poorly suited for this objective because they have complicated task-specific causal graphs which are impossible to manipulate parametrically (e.g., number of nodes, sparsity, causal chain length, etc.). In this work, our goal is to facilitate research in learning representations of high-level variables as well as causal structures among them. In order to systematically probe the ability of methods to identify these variables and structures, we design a suite of benchmarking RL environments. We evaluate various representation learning algorithms from the literature and find that explicitly incorporating structure and modularity in models can help causal induction in model-based reinforcement learning.

【12】 Testing Randomization and Relaxed Randomization Assumptions: A Clustering With Side-information Approach

Authors: Kan Chen, Siyu Heng, Qi Long, Bo Zhang
Affiliations: Graduate Group of Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine
Link: https://arxiv.org/abs/2107.00815
Abstract: One central goal of the design of observational studies is to embed non-experimental data into an approximate randomized controlled trial using statistical matching. Researchers then make the randomization assumption in their downstream outcome analysis. For a matched pair design, the randomization assumption states that the treatment assignments across all matched pairs are independent, and that the probability of the first subject in each pair receiving treatment and the other control is the same as that of the first receiving control and the other treatment. In this article, we develop a novel framework for testing the randomization assumption based on solving a clustering problem with side-information using modern statistical learning tools. Our testing framework is nonparametric, finite-sample exact, and distinct from previous proposals in that it can be used to test a relaxed version of the randomization assumption called the biased randomization assumption. One important by-product of our testing framework is a quantity called the residual sensitivity value (RSV), which quantifies the level of minimal residual confounding due to observed covariates not being well matched. We advocate taking RSV into account in the downstream primary analysis. The proposed methodology is illustrated by re-examining a famous observational study concerning the effect of right heart catheterization (RHC) in the initial care of critically ill patients.

【13】 Meta-Learning for Relative Density-Ratio Estimation

Authors: Atsutoshi Kumagai, Tomoharu Iwata, Yasuhiro Fujiwara
Affiliations: NTT Computer and Data Science Laboratories; NTT Communication Science Laboratories
Comments: 17 pages
Link: https://arxiv.org/abs/2107.00801
Abstract: The ratio of two probability densities, called a density-ratio, is a vital quantity in machine learning. In particular, the relative density-ratio, which is a bounded extension of the density-ratio, has received much attention due to its stability and has been used in various applications such as outlier detection and dataset comparison. Existing methods for (relative) density-ratio estimation (DRE) require many instances from both densities. However, sufficient instances are often unavailable in practice. In this paper, we propose a meta-learning method for relative DRE, which estimates the relative density-ratio from a few instances by using knowledge in related datasets. Specifically, given two datasets that consist of a few instances, our model extracts the datasets' information by using neural networks and uses it to obtain instance embeddings appropriate for the relative DRE. We model the relative density-ratio by a linear model on the embedded space, whose global optimum solution can be obtained as a closed-form solution. The closed-form solution enables fast and effective adaptation to a few instances, and its differentiability enables us to train our model such that the expected test error for relative DRE can be explicitly minimized after adapting to a few instances. We empirically demonstrate the effectiveness of the proposed method by using three problems: relative DRE, dataset comparison, and outlier detection.
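The stability advantage of the relative density-ratio comes from its boundedness: r_alpha(x) = p(x) / (alpha * p(x) + (1 - alpha) * q(x)) never exceeds 1/alpha, whereas the plain ratio p(x)/q(x) can explode in the tails. A quick numerical check (illustrative only):

```python
import numpy as np

def relative_density_ratio(p, q, alpha):
    """Relative density-ratio r_alpha = p / (alpha*p + (1-alpha)*q)."""
    return p / (alpha * p + (1 - alpha) * q)

x = np.linspace(-10, 10, 1001)
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)          # N(0, 1) density
q = np.exp(-0.5 * (x - 3)**2) / np.sqrt(2 * np.pi)    # N(3, 1) density

plain = p / q                                         # unbounded in the tails
r_half = relative_density_ratio(p, q, alpha=0.5)
print(plain.max(), r_half.max())   # plain ratio explodes; r_alpha stays <= 2
```

This boundedness is why estimators of the relative ratio behave more stably than estimators of the plain ratio, especially with few instances.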

【14】 A Flexible Joint Model for Multiple Longitudinal Biomarkers and A Time-to-Event Outcome: With Applications to Dynamic Prediction Using Highly Correlated Biomarkers

Authors: Ning Li, Yi Liu, Shanpeng Li, Robert M. Elashoff, Gang Li
Affiliations: Departments of Medicine and Biomathematics, University of California at Los Angeles, Los Angeles, California, U.S.A.; School of Mathematical Sciences, Ocean University of China, Qingdao, China
Comments: Accepted for publication in Biometrical Journal; 13 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:math/0602240 by other authors
Link: https://arxiv.org/abs/2107.00776
Abstract: In biomedical studies it is common to collect data on multiple biomarkers during study follow-up for dynamic prediction of a time-to-event clinical outcome. The biomarkers are typically intermittently measured, missing at some event times, and may be subject to high biological variation, which cannot be readily used as time-dependent covariates in a standard time-to-event model. Moreover, they can be highly correlated if they are from the same biological pathway. To address these issues, we propose a flexible joint model framework that models the multiple biomarkers with a shared latent reduced-rank longitudinal principal component model and correlates the latent process to the event time by the Cox model for dynamic prediction of the event time. The proposed joint model for highly correlated biomarkers is more flexible than some existing methods, since the latent trajectory shared by the multiple biomarkers does not require specification of an a priori parametric time trend and is determined by the data. We derive an Expectation-Maximization (EM) algorithm for parameter estimation, study large-sample properties of the estimators, and adapt the developed method to make dynamic predictions of the time-to-event outcome. The bootstrap is used for standard error estimation and inference. The proposed method is evaluated using simulations and illustrated on lung transplant data to predict chronic lung allograft dysfunction (CLAD) using chemokines measured in bronchoalveolar lavage fluid of the patients.

【15】 Nonnegative Matrix Factorization with Group and Basis Restrictions

Authors: Phillip Shreeves, Jeffrey L. Andrews, Xinchen Deng, Ramie Ali-Adeeb, Andrew Jirasek
Affiliations: University of British Columbia - Okanagan, Department of Statistics, Kelowna, British Columbia; Department of Physics
Comments: 11 pages (introduction to conclusion; 17 total), 6 figures
Link: https://arxiv.org/abs/2107.00744
Abstract: Nonnegative matrix factorization (NMF) is a popular method used to reduce dimensionality in data sets whose elements are nonnegative. It does so by decomposing the data set of interest, $\mathbf{X}$, into two lower-rank nonnegative matrices multiplied together ($\mathbf{X} \approx \mathbf{WH}$). These two matrices can be described as the latent factors, represented in the rows of $\mathbf{H}$, and the scores of the observations on these factors, found in the rows of $\mathbf{W}$. This paper provides an extension of this method which allows one to specify prior knowledge of the data, including both group information and possible underlying factors. This is done by further decomposing the matrix $\mathbf{H}$ into matrices $\mathbf{A}$ and $\mathbf{S}$ multiplied together. These matrices represent an 'auxiliary' matrix and a semi-constrained factor matrix, respectively. This method and its updating criterion are proposed, followed by applications to both simulated and real-world examples.
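The decomposition described here, X ≈ W(AS) with A encoding prior group/factor information, can be sketched by applying standard Lee-Seung multiplicative updates to each subproblem while A is held fixed (our illustrative adaptation, not the authors' updating criterion):

```python
import numpy as np

rng = np.random.default_rng(5)

def nmf_with_aux(X, A, iters=500, eps=1e-9):
    """Sketch of NMF with X ~ W @ (A @ S), where A is a fixed, known
    nonnegative auxiliary matrix (e.g. group structure). W and S are
    learned with multiplicative updates on each subproblem."""
    r = A.shape[0]
    W = rng.random((X.shape[0], r))
    S = rng.random((A.shape[1], X.shape[1]))
    for _ in range(iters):
        H = A @ S                       # effective factor matrix for W-update
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        B = W @ A                       # effective basis for the S-update
        S *= (B.T @ X) / (B.T @ B @ S + eps)
    return W, S

# Toy data with an exactly recoverable nonnegative W-A-S structure.
A = rng.random((3, 4))
X = rng.random((50, 3)) @ A @ rng.random((4, 30))
W, S = nmf_with_aux(X, A)
err = np.linalg.norm(X - W @ A @ S) / np.linalg.norm(X)
print(err)                              # small relative reconstruction error
```

Holding A fixed is what injects the prior knowledge: each W-update is an ordinary NMF step with factors H = AS, and each S-update is an ordinary NMF step with basis WA.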

【16】 Visualizing the geometry of labeled high-dimensional data with spheres

Authors: Andrew D Zaharia, Anish S Potnis, Alexander Walther, Nikolaus Kriegeskorte
Affiliations: Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY; Realeyes, Riding House St, London, UK; Departments of Psychology, Neuroscience, and Electrical Engineering, Columbia University, New York, NY
Link: https://arxiv.org/abs/2107.00731
Abstract: Data visualizations summarize high-dimensional distributions in two or three dimensions. Dimensionality reduction entails a loss of information, and what is preserved differs between methods. Existing methods preserve the local or the global geometry of the points, and most techniques do not consider labels. Here we introduce "hypersphere2sphere" (H2S), a new method that aims to visualize not the points, but the relationships between the labeled distributions. H2S fits a hypersphere to each labeled set of points in a high-dimensional space and visualizes each hypersphere as a sphere in 3D (or a circle in 2D). H2S perfectly captures the geometry of up to 4 hyperspheres in 3D (or 3 in 2D), and approximates the geometry for larger numbers of distributions, matching the sizes (radii), and the pairwise separations (between-center distances) and overlaps (along the center-connection line). The resulting visualizations are robust to sampling imbalances. Leveraging labels and the sphere as the simplest geometrical primitive, H2S provides an important addition to the toolbox of visualization techniques.
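The core H2S summary, one sphere per labeled set compared via radii, center distances, and overlaps, can be approximated with a simple fitting rule (centroid center, RMS radius; the paper's actual fitting procedure may differ):

```python
import numpy as np

rng = np.random.default_rng(6)

def fit_sphere(X):
    """Summarize a labeled point cloud by a hypersphere: centroid center,
    RMS distance-to-center radius (a hypothetical fitting rule)."""
    c = X.mean(axis=0)
    r = np.sqrt(((X - c) ** 2).sum(axis=1).mean())
    return c, r

# Two labeled distributions in 100-D: B is shifted and twice as spread out.
A = rng.normal(0.0, 1.0, size=(500, 100))
B = rng.normal(0.5, 2.0, size=(500, 100))
(ca, ra), (cb, rb) = fit_sphere(A), fit_sphere(B)

sep = np.linalg.norm(ca - cb)          # between-center distance
overlap = ra + rb - sep                # > 0 means the spheres intersect
print(ra, rb, sep, overlap)
```

These three summaries per pair (radii, separation, overlap along the center-connection line) are exactly the quantities the 3D sphere rendering tries to preserve.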

【17】 Two edge-count tests and relevance analysis in k high-dimensional samples

Authors: Xiaoping Shi
Affiliations: Barber School of Arts and Sciences, University of British Columbia
Link: https://arxiv.org/abs/2107.00728
Abstract: For the task of relevance analysis, the conventional Tukey's test may be applied to the set of all pairwise comparisons. However, few studies have discussed both nonparametric k-sample comparisons and relevance analysis in high dimensions. Our aim is to capture the degree of relevance between combined samples and provide additional insights and advantages in high-dimensional k-sample comparisons. Our solution is to extend a graph-based two-sample comparison and investigate its availability for large and unequal sample sizes. We propose two distribution-free test statistics based on between-sample edge counts and measure the degree of relevance by standardized counts. The asymptotic permutation null distributions of the proposed statistics are derived, and the power gain is proved when the sample sizes are smaller than the square root of the dimension. We also discuss different edge costs in the graph to compare the parameters of the distributions. Simulation comparisons and real data analysis of tumors and images further demonstrate the value of our proposed method. Software implementing the relevance analysis is available in the R package Relevance.
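The between-sample edge-count idea can be illustrated on a k-nearest-neighbour graph: when two samples come from different distributions, fewer graph edges connect points with different labels (a generic sketch; the paper's graph construction and standardization may differ):

```python
import numpy as np

rng = np.random.default_rng(7)

def between_sample_edge_count(X, labels, k=5):
    """Count edges of a k-NN graph whose endpoints carry different
    sample labels -- the ingredient of graph-based k-sample tests."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]           # k nearest neighbours
    return sum(labels[i] != labels[j] for i in range(len(X)) for j in nn[i])

n, d = 60, 200                                  # high dimension, small n
labels = np.repeat([0, 1], n // 2)
X_null = rng.normal(size=(n, d))                # both samples from N(0, I)
X_alt = X_null.copy()
X_alt[labels == 1] += 1.0                       # mean shift in every coordinate

c_null = between_sample_edge_count(X_null, labels)
c_alt = between_sample_edge_count(X_alt, labels)
print(c_null, c_alt)        # far fewer cross-label edges under the alternative
```

A permutation null distribution for the count (shuffling labels) then turns this drop in cross-sample edges into a p-value; standardized counts play the role of the relevance measure.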

【18】 Demystifying statistical learning based on efficient influence functions 标题:基于有效影响函数的统计学习去神秘化

作者:Oliver Hines,Oliver Dukes,Karla Diaz-Ordaz,Stijn Vansteelandt 机构:Department of Medical Statistics, London School of Hygiene and Tropical Medicine, UK, Department of Applied Mathematics, Computer Science and Statistics, Ghent, University, Ghent, Belgium 链接:https://arxiv.org/abs/2107.00681 摘要:处理效果的评估和更一般的估计通常是通过参数化建模来实现的,这是不令人满意的,因为模型很可能有误。数据自适应模型构建(例如统计/机器学习)通常用于降低错误指定的风险。然而,这种方法的幼稚使用带来的估计器的偏差可能会随着样本量的减小而过慢,从而导致推理方法(包括基于bootstrap的方法)无法很好地执行。偏差的产生是因为标准数据自适应方法被调整到最小的预测误差,而不是例如估计器中的最小均方误差。由于这些策略的复杂性,这可能导致难以确认的过度可变性。基于非参数统计的结果,有针对性的学习和基于debiased的机器学习通过在非参数模型下利用估计量的有效影响函数构造估计量来克服这些问题。这些日益流行的方法通常假设有效的影响函数已经给出,或者读者已经熟悉它的推导过程。在这篇论文中,我们主要讨论有效影响函数的推导,并解释如何使用它来构造基于统计/机器学习的估计器。我们讨论了这些估计器表现良好的必要条件,并用不同的例子来说明该理论的广泛适用性。 摘要:Evaluation of treatment effects and more general estimands is typically achieved via parametric modelling, which is unsatisfactory since model misspecification is likely. Data-adaptive model building (e.g. statistical/machine learning) is commonly employed to reduce the risk of misspecification. Naive use of such methods, however, delivers estimators whose bias may shrink too slowly with sample size for inferential methods to perform well, including those based on the bootstrap. Bias arises because standard data-adaptive methods are tuned towards minimal prediction error as opposed to e.g. minimal MSE in the estimator. This may cause excess variability that is difficult to acknowledge, due to the complexity of such strategies. Building on results from non-parametric statistics, targeted learning and debiased machine learning overcome these problems by constructing estimators using the estimand's efficient influence function under the non-parametric model. These increasingly popular methodologies typically assume that the efficient influence function is given, or that the reader is familiar with its derivation. In this paper, we focus on derivation of the efficient influence function and explain how it may be used to construct statistical/machine-learning-based estimators. We discuss the requisite conditions for these estimators to perform well and use diverse examples to convey the broad applicability of the theory.
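A worked toy example (not from the paper) shows how an efficient influence function debiases a plug-in estimator. For the functional psi = E[Y]^2, the efficient influence function is phi(y) = 2*E[Y]*(y - E[Y]); the one-step estimator adds the empirical mean of the estimated EIF to the plug-in estimate.

```python
# One-step (EIF-based) correction for psi = E[Y]^2; the functional is
# chosen purely for illustration.
import random

random.seed(1)
y = [random.gauss(2.0, 1.0) for _ in range(5000)]   # true psi = 2.0**2 = 4.0

mean_y = sum(y) / len(y)
plug_in = mean_y ** 2
eif = [2 * mean_y * (yi - mean_y) for yi in y]      # estimated EIF values
one_step = plug_in + sum(eif) / len(y)
# When the nuisance (the mean) is the empirical mean, the correction term
# vanishes and one_step == plug_in.  The correction matters when the
# nuisance comes from a separate (e.g. ML) fit -- here a deliberately
# biased one, nuisance = 0.9 * mean_y:
nuisance = 0.9 * mean_y
one_step_biased = (nuisance ** 2
                   + sum(2 * nuisance * (yi - nuisance) for yi in y) / len(y))
# plug-in with the biased nuisance would be 0.81 * mean_y**2 (19% off),
# while the one-step estimate is 0.99 * mean_y**2 (1% off): the EIF
# correction reduces first-order nuisance bias to second order.
```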

【19】 Who Votes for Library Bonds? A Principal Component Exploration 标题:谁投票支持图书馆债券?一种主成分探索法

作者:Eric Jacobson 机构:Weber State University, retired 备注:26 pages, 4 tables 链接:https://arxiv.org/abs/2107.01095 摘要:以往的研究表明,选民特征与选民对税收债券的支持度之间存在一定的关系。然而,这些发现很难解释,因为这些指标之间存在高度的共线性。从一次图书馆债券选举的13个选民的人口统计指标中,提取了7个独立的主成分,占方差的95%。直接人口统计学指标与投票不一致,低社会经济地位、大学经历、女性和服务性工作的主成分与赞成票相关,而高家庭价值与反对票相关。 摘要:Previous research has shown a relationship between voter characteristics and voter support for tax bonds. These findings, however, are difficult to interpret because of the high degree of collinearity across the measures. From 13 demographic measures of voters in a library bond election, seven independent principal components were extracted which accounted for 95 percent of the variance. Whereas the direct demographic measures showed inconsistent relationships with voting, the principal components of low SES, college experience, female and service job were related to affirmative voting, while high home value was related to negative voting.
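Extracting principal components from a correlation matrix of demographic measures, as done above for the 13 voter measures, can be sketched with power iteration. The 3-variable correlation matrix below is illustrative, not the study's data.

```python
# Power iteration for the leading principal component of a small
# correlation matrix (illustrative data, not the voter study's).
def leading_pc(corr, iters=200):
    v = [1.0] * len(corr)
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # eigenvalue = Rayleigh quotient = variance carried by this component
    lam = sum(v[i] * sum(corr[i][j] * v[j] for j in range(len(v)))
              for i in range(len(v)))
    return v, lam

# Two collinear measures (r = 0.9) plus one nearly independent one:
corr = [[1.0, 0.9, 0.1],
        [0.9, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
pc, variance = leading_pc(corr)   # the first PC absorbs the collinear pair
```

This is exactly why principal components are easier to interpret than the raw collinear measures: the shared variation of correlated measures collapses into one component.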

【20】 Projected Least-Squares Quantum Process Tomography 标题:投影最小二乘量子过程层析成像

作者:Trystan Surawy-Stepney,Jonas Kahn,Richard Kueng,Madalin Guta 机构:School of Mathematical Sciences, University of Nottingham, United Kingdom, Institut de Math´ematiques de Toulouse and ANITI, Universit´e de Toulouse, France, Institute for Integrated Circuits, Johannes Kepler University Linz, Austria 备注:13+9 pages, 8 figures 链接:https://arxiv.org/abs/2107.01060 摘要:提出并研究了一种新的量子过程层析成像(QPT)方法&投影最小二乘(PLS)。简言之,PLS首先计算未知信道Choi矩阵的最小二乘估计量,然后将其投影到Choi矩阵的凸集上。我们考虑四个实验设置,包括直接QPT与Pauli特征向量作为输入和Pauli测量,和辅助辅助QPT与相互无偏基(MUB)测量。在每种情况下,我们提供了一个封闭形式的解决方案的最小二乘估计的崔矩阵。我们提出了一种新的两步方法将这些估计器投影到表示物理量子信道的矩阵集合上,并以超平面交叉投影算法的形式给出了一个快速的数值实现。我们为估计量的Frobenius和迹范数误差提供了严格的、非渐近的集中界、抽样复杂性和置信域。对于Frobenius误差,边界在Choi矩阵的秩中是线性的,对于低秩,它们将最小二乘估计的错误率提高了一个因子$d^2$,其中$d$是系统维数。我们用数值实验说明了该方法的有效性,实验结果表明PLS具有很高的计算精度和计算可处理性。 摘要:We propose and investigate a new method of quantum process tomography (QPT) which we call projected least squares (PLS). In short, PLS consists of first computing the least-squares estimator of the Choi matrix of an unknown channel, and subsequently projecting it onto the convex set of Choi matrices. We consider four experimental setups including direct QPT with Pauli eigenvectors as input and Pauli measurements, and ancilla-assisted QPT with mutually unbiased bases (MUB) measurements. In each case, we provide a closed form solution for the least-squares estimator of the Choi matrix. We propose a novel, two-step method for projecting these estimators onto the set of matrices representing physical quantum channels, and a fast numerical implementation in the form of the hyperplane intersection projection algorithm. We provide rigorous, non-asymptotic concentration bounds, sampling complexities and confidence regions for the Frobenius and trace-norm error of the estimators. For the Frobenius error, the bounds are linear in the rank of the Choi matrix, and for low ranks, they improve the error rates of the least squares estimator by a factor $d^2$, where $d$ is the system dimension. 
We illustrate the method with numerical experiments involving channels on systems with up to 7 qubits, and find that PLS has highly competitive accuracy and computational tractability.
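The projection step can be sketched in the simplest setting: projecting a real symmetric 2x2 matrix onto the PSD cone in Frobenius norm by zeroing its negative eigenvalue. A true Choi-matrix projection additionally enforces the trace-preservation constraint, which this sketch omits.

```python
# PSD projection of a symmetric 2x2 matrix via closed-form
# eigendecomposition: clip negative eigenvalues to zero.
import math

def project_psd_2x2(a, b, c):
    """Project [[a, b], [b, c]] onto PSD matrices in Frobenius norm."""
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    lams = [tr / 2 + disc, tr / 2 - disc]
    out = [[0.0, 0.0], [0.0, 0.0]]
    for lam in lams:
        if lam <= 0:                      # clip negative eigenvalues
            continue
        if abs(b) > 1e-12:                # eigenvector for lam
            v = [b, lam - a]
        else:
            v = [1.0, 0.0] if abs(a - lam) < abs(c - lam) else [0.0, 1.0]
        n = math.hypot(*v)
        v = [v[0] / n, v[1] / n]
        for i in range(2):
            for j in range(2):
                out[i][j] += lam * v[i] * v[j]
    return out

m = project_psd_2x2(1.0, 0.0, -0.5)   # eigenvalues 1 and -0.5 -> keep only 1
m2 = project_psd_2x2(0.0, 1.0, 0.0)   # eigenvalues +1 and -1
```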

【21】 DeformRS: Certifying Input Deformations with Randomized Smoothing 标题:DeformRS:使用随机平滑验证输入变形

作者:Motasem Alfarra,Adel Bibi,Naeemullah Khan,Philip H. S. Torr,Bernard Ghanem 机构:KAUST, University of Oxford 备注:First two authors contributed equally to this work 链接:https://arxiv.org/abs/2107.00996 摘要:深度神经网络易受像素位移向量场形式的输入变形和其他参数化几何变形(如平移、旋转等)的影响。当前的输入变形验证方法(i)不适用于大型输入数据集上的深度网络,或(ii)只能证明特定类别的变形,例如仅旋转。我们重新构造了一般向量场和参数化变形的随机平滑证明,并分别提出了变形器VF和变形器Par。我们的新公式适用于大型输入数据集上的大型网络。例如,DeformRS Par证明了丰富的变形,包括平移、旋转、缩放、仿射变形和其他视觉对齐的变形,例如通过离散余弦变换基参数化的变形。在MNIST、CIFAR10和ImageNet上的大量实验表明,DeformRS Par在认证精度方面优于现有的最新技术,例如,在ImageNet上设置[-10,10]度的扰动旋转时,认证精度提高了6%。 摘要:Deep neural networks are vulnerable to input deformations in the form of vector fields of pixel displacements and to other parameterized geometric deformations e.g. translations, rotations, etc. Current input deformation certification methods either (i) do not scale to deep networks on large input datasets, or (ii) can only certify a specific class of deformations, e.g. only rotations. We reformulate certification in randomized smoothing setting for both general vector field and parameterized deformations and propose DeformRS-VF and DeformRS-Par, respectively. Our new formulation scales to large networks on large input datasets. For instance, DeformRS-Par certifies rich deformations, covering translations, rotations, scaling, affine deformations, and other visually aligned deformations such as ones parameterized by Discrete-Cosine-Transform basis. Extensive experiments on MNIST, CIFAR10 and ImageNet show that DeformRS-Par outperforms existing state-of-the-art in certified accuracy, e.g. improved certified accuracy of 6% against perturbed rotations in the set [-10,10] degrees on ImageNet.
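Certification by smoothing over a parameterized deformation can be sketched on a one-parameter family (rotation angle). This is a hedged toy, not the paper's implementation: the base classifier, noise scale, and the use of the standard Gaussian-smoothing radius sigma * Phi^{-1}(p_hat) are illustrative assumptions.

```python
# Randomized smoothing over rotation angles for a toy 2-D classifier;
# classifier, sigma and sample count are illustrative choices.
import math, random
from statistics import NormalDist

def classify(x, y):
    return 1 if x > 0 else 0                    # toy base classifier

def rotate(x, y, deg):
    r = math.radians(deg)
    return x * math.cos(r) - y * math.sin(r), x * math.sin(r) + y * math.cos(r)

def smoothed_certificate(x, y, sigma=10.0, n=2000, seed=0):
    rng = random.Random(seed)
    votes = sum(classify(*rotate(x, y, rng.gauss(0, sigma))) for _ in range(n))
    p_hat = max(votes, n - votes) / n           # prob. of the majority class
    p_hat = min(p_hat, 1 - 1e-6)                # keep inv_cdf finite
    radius = sigma * NormalDist().inv_cdf(p_hat)   # certified angle (degrees)
    top = 1 if votes > n - votes else 0
    return top, radius

cls, radius = smoothed_certificate(1.0, 0.0)
```

In a rigorous certificate p_hat would be a lower confidence bound rather than the raw empirical frequency.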

【22】 Online Matching in Sparse Random Graphs: Non-Asymptotic Performances of Greedy Algorithm 标题:稀疏随机图的在线匹配:贪婪算法的非渐近性能

作者:Nathan Noiry,Flore Sentenac,Vianney Perchet 机构:T´el´ecom Paris, Palaiseau, France, CREST, ENSAE Paris, Palaiseau, France, CRITEO AI Lab, Paris, France 链接:https://arxiv.org/abs/2107.00995 摘要:受顺序预算分配问题的启发,我们研究了顶点之间的连接不是i.i.d.,而是具有固定度分布的在线匹配问题,即所谓的配置模型。我们估计最简单的算法贪婪的竞争比,通过近似一些相关的随机离散过程的连续对应,这是一个显式的偏微分方程组的解决方案。这种方法给出了估计误差的精确界,随着问题规模的增大,估计误差的概率是任意高的。特别是,它允许在不同的配置模型之间进行形式上的比较。我们还证明了,令人惊讶的是,贪心可以有更好的性能保证比排名,另一个著名的算法在线匹配,通常优于前者。 摘要:Motivated by sequential budgeted allocation problems, we investigate online matching problems where connections between vertices are not i.i.d., but they have fixed degree distributions -- the so-called configuration model. We estimate the competitive ratio of the simplest algorithm, GREEDY, by approximating some relevant stochastic discrete processes by their continuous counterparts, that are solutions of an explicit system of partial differential equations. This technique gives precise bounds on the estimation errors, with arbitrarily high probability as the problem size increases. In particular, it allows the formal comparison between different configuration models. We also prove that, quite surprisingly, GREEDY can have better performance guarantees than RANKING, another celebrated algorithm for online matching that usually outperforms the former.
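The GREEDY algorithm analysed above is simple to state: match each arriving online vertex to an arbitrary free neighbour (here: the lowest-index one), or drop it. The tiny instance below is illustrative and shows why GREEDY only achieves a competitive ratio below 1.

```python
# GREEDY online bipartite matching; the instance is illustrative.
def greedy_online_matching(neighbours, n_offline):
    """Match each arriving online vertex to its first free neighbour."""
    free = [True] * n_offline
    matched = 0
    for nbrs in neighbours:          # online vertices arrive one by one
        for u in nbrs:
            if free[u]:
                free[u] = False
                matched += 1
                break                # matched; next arrival
    return matched

# Matching the first arrival to offline vertex 0 starves the second
# arrival, while the offline optimum matches both:
size = greedy_online_matching([[0, 1], [0]], 2)    # GREEDY gets 1, optimum is 2
full = greedy_online_matching([[0], [1], [2]], 3)  # here GREEDY is optimal
```

The paper's analysis tracks the evolution of such matchings on configuration-model graphs via a fluid (ODE/PDE) limit of the discrete process.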

【23】 Partial Identification and Inference in Duration Models with Endogenous Censoring 标题:具有内生删失的久期模型的部分辨识和推断

作者:Shosei Sakaguchi 链接:https://arxiv.org/abs/2107.00928 摘要:研究了具有内生删失的变换模型的辨识与推理问题。许多持续时间模型,如加速失效时间模型、比例风险模型和混合比例风险模型,都可以看作是转换模型。我们允许对持续时间结果的审查与观察到的协变量和未观察到的异质性任意相关。我们对未观测异质性的变换函数和分布函数都没有参数限制。在此背景下,我们发展了回归参数和变换函数的界,其特征是包含U统计量的条件矩不等式。通过构造条件矩不等式模型的一种推理方法,给出了相应的推理方法。我们利用斯坦福心脏移植研究的数据,应用所提出的推断方法来评估心脏移植对患者生存时间的影响。 摘要:This paper studies identification and inference in transformation models with endogenous censoring. Many kinds of duration models, such as the accelerated failure time model, proportional hazard model, and mixed proportional hazard model, can be viewed as transformation models. We allow the censoring of a duration outcome to be arbitrarily correlated with observed covariates and unobserved heterogeneity. We impose no parametric restrictions on either the transformation function or the distribution function of the unobserved heterogeneity. In this setting, we develop bounds on the regression parameters and the transformation function, which are characterized by conditional moment inequalities involving U-statistics. We provide inference methods for them by constructing an inference approach for conditional moment inequality models in which the sample analogs of moments are U-statistics. We apply the proposed inference methods to evaluate the effect of heart transplants on patients' survival time using data from the Stanford Heart Transplant Study.

【24】 Reconsidering Dependency Networks from an Information Geometry Perspective 标题:从信息几何角度重新审视依存网络

作者:Kazuya Takabatake,Shotaro Akaho 机构:Human Informatics and Interaction Research Institute, National Institute of Advanced Industrial Science and Technology, Umezono, Tsukuba, Ibaraki, Japan 备注:28 pages, 7 figures 链接:https://arxiv.org/abs/2107.00871 摘要:依赖网络(Heckerman et al.,2000)是包含大量变量的系统的潜在概率图形模型。与贝叶斯网络一样,依赖网络的结构由一个有向图表示,每个节点都有一个条件概率表。学习和推理在单个节点上局部实现;因此,即使有大量的变量,计算仍然是容易处理的。然而,依赖网络的学习分布是马尔可夫链的平稳分布,称为伪吉布斯抽样,没有封闭形式的表达式。这种技术上的缺陷阻碍了依赖网络的发展。在本文中,我们考虑每个节点的一个流形。然后,我们可以将伪Gibbs采样解释为这些流形上的迭代m-投影。这种解释为伪Gibbs抽样的平稳分布在分布空间中的位置提供了一个理论界。此外,这种解释涉及结构和参数学习算法作为优化问题。此外,我们还对依赖网络和贝叶斯网络进行了实验比较。结果表明,依赖网络和贝叶斯网络在学习分布的准确性方面具有大致相同的性能。结果还表明,依赖网络的学习速度比贝叶斯网络快得多。 摘要:Dependency networks (Heckerman et al., 2000) are potential probabilistic graphical models for systems comprising a large number of variables. Like Bayesian networks, the structure of a dependency network is represented by a directed graph, and each node has a conditional probability table. Learning and inference are realized locally on individual nodes; therefore, computation remains tractable even with a large number of variables. However, the dependency network's learned distribution is the stationary distribution of a Markov chain called pseudo-Gibbs sampling and has no closed-form expressions. This technical disadvantage has impeded the development of dependency networks. In this paper, we consider a certain manifold for each node. Then, we can interpret pseudo-Gibbs sampling as iterative m-projections onto these manifolds. This interpretation provides a theoretical bound for the location where the stationary distribution of pseudo-Gibbs sampling exists in distribution space. Furthermore, this interpretation involves structure and parameter learning algorithms as optimization problems. In addition, we compare dependency and Bayesian networks experimentally. The results demonstrate that the dependency network and the Bayesian network have roughly the same performance in terms of the accuracy of their learned distributions. The results also show that the dependency network can learn much faster than the Bayesian network.
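Pseudo-Gibbs sampling, the Markov chain whose stationary distribution defines a dependency network's learned distribution, can be sketched on a toy two-node binary network: repeatedly resample each node from its local conditional table and read the stationary distribution off the long-run state frequencies. The conditional tables below are illustrative.

```python
# Pseudo-Gibbs sampling on a toy two-node binary dependency network;
# the conditional probability tables are illustrative.
import random

# P(x1 = 1 | x2) and P(x2 = 1 | x1), indexed by the conditioning value:
p_x1_given_x2 = {0: 0.2, 1: 0.7}
p_x2_given_x1 = {0: 0.4, 1: 0.9}

def pseudo_gibbs(steps=100_000, seed=0):
    rng = random.Random(seed)
    x1, x2 = 0, 0
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for _ in range(steps):                 # sweep the nodes in a fixed order
        x1 = 1 if rng.random() < p_x1_given_x2[x2] else 0
        x2 = 1 if rng.random() < p_x2_given_x1[x1] else 0
        counts[(x1, x2)] += 1
    return {state: c / steps for state, c in counts.items()}

pi = pseudo_gibbs()   # empirical stationary distribution over the 4 states
```

For these tables the chain's stationary marginals can be computed by hand (P(x2 = 1) = 2/3, P(x1 = 1) = 8/15), which the long-run frequencies approximate; in general, as the abstract notes, no closed form exists.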

【25】 Quantifying Availability and Discovery in Recommender Systems via Stochastic Reachability 标题:利用随机可达性量化推荐系统中的可用性和发现

作者:Mihaela Curmei,Sarah Dean,Benjamin Recht 机构:Department of EECS, University of California, Berkeley 备注:to appear ICML 2021 链接:https://arxiv.org/abs/2107.00833 摘要:在这项工作中,我们考虑交互式推荐系统中的偏好模型如何决定内容的可用性和用户的发现机会。我们提出了一个基于随机可达性的评估过程,以量化在一组允许的策略修改下,向用户推荐目标内容的最大概率。这个框架允许我们在对用户行为进行最小假设的情况下计算推荐可能性的上界。随机可达性可用于检测内容可用性中的偏差,并诊断授予用户的发现机会中的限制。我们证明了对于各种实际情况,这个度量可以作为一个凸规划来有效地计算,并且进一步论证了可达性与精确性并不存在内在的矛盾。我们演示了在显式和隐式评分的大数据集上训练的推荐算法的评估。我们的结果说明了偏好模型、选择规则和用户干预是如何影响可达性的,以及这些影响是如何不均匀分布的。 摘要:In this work, we consider how preference models in interactive recommendation systems determine the availability of content and users' opportunities for discovery. We propose an evaluation procedure based on stochastic reachability to quantify the maximum probability of recommending a target piece of content to a user for a set of allowable strategic modifications. This framework allows us to compute an upper bound on the likelihood of recommendation with minimal assumptions about user behavior. Stochastic reachability can be used to detect biases in the availability of content and diagnose limitations in the opportunities for discovery granted to users. We show that this metric can be computed efficiently as a convex program for a variety of practical settings, and further argue that reachability is not inherently at odds with accuracy. We demonstrate evaluations of recommendation algorithms trained on large datasets of explicit and implicit ratings. Our results illustrate how preference models, selection rules, and user interventions impact reachability and how these effects can be distributed unevenly.
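A hedged toy version of the reachability question (not the paper's convex program): with an assumed softmax preference model over linear item scores, what is the maximum probability of recommending a target item when the user may set one mutable rating anywhere in [0, 5]? All model choices below are illustrative assumptions.

```python
# Toy reachability: maximize the softmax probability of a target item
# over a one-dimensional set of allowed user modifications.  The linear
# score model and rating range are illustrative assumptions.
import math

def softmax(scores):
    m = max(scores)
    ex = [math.exp(s - m) for s in scores]
    z = sum(ex)
    return [e / z for e in ex]

def max_reachability(base_scores, target, mutable_weight, grid=101):
    """Max softmax prob. of `target` as the mutable rating sweeps [0, 5]."""
    best = 0.0
    for k in range(grid):
        r = 5.0 * k / (grid - 1)
        scores = list(base_scores)
        scores[target] += mutable_weight * r      # assumed linear score model
        best = max(best, softmax(scores)[target])
    return best

base = [1.0, 2.0, 0.5]          # current item scores for one toy user
rho = max_reachability(base, target=2, mutable_weight=0.6)
```

A small maximum probability for some item would flag limited availability of that item for this user, which is the diagnostic use the abstract describes.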

【26】 Decision tree heuristics can fail, even in the smoothed setting 标题:即使在平滑设置中,决策树启发式也可能失败

作者:Guy Blanc,Jane Lange,Mingda Qiao,Li-Yang Tan 机构:Stanford, MIT 备注:To appear in RANDOM 2021 链接:https://arxiv.org/abs/2107.00819 摘要:贪婪决策树学习启发式算法是机器学习实践的支柱,但其经验成功的理论依据仍然难以捉摸。事实上,人们早就知道有一些简单的目标函数会严重失败(Kearns和Mansour,STOC 1996)。Brutzkus、Daniely和Malach(COLT 2020)最近的工作认为平滑分析模型是解决这种脱节的可能途径。在平滑设置和目标$f$为$k$-juntas的情况下,他们表明这些启发式算法成功地学习了$f$和深度-$k$决策树假设。他们推测,对于深度为$k$决策树的目标,同样的保证更为普遍。我们为这个猜想提供了一个反例:我们构造了深度为$k$决策树的目标,并表明即使在平滑设置下,这些启发式算法在获得高精度之前也会构建深度为$2^{\Omega(k)}$的树。我们还表明,Brutzkus等人的保证不能扩展到不可知论的设置:有非常接近$k$-juntas的目标,对于这些目标,这些启发式算法在获得高精度之前构建深度为$2^{\Omega(k)}$的树。 摘要:Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996). Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possible avenue towards resolving this disconnect. Within the smoothed setting and for targets $f$ that are $k$-juntas, they showed that these heuristics successfully learn $f$ with depth-$k$ decision tree hypotheses. They conjectured that the same guarantee holds more generally for targets that are depth-$k$ decision trees. We provide a counterexample to this conjecture: we construct targets that are depth-$k$ decision trees and show that even in the smoothed setting, these heuristics build trees of depth $2^{\Omega(k)}$ before achieving high accuracy. We also show that the guarantees of Brutzkus et al. cannot extend to the agnostic setting: there are targets that are very close to $k$-juntas, for which these heuristics build trees of depth $2^{\Omega(k)}$ before achieving high accuracy.

【27】 Few-shot Learning for Unsupervised Feature Selection 标题:用于无监督特征选择的Few-Shot学习

作者:Atsutoshi Kumagai,Tomoharu Iwata,Yasuhiro Fujiwara 机构:NTT Computer and Data Science Laboratories, NTT Communication Science Laboratories 备注:20 pages 链接:https://arxiv.org/abs/2107.00816 摘要:提出了一种无监督特征选择的多镜头学习方法,即在未标记数据中选择相关特征子集的任务。现有的方法通常需要很多实例来进行特征选择。然而,在实践中往往没有足够的实例。该方法通过对多个源任务中的未标记实例进行训练,在给定几个未标记的目标实例的情况下,选择目标任务中相关特征的子集。我们的模型由特征选择器和解码器组成。特征选择器以若干未标记实例作为输入输出相关特征的子集,使得解码器可以从所选实例重构未标记实例的原始特征。特征选择器使用具体的随机变量通过梯度下降来选择特征。为了将特定于任务的属性从几个未标记的实例编码到模型中,具体的随机变量和译码器采用以几个未标记实例为输入的置换不变神经网络进行建模。我们的模型是通过最小化期望的测试重建误差来训练的,给定了一些在源任务中使用数据集计算的未标记实例。实验结果表明,该方法优于现有的特征选择方法。 摘要:We propose a few-shot learning method for unsupervised feature selection, which is a task to select a subset of relevant features in unlabeled data. Existing methods usually require many instances for feature selection. However, sufficient instances are often unavailable in practice. The proposed method can select a subset of relevant features in a target task given a few unlabeled target instances by training with unlabeled instances in multiple source tasks. Our model consists of a feature selector and decoder. The feature selector outputs a subset of relevant features taking a few unlabeled instances as input such that the decoder can reconstruct the original features of unseen instances from the selected ones. The feature selector uses the Concrete random variables to select features via gradient descent. To encode task-specific properties from a few unlabeled instances to the model, the Concrete random variables and decoder are modeled using permutation-invariant neural networks that take a few unlabeled instances as input. Our model is trained by minimizing the expected test reconstruction error given a few unlabeled instances that is calculated with datasets in source tasks. We experimentally demonstrate that the proposed method outperforms existing feature selection methods.
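The Concrete (Gumbel-softmax) random variables used by the feature selector above can be sketched directly: a relaxed one-hot sample over features whose temperature controls how close it is to a discrete choice. The logits and temperature below are illustrative.

```python
# Concrete (Gumbel-softmax) relaxation for differentiable feature
# selection; logits and temperature are illustrative values.
import math, random

def sample_concrete(logits, temperature, rng):
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    z = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    m = max(z)
    ex = [math.exp(v - m) for v in z]
    s = sum(ex)
    return [e / s for e in ex]          # relaxed one-hot over features

rng = random.Random(0)
logits = [2.0, 0.0, -1.0]               # feature 0 is strongly preferred
hot = [sample_concrete(logits, temperature=0.1, rng=rng) for _ in range(200)]
# at low temperature the samples are nearly discrete, so the argmax
# behaves like a draw from softmax(logits):
picked_0 = sum(1 for h in hot if max(range(3), key=h.__getitem__) == 0)
```

Because the sample is a smooth function of the logits, the selection can be trained by gradient descent, as the abstract describes.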

【28】 The Spotlight: A General Method for Discovering Systematic Errors in Deep Learning Models 标题:聚光灯:发现深度学习模型系统误差的通用方法

作者:Greg d'Eon,Jason d'Eon,James R. Wright,Kevin Leyton-Brown 机构:University of Alberta, Canada, University of British Columbia, Canada 链接:https://arxiv.org/abs/2107.00758 摘要:有监督学习模型往往会在数据的罕见子集上产生系统性错误。然而,这种系统性错误可能很难识别,因为只有当敏感组已知并明确标记时,模型性能才能在敏感组中分解。本文介绍了一种发现系统误差的方法,我们称之为聚光灯法。关键思想是,相似的输入往往在神经网络的最后一个隐藏层中有相似的表示。我们通过在这个表示空间上“聚焦”来利用这个结构,找到模型性能较差的连续区域。我们发现,聚光灯表面语义上有意义的薄弱环节,在各种各样的模型架构,包括图像分类器,语言模型和推荐系统。 摘要:Supervised learning models often make systematic errors on rare subsets of the data. However, such systematic errors can be difficult to identify, as model performance can only be broken down across sensitive groups when these groups are known and explicitly labelled. This paper introduces a method for discovering systematic errors, which we call the spotlight. The key idea is that similar inputs tend to have similar representations in the final hidden layer of a neural network. We leverage this structure by "shining a spotlight" on this representation space to find contiguous regions where the model performs poorly. We show that the spotlight surfaces semantically meaningful areas of weakness in a wide variety of model architectures, including image classifiers, language models, and recommender systems.
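The spotlight idea can be sketched in one dimension: among final-layer embeddings, find a center whose Gaussian-weighted neighbourhood has the highest average loss, i.e. a contiguous region where the model does badly. The grid search over candidate centers below is a simplification of the paper's optimization.

```python
# Toy "spotlight": Gaussian-weighted average loss around each candidate
# center in embedding space; grid search stands in for the paper's
# gradient-based optimization.
import math

def spotlight(embeddings, losses, width=1.0):
    best_center, best_loss = None, -1.0
    for c in embeddings:                       # candidate spotlight centers
        w = [math.exp(-sum((a - b) ** 2 for a, b in zip(e, c)) / (2 * width ** 2))
             for e in embeddings]
        avg = sum(wi * li for wi, li in zip(w, losses)) / sum(w)
        if avg > best_loss:
            best_center, best_loss = c, avg
    return best_center, best_loss

# 1-D embeddings: the cluster around 10 has systematically high loss.
emb = [(0.0,), (0.5,), (1.0,), (10.0,), (10.4,), (9.8,)]
loss = [0.1, 0.2, 0.1, 2.0, 1.8, 2.2]
center, region_loss = spotlight(emb, loss)     # lands on the high-loss cluster
```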

【29】 q-Paths: Generalizing the Geometric Annealing Path using Power Means 标题:Q-路径:用幂平均推广几何退火路径

作者:Vaden Masrani,Rob Brekelmans,Thang Bui,Frank Nielsen,Aram Galstyan,Greg Ver Steeg,Frank Wood 机构:University of British Columbia,USC Information Sciences Institute, University of Sydney,Sony CSL,MILA, ∗Equal Contribution 备注:arXiv admin note: text overlap with arXiv:2012.07823 链接:https://arxiv.org/abs/2107.00745 摘要:许多常见的机器学习方法都涉及到几何退火路径,即用几何平均值构造的两个感兴趣分布之间的中间密度序列。虽然矩平均路径等替代方法在某些情况下表现出性能提升,但它们的实际适用性仍然受到指数族端点假设和缺乏封闭形式能量函数的限制。在这项工作中,我们引入了$q$-路,这是一个由广义平均概念导出的路族,它包括几何和算术混合作为特例,并且允许一个简单的封闭形式,它包含了非扩展热力学中的变形对数函数。根据之前对几何路径的分析,我们将我们的$q$-路径解释为对应于$q$-指数分布族,并将中间密度的变分表示为最小化到端点的$\alpha$-发散的混合物。我们表明,小偏差远离几何路径产生经验收益贝叶斯推理使用序贯蒙特卡罗和生成模型评估使用退火重要性抽样。 摘要:Many common machine learning methods involve the geometric annealing path, a sequence of intermediate densities between two distributions of interest constructed using the geometric average. While alternatives such as the moment-averaging path have demonstrated performance gains in some settings, their practical applicability remains limited by exponential family endpoint assumptions and a lack of closed form energy function. In this work, we introduce $q$-paths, a family of paths which is derived from a generalized notion of the mean, includes the geometric and arithmetic mixtures as special cases, and admits a simple closed form involving the deformed logarithm function from nonextensive thermodynamics. Following previous analysis of the geometric path, we interpret our $q$-paths as corresponding to a $q$-exponential family of distributions, and provide a variational representation of intermediate densities as minimizing a mixture of $\alpha$-divergences to the endpoints. We show that small deviations away from the geometric path yield empirical gains for Bayesian inference using Sequential Monte Carlo and generative model evaluation using Annealed Importance Sampling.
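The q-path can be written pointwise as a power-mean interpolation of the two (unnormalized) endpoint densities, recovering the arithmetic mixture at q = 0 and the geometric path in the limit q -> 1. The sketch below evaluates this closed form at a single point; the probability values are illustrative.

```python
# Pointwise q-path: a power-mean interpolation between two densities,
#   pi_beta(x) proportional to
#   [(1-beta)*p0(x)**(1-q) + beta*p1(x)**(1-q)] ** (1/(1-q)),
# with the geometric path recovered as q -> 1.
def q_path(p0, p1, beta, q):
    if abs(q - 1.0) < 1e-12:                  # geometric-mean limit
        return p0 ** (1 - beta) * p1 ** beta
    a = (1 - beta) * p0 ** (1 - q) + beta * p1 ** (1 - q)
    return a ** (1.0 / (1 - q))

p0, p1, beta = 0.2, 0.8, 0.5
arith = q_path(p0, p1, beta, q=0.0)       # arithmetic mixture: 0.5
geom = q_path(p0, p1, beta, q=1.0)        # geometric mean: 0.4
near_geom = q_path(p0, p1, beta, q=0.999) # approaches the geometric value
```

Small deviations of q away from 1 thus perturb the geometric path continuously, which is how the abstract's "small deviations away from the geometric path" are parameterized.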

【30】 Gap-Dependent Bounds for Two-Player Markov Games 标题:两人马尔可夫对策的间隙相关界

作者:Zehao Dou,Zhuoran Yang,Zhaoran Wang,Simon S. Du 机构:Yale University, Princeton University, Northwestern University, University of Washington 备注:34 pages 链接:https://arxiv.org/abs/2107.00685 摘要:Q-learning作为强化学习领域最流行的方法之一,受到越来越多的关注。近年来,关于Q-learning类算法在不同环境下的遗憾界的理论研究越来越多。本文分析了基于2人回合的随机Markov对策(2-TBSG)在进行Nash Q-学习算法时的累积后悔,并提出了在情景表环境下的第一个缺口相关对数上界。这个界限与理论下限仅在对数项以内相匹配。此外,我们将此结论推广到无限视界的折扣对策情形,并提出一个类似的缺口相关对数遗憾界。同时,在线性MDP假设下,我们得到了2-TBSG在集中和独立两种情况下的另一个对数遗憾。 摘要:As one of the most popular methods in the field of reinforcement learning, Q-learning has received increasing attention. Recently, there have been more theoretical works on the regret bound of algorithms that belong to the Q-learning class in different settings. In this paper, we analyze the cumulative regret when conducting Nash Q-learning algorithm on 2-player turn-based stochastic Markov games (2-TBSG), and propose the very first gap dependent logarithmic upper bounds in the episodic tabular setting. This bound matches the theoretical lower bound only up to a logarithmic term. Furthermore, we extend the conclusion to the discounted game setting with infinite horizon and propose a similar gap dependent logarithmic regret bound. Also, under the linear MDP assumption, we obtain another logarithmic regret for 2-TBSG, in both centralized and independent settings.

【31】 Shared Data and Algorithms for Deep Learning in Fundamental Physics 标题:用于基础物理深度学习的共享数据和算法

作者:Lisa Benato,Erik Buhmann,Martin Erdmann,Peter Fackeldey,Jonas Glombitza,Nikolai Hartmann,Gregor Kasieczka,William Korcari,Thomas Kuhr,Jan Steinheimer,Horst Stöcker,Tilman Plehn,Kai Zhou 机构:Received: date Accepted: date 备注:13 pages, 5 figures, 5 tables 链接:https://arxiv.org/abs/2107.00656 摘要:我们将介绍一组来自基础物理研究的数据集——包括粒子物理、天体粒子物理、强子和核物理——用于有监督的机器学习研究。这些数据集,包括强子顶夸克、宇宙射线诱导的空气簇射、强子物质的相变和发生器级的历史,被公开,以简化基础物理中跨学科机器学习和转移学习的未来工作。基于这些数据,我们提出了一种简单而灵活的基于图的神经网络结构,可以很容易地应用于这些领域的广泛的有监督学习任务。我们表明,我们的方法在所有数据集上都达到接近最先进的专用方法的性能。为了简化对各种问题的适应,我们提供了易于遵循的说明,说明如何构造与基础物理相关的数据结构的基于图形的表示,并为其中几个问题提供代码实现。文中还给出了我们提出的方法和所有参考算法的实现。 摘要:We introduce a collection of datasets from fundamental physics research -- including particle physics, astroparticle physics, and hadron- and nuclear physics -- for supervised machine learning studies. These datasets, containing hadronic top quarks, cosmic-ray induced air showers, phase transitions in hadronic matter, and generator-level histories, are made public to simplify future work on cross-disciplinary machine learning and transfer learning in fundamental physics. Based on these data, we present a simple yet flexible graph-based neural network architecture that can easily be applied to a wide range of supervised learning tasks in these domains. We show that our approach reaches performance close to state-of-the-art dedicated methods on all datasets. To simplify adaptation for various problems, we provide easy-to-follow instructions on how graph-based representations of data structures, relevant for fundamental physics, can be constructed and provide code implementations for several of them. Implementations are also provided for our proposed method and all reference algorithms.

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2021-07-05,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 arXiv每日学术速递 微信公众号。


