前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >统计学学术速递[7.6]

统计学学术速递[7.6]

作者头像
公众号-arXiv每日学术速递
发布2021-07-27 10:24:10
1.5K0
发布2021-07-27 10:24:10
举报
文章被收录于专栏:arXiv每日学术速递

stat统计学,共计64篇

【1】 Sufficient principal component regression for pattern discovery in transcriptomic data 标题:用于转录数据模式发现的充分主成分回归

作者:Lei Ding,Gabriel E. Zentner,Daniel J. McDonald 机构:Department of Statistics, Indiana University, Bloomington, IN USA, Department of Biology, Melvin and Bren Simon Comprehensive Cancer Center, University of British Columbia, Vancouver, BC Canada 备注:26 pages, 9 figures, 9 tables 链接:https://arxiv.org/abs/2107.02150 摘要:转录物丰度的全局测量方法,如微阵列和RNA-seq,产生的数据集中测量的特征数量远远超过观察的数量。因此,从这些数据中提取具有生物学意义和实验可处理的见解需要高维预测。现有的稀疏线性方法已经取得了惊人的成功,但仍存在一些重要问题。这些方法可能无法选择正确的特征,相对于非稀疏替代方案预测能力较差,或者忽略任何未知的特征分组结构。我们提出了一种称为SuffPCR的方法,该方法在包括回归和分类在内的高维任务中产生了改进的预测,特别是在具有相关特征的典型组学背景下。首先估计稀疏主成分,然后在恢复的子空间上估计线性模型。由于所估计的子空间在特征中是稀疏的,因此所得到的预测将仅依赖于一小部分基因。SuffPCR可以很好地处理各种模拟和实验转录组数据,在满足模型假设的情况下,几乎可以达到最佳效果。我们还证明了接近最优的理论保证。 摘要:Methods for global measurement of transcript abundance such as microarrays and RNA-seq generate datasets in which the number of measured features far exceeds the number of observations. Extracting biologically meaningful and experimentally tractable insights from such data therefore requires high-dimensional prediction. Existing sparse linear approaches to this challenge have been stunningly successful, but some important issues remain. These methods can fail to select the correct features, predict poorly relative to non-sparse alternatives, or ignore any unknown grouping structures for the features. We propose a method called SuffPCR that yields improved predictions in high-dimensional tasks including regression and classification, especially in the typical context of omics with correlated features. SuffPCR first estimates sparse principal components and then estimates a linear model on the recovered subspace. Because the estimated subspace is sparse in the features, the resulting predictions will depend on only a small subset of genes. SuffPCR works well on a variety of simulated and experimental transcriptomic data, performing nearly optimally when the model assumptions are satisfied. We also demonstrate near-optimal theoretical guarantees.

【2】 Multivariate functional group sparse regression: functional predictor selection 标题:多元函数组稀疏回归:函数预测因子的选择

作者:Ali Mahzarnia,Jun Song 机构:Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC , USA 备注:The R package that is developed for this paper is available at GitHub. See this https URL 链接:https://arxiv.org/abs/2107.02146 摘要:本文在高维多元函数数据集下,提出了一类标量对函数回归问题中同时进行函数预测器选择和光滑函数系数估计的方法。特别地,我们发展了在无限维广义Hilbert空间下函数群稀疏回归的两种方法。在无限维Hilbert空间下,证明了算法的收敛性以及估计和选择的一致性(oracle性质)。仿真研究表明,该方法对函数系数的选择和估计都是有效的。功能磁共振成像(fMRI)的应用揭示了人脑中与ADHD和IQ相关的区域。 摘要:In this paper, we propose methods for functional predictor selection and the estimation of smooth functional coefficients simultaneously in a scalar-on-function regression problem under high-dimensional multivariate functional data setting. In particular, we develop two methods for functional group-sparse regression under a generic Hilbert space of infinite dimension. We show the convergence of algorithms and the consistency of the estimation and the selection (oracle property) under infinite-dimensional Hilbert spaces. Simulation studies show the effectiveness of the methods in both the selection and the estimation of functional coefficients. The applications to the functional magnetic resonance imaging (fMRI) reveal the regions of the human brain related to ADHD and IQ.

【3】 Tiled Squeeze-and-Excite: Channel Attention With Local Spatial Context 标题:平铺挤兑:局部空间背景下的渠道关注

作者:Niv Vosco,Alon Shenkler,Mark Grobman 机构:Hailo 链接:https://arxiv.org/abs/2107.02145 摘要:在这篇论文中,我们研究了通道注意所需的空间上下文的数量。为此,我们研究了流行的挤压和激发(SE)块,这是一种简单而轻量级的通道注意机制。SE块及其众多变体通常使用全局平均池(GAP)为每个通道创建一个描述符。在这里,我们实证分析了有效的通道注意所需的空间上下文的数量,发现在原始图像的七行或七列的顺序上有限的局部上下文足以匹配全局上下文的性能。我们提出了tiled-squeeze-and-excite(TSE),它是一个构建SE-like块的框架,每个通道使用多个描述符,每个描述符仅基于本地上下文。我们进一步证明了TSE是SE块的一个替代品,可以在现有SE网络中使用,而无需重新训练。这意味着本地上下文描述符彼此相似,并且与全局上下文描述符相似。最后,我们证明了TSE对于将SE网络部署到数据流AI加速器具有重要的实际意义,因为它们减少了管道缓冲需求。例如,与SE(从50M到4.77M)相比,使用TSE可将EfficientdD2中的激活管道缓冲量减少90%,而不会损失精度。我们的代码和预先训练的模型将公开提供。 摘要:In this paper we investigate the amount of spatial context required for channel attention. To this end we study the popular squeeze-and-excite (SE) block which is a simple and lightweight channel attention mechanism. SE blocks and its numerous variants commonly use global average pooling (GAP) to create a single descriptor for each channel. Here, we empirically analyze the amount of spatial context needed for effective channel attention and find that limited localcontext on the order of seven rows or columns of the original image is sufficient to match the performance of global context. We propose tiled squeeze-and-excite (TSE), which is a framework for building SE-like blocks that employ several descriptors per channel, with each descriptor based on local context only. We further show that TSE is a drop-in replacement for the SE block and can be used in existing SE networks without re-training. This implies that local context descriptors are similar both to each other and to the global context descriptor. Finally, we show that TSE has important practical implications for deployment of SE-networks to dataflow AI accelerators due to their reduced pipeline buffering requirements. For example, using TSE reduces the amount of activation pipeline buffering in EfficientDetD2 by 90% compared to SE (from 50M to 4.77M) without loss of accuracy. Our code and pre-trained models will be publicly available.

【4】 Anisotropic spectral cut-off estimation under multiplicative measurement errors 标题:乘性测量误差下的各向异性谱截止估计

作者:Sergio Brenner Miguel 机构:Ruprecht-Karls-Universität Heidelberg 备注:22 pages, 5 figures, 2 tables 链接:https://arxiv.org/abs/2107.02120 摘要:研究了在R+^d支持下,基于具有乘性测量误差的i.i.d.样本的未知密度f的非参数估计。所提出的完全数据驱动程序是基于密度f的Mellin变换的估计和Mellin变换逆的谱截止正则化。即将到来的偏差-方差权衡是通过数据驱动的各向异性选择截止参数来处理的。为了讨论偏倚项,我们考虑了Melin Sobolev空间,通过它的梅林变换的衰减来刻画未知密度F的正则性。此外,我们还证明了谱截止密度估计在Mellin-Sobolev空间上的minimax最优性。 摘要:We study the non-parametric estimation of an unknown density f with support on R+^d based on an i.i.d. sample with multiplicative measurement errors. The proposed fully-data driven procedure is based on the estimation of the Mellin transform of the density f and a regularisation of the inverse of the Mellin transform by a spectral cut-off. The upcoming bias-variance trade-off is dealt with by a data-driven anisotropic choice of the cut-off parameter. In order to discuss the bias term, we consider the Mellin-Sobolev spaces which characterize the regularity of the unknown density f through the decay of its Mellin transform. Additionally, we show minimax-optimality over Mellin-Sobolev spaces of the spectral cut-off density estimator.

【5】 On the symmetric and skew-symmetric K-distributions 标题:关于对称和斜对称K-分布

作者:Stylianos E. Trevlakis,Nestor Chatzidiamantis,George K. Karagiannidis 机构:CHATZIDIAMANTIS,‡ 链接:https://arxiv.org/abs/2107.02092 摘要:本文介绍了四参数对称K分布(SKD)和skew-SKD模型,通过高精度的简化表达式描述机器学习、贝叶斯分析等领域的复杂动力学。给出了概率密度函数、累积分布函数、矩和累积量的计算公式。最后,对SKD随机变量进行了序统计分析,得到了两个SKD随机变量的乘积和比值的分布。 摘要:This paper introduces the four-parameter symmetric K-distribution (SKD) and the skew-SKD as models for describing the complex dynamics of machine learning, Bayesian analysis and other fields through simplified expressions with high accuracy. Formulas for the probability density function, cumulative distribution function, moments, and cumulants are given. Finally, an order statistics analysis is provided as well as the distributions of the product and ratio of two SKD random variables are derived.

【6】 Nonparametric quantile regression for time series with replicated observations and its application to climate data 标题:重复观测时间序列的非参数分位数回归及其在气候数据中的应用

作者:Soudeep Deb,Kaushik Jana 机构:Indian Institute of Management Bangalore, Bannerghatta Main Rd, Bangalore, KA , India., and, Imperial College London, Department of Mathematics, Queen’s Gate, London SW,AZ, United Kingdom 链接:https://arxiv.org/abs/2107.02091 摘要:本文提出了一种时间序列回归模型条件分位数的无模型非参数估计,其中协变量向量对不同的响应值重复多次。这类数据在气候研究中大量存在。为了解决这些问题,我们提出的方法利用了数据的重复性,改进了传统分位数回归的限制性线性模型结构。在一个非常一般的框架下,导出了模型均值和方差函数非参数估计的相关渐近理论。我们提供了一个详细的仿真研究,它清楚地表明了该方法比其他基准模型在效率上的提高,特别是当真实的数据生成过程包含非线性均值函数和异方差模式与时间相关的协变量时。当注意力集中在感兴趣的变量的较高分位数上时,非参数方法的预测精度显著高于其他方法。该方法的有效性通过两个气候学应用得到了验证,一个是已知的热带气旋风速数据,另一个是大气污染数据。 摘要:This paper proposes a model-free nonparametric estimator of conditional quantile of a time series regression model where the covariate vector is repeated many times for different values of the response. This type of data is abound in climate studies. To tackle such problems, our proposed method exploits the replicated nature of the data and improves on restrictive linear model structure of conventional quantile regression. Relevant asymptotic theory for the nonparametric estimators of the mean and variance function of the model are derived under a very general framework. We provide a detailed simulation study which clearly demonstrates the gain in efficiency of the proposed method over other benchmark models, especially when the true data generating process entails nonlinear mean function and heteroskedastic pattern with time dependent covariates. The predictive accuracy of the non-parametric method is remarkably high compared to other methods when attention is on the higher quantiles of the variable of interest. Usefulness of the proposed method is then illustrated with two climatological applications, one with a well-known tropical cyclone wind-speed data and the other with an air pollution data.

【7】 Analyzing Relevance Vector Machines using a single penalty approach 标题:用单罚方法分析相关向量机

作者:Anand Dixit,Vivekananda Roy 机构:Department of Statistics, Iowa State University 备注:25 pages 链接:https://arxiv.org/abs/2107.02085 摘要:相关向量机(RVM)是一种流行的稀疏贝叶斯学习模型,通常用于预测。最近的研究表明,在RVM中,对多个惩罚参数假设不恰当的先验可能导致不恰当的后验概率。目前文献中关于RVM后验性的充分条件不允许对多个惩罚参数有不适当的先验。在本文中,我们提出了一个单一惩罚相关向量机(SPRVM)模型,其中多个惩罚参数被替换为一个惩罚,我们认为半贝叶斯方法拟合SPRVM。SPRVM的后验适当性的充要条件比RVM更为宽泛,并考虑了罚参数上的几个不适当的先验。此外,我们还证明了用于分析SPRVM模型的Gibbs采样器的几何遍历性,从而可以估计与后验预测分布均值的montecarlo估计相关的渐近标准误差。这种蒙特卡罗标准误差不能在RVM的情况下计算,因为用于分析RVM的吉布斯采样器的收敛速度是未知的。通过对三个实际数据集的分析,比较了RVM和SPRVM的预测性能。 摘要:Relevance vector machine (RVM) is a popular sparse Bayesian learning model typically used for prediction. Recently it has been shown that improper priors assumed on multiple penalty parameters in RVM may lead to an improper posterior. Currently in the literature, the sufficient conditions for posterior propriety of RVM do not allow improper priors over the multiple penalty parameters. In this article, we propose a single penalty relevance vector machine (SPRVM) model in which multiple penalty parameters are replaced by a single penalty and we consider a semi Bayesian approach for fitting the SPRVM. The necessary and sufficient conditions for posterior propriety of SPRVM are more liberal than those of RVM and allow for several improper priors over the penalty parameter. Additionally, we also prove the geometric ergodicity of the Gibbs sampler used to analyze the SPRVM model and hence can estimate the asymptotic standard errors associated with the Monte Carlo estimate of the means of the posterior predictive distribution. Such a Monte Carlo standard error cannot be computed in the case of RVM, since the rate of convergence of the Gibbs sampler used to analyze RVM is not known. The predictive performance of RVM and SPRVM is compared by analyzing three real life datasets.

【8】 Antithetic Riemannian Manifold And Quantum-Inspired Hamiltonian Monte Carlo 标题:对偶黎曼流形与量子哈密顿蒙特卡罗

作者:Wilson Tsakane Mongwe,Rendani Mbuvha,Tshilidzi Marwala 机构:School of Electrical Engineering, University of Johannesburg, Auckland Park, South Africa, School of Statistics and Actuarial Science, University of Witwatersrand, Johannesburg, South Africa 链接:https://arxiv.org/abs/2107.02070 摘要:机器学习中目标后验分布的马尔可夫链蒙特卡罗推理主要是通过哈密顿蒙特卡罗及其变体来实现的。这是由于基于哈密顿蒙特卡罗的采样器抑制随机游走行为的能力。与其他马尔可夫链蒙特卡罗方法一样,哈密顿蒙特卡罗方法产生自相关样本,导致估计量的高方差和生成样本的低有效样本容量率。与普通的哈密顿蒙特卡罗相比,在哈密顿蒙特卡罗中加入对偶采样可以产生更高的有效采样率。本文提出了一种新的算法,它是黎曼流形哈密顿蒙特卡罗和量子启发哈密顿蒙特卡罗的对立形式。黎曼流形哈密顿蒙特卡罗算法是对哈密顿蒙特卡罗算法的改进,它考虑了目标的局部几何结构,有利于目标密度的计算。量子启发的哈密顿蒙特卡罗是基于量子粒子,可以有随机质量。量子激发的哈密顿蒙特卡罗使用随机质量矩阵,在尖峰分布和多模分布(如跳跃扩散过程)上比哈密顿蒙特卡罗具有更好的采样效果。利用真实世界的金融市场数据对跳跃扩散过程进行了分析,并利用贝叶斯logistic回归对真实世界的基准分类任务进行了分析。 摘要:Markov Chain Monte Carlo inference of target posterior distributions in machine learning is predominately conducted via Hamiltonian Monte Carlo and its variants. This is due to Hamiltonian Monte Carlo based samplers ability to suppress random-walk behaviour. As with other Markov Chain Monte Carlo methods, Hamiltonian Monte Carlo produces auto-correlated samples which results in high variance in the estimators, and low effective sample size rates in the generated samples. Adding antithetic sampling to Hamiltonian Monte Carlo has been previously shown to produce higher effective sample rates compared to vanilla Hamiltonian Monte Carlo. In this paper, we present new algorithms which are antithetic versions of Riemannian Manifold Hamiltonian Monte Carlo and Quantum-Inspired Hamiltonian Monte Carlo. The Riemannian Manifold Hamiltonian Monte Carlo algorithm improves on Hamiltonian Monte Carlo by taking into account the local geometry of the target, which is beneficial for target densities that may exhibit strong correlations in the parameters. Quantum-Inspired Hamiltonian Monte Carlo is based on quantum particles that can have random mass. Quantum-Inspired Hamiltonian Monte Carlo uses a random mass matrix which results in better sampling than Hamiltonian Monte Carlo on spiky and multi-modal distributions such as jump diffusion processes. The analysis is performed on jump diffusion process using real world financial market data, as well as on real world benchmark classification tasks using Bayesian logistic regression.

【9】 An extended watershed-based zonal statistical AHP model for flood risk estimation: Constraining runoff converging related indicators by sub-watersheds 标题:基于流域的洪水风险分区统计AHP扩展模型:子流域约束径流收敛相关指标

作者:Hongping Zhang,Zhenfeng Shao,Jinqi Zhao,Xiao Huang,Jie Yang,Bin Hu,Wenfu Wu 机构:a, Wenfu, a State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan, University, Wuhan , China;, b Jiangsu Key Laboratory of Resources and Environmental Information Engineering, China University of 备注:This paper is a research paper, it contains 40 pages and 8 figures. This paper is a modest contribution to the ongoing discussions the accuracy of flood risk estimation via AHP model improved by adopting pixels replaced with sub-watersheds as basic unit 链接:https://arxiv.org/abs/2107.02043 摘要:洪水是高度不确定的事件,发生在不同的地区,有不同的先决条件和强度。高可靠的洪水灾害风险图有助于减少洪水对洪水管理、减灾和城市化抗灾能力的影响。在洪水风险评估中,广泛应用的层次分析法(AHP)通常以像元为基本单元,无法在子流域尺度上捕捉到邻近源洪水单元造成的相似威胁。为此,本文提出了一种基于扩展流域的分区统计AHP模型(WZSAHP-Slope&Stream)来约束子流域的径流汇流相关指标。以巢湖流域为例,以2020年7月提取的实际洪水面积为例,验证了该方法的有效性。结果表明,采用多流向划分流域的WZSAHP坡面流模型,用最大统计量法计算距溪流和坡面的距离统计量,其结果优于其他检验方法。与基于像素的AHP方法相比,该方法在验证1中的正确率提高了16%(从67%提高到83%),拟合率提高了1%(从13%提高到14%),在验证2中的正确率提高了37%(从23%提高到60%),拟合率提高了6%(从12%提高到18%)。 摘要:Floods are highly uncertain events, occurring in different regions, with varying prerequisites and intensities. A highly reliable flood disaster risk map can help reduce the impact of floods for flood management, disaster decreasing, and urbanization resilience. In flood risk estimation, the widely used analytic hierarchy process (AHP) usually adopts pixel as a basic unit, it cannot capture the similar threaten caused by neighborhood source flooding cells at sub-watershed scale. Thus, an extended watershed-based zonal statistical AHP model constraining runoff converging related indicators by sub-watersheds (WZSAHP-Slope & Stream) is proposed to fill this gap. Taking the Chaohu basin as test case, we validated the proposed method with a real-flood area extracted in July 2020. The results indicate that the WZSAHP-Slope & Stream model using multiple flow direction division watersheds to calculate statistics of distance from stream and slope by maximum statistic method outperformed other tested methods. Compering with pixel-based AHP method, the proposed method can improve the correct ratio by 16% (from 67% to 83%) and fit ratio by 1% (from 13% to 14%) as in validation 1, and improve the correct ratio by 37% (from 23% to 60%) and fit ratio by 6% (from 12% to 18%) as in validation 2.

【10】 Template-Based Graph Clustering 标题:基于模板的图聚类

作者:Mateus Riva,Florian Yger,Pietro Gori,Roberto M. Cesar Jr.,Isabelle Bloch 机构: LTCI, T´el´ecom Paris, Institut Polytechnique de Paris, France, LAMSADE, Universit´e Paris-Dauphine, PSL Research University, France, IME, University of S˜ao Paulo, Brazil 备注:None 链接:https://arxiv.org/abs/2107.01994 摘要:我们提出了一种新的基于簇(或社区)底层结构信息的图聚类方法。该问题被描述为一个图与一个更小维数的模板的匹配,因此将被观察图的$n$个顶点(待聚类)与模板图的$k$个顶点相匹配,使用其边作为支持信息,并在正交矩阵集上松弛以找到一个$k$维嵌入。通过对聚类密度及其关系进行编码的相关先验知识,我们的方法优于经典方法,特别是在具有挑战性的情况下。 摘要:We propose a novel graph clustering method guided by additional information on the underlying structure of the clusters (or communities). The problem is formulated as the matching of a graph to a template with smaller dimension, hence matching $n$ vertices of the observed graph (to be clustered) to the $k$ vertices of a template graph, using its edges as support information, and relaxed on the set of orthonormal matrices in order to find a $k$ dimensional embedding. With relevant priors that encode the density of the clusters and their relationships, our method outperforms classical methods, especially for challenging cases.

【11】 UCSL : A Machine Learning Expectation-Maximization framework for Unsupervised Clustering driven by Supervised Learning 标题:UCSL:一种监督学习驱动的无监督聚类机器学习期望最大化框架

作者:Robin Louiset,Pietro Gori,Benoit Dufumier,Josselin Houenou,Antoine Grigis,Edouard Duchesnay 机构: Universit´e Paris-Saclay, CEA, Neurospin, Gif-sur-Yvette, France, LTCI, T´el´ecom Paris, Institut Polytechnique de Paris, France 备注:None 链接:https://arxiv.org/abs/2107.01988 摘要:子类型发现是指发现数据集中可解释且一致的子部分,这些子部分也与某个受监督的任务相关。从数学的角度来看,这可以被定义为一个由监督学习驱动的聚类任务,以发现符合监督预测的子群。本文提出了一个通用的期望最大化集成框架UCSL(unsupervisedclusteringdriven by Supervised Learning)。我们的方法是通用的,它可以集成任何聚类方法,可以由二元分类和回归驱动。我们建议通过合并多个线性估计器来构造一个非线性模型,每个簇一个线性估计器。对每个超平面进行估计,使其能够正确识别或预测一个簇。我们使用SVC或Logistic回归进行分类,使用SVR进行回归。此外,为了在更合适的空间内进行聚类分析,我们还提出了一种降维算法,将数据投影到与监督任务相关的正交空间中。利用合成数据和实验数据分析了算法的鲁棒性和泛化能力。特别是,我们验证了它的能力,以确定合适的一致的子类型进行精神疾病聚类分析与已知的地面真相标签。在平衡精度方面,所提出的方法比以前最先进的技术的增益约为+1.9点。最后,我们在一个scikit-learn-compatible Python包中提供代码和示例https://github.com/neurospin-projects/2021_rlouiset_ucsl 摘要:Subtype Discovery consists in finding interpretable and consistent sub-parts of a dataset, which are also relevant to a certain supervised task. From a mathematical point of view, this can be defined as a clustering task driven by supervised learning in order to uncover subgroups in line with the supervised prediction. In this paper, we propose a general Expectation-Maximization ensemble framework entitled UCSL (Unsupervised Clustering driven by Supervised Learning). Our method is generic, it can integrate any clustering method and can be driven by both binary classification and regression. We propose to construct a non-linear model by merging multiple linear estimators, one per cluster. Each hyperplane is estimated so that it correctly discriminates - or predict - only one cluster. We use SVC or Logistic Regression for classification and SVR for regression. Furthermore, to perform cluster analysis within a more suitable space, we also propose a dimension-reduction algorithm that projects the data onto an orthonormal space relevant to the supervised task. We analyze the robustness and generalization capability of our algorithm using synthetic and experimental datasets. In particular, we validate its ability to identify suitable consistent sub-types by conducting a psychiatric-diseases cluster analysis with known ground-truth labels. The gain of the proposed method over previous state-of-the-art techniques is about +1.9 points in terms of balanced accuracy. Finally, we make codes and examples available in a scikit-learn-compatible Python package at https://github.com/neurospin-projects/2021_rlouiset_ucsl

【12】 Neyman-Pearson Hypothesis Testing, Epistemic Reliability and Pragmatic Value-Laden Asymmetric Error Risks 标题:Neyman-Pearson假设检验、认知可靠性与语用价值负载的非对称错误风险

作者:Adam P. Kubiak,Pawel Kawalec,Adam Kiersztyn 机构:John Paul II Catholic University of Lublin, Aleje Racławickie , -, Lublin, Poland, Department of Computer Science, Lublin University of Technology, Nadbystrzycka ,B 备注:Axiomathes (2021) 链接:https://arxiv.org/abs/2107.01944 摘要:Neyman和Pearson的假设检验理论并不能保证最小的认知可靠性:这是一种比错误结论更容易得出正确结论的特征。错误概率的不平等设置可能会对理论的认知可靠性产生负面影响。最重要的是,在消极影响的情况下,没有方法论的调整可以抵消它,因此在这种情况下,所讨论的理论的语用价值取向不可避免地损害了获得真理的目标。 摘要:Neyman and Pearson's theory of testing hypotheses does not warrant minimal epistemic reliability: the feature of driving to true conclusions more often than to false ones. The theory does not protect from the possible negative effects of the pragmatic value-laden unequal setting of error probabilities on the theory's epistemic reliability. Most importantly, in the case of a negative impact, no methodological adjustment is available to neutralize it, so in such cases, the discussed pragmatic-value-ladenness of the theory inevitably compromises the goal of attaining truth.

【13】 On the Estimation of Bivariate Return Curves for Extreme Values 标题:关于二元回归曲线极值的估计

作者:C. J. R. Murphy-Barltrop,J. L. Wadsworth,E. F. Eastoe 机构:STOR-i Centre for Doctoral Training, Lancaster University LA,YR, United Kingdom, Department of Mathematics and Statistics, Lancaster University LA,YR, United Kingdom 备注:41 pages (without supplementary), 11 figures, 2 tables 链接:https://arxiv.org/abs/2107.01942 摘要:在多变量环境中,定义极端风险度量在许多情况下都很重要,例如金融、环境规划和结构工程。在本文中,我们回顾了关于极值二元收益曲线的文献,这是一种风险度量,是对收益水平的自然二元扩展,并提出了基于多元极值模型的新的估计方法,它可以同时考虑渐近依赖性和渐近独立性。我们找出了现有文献中的空白,并提出了新的工具来测试和验证收益曲线,并比较一系列多元模型的估计。然后使用这些工具通过模拟和案例研究对选定的模型进行比较。最后,我们将进行讨论,并列出一些挑战。 摘要:In the multivariate setting, defining extremal risk measures is important in many contexts, such as finance, environmental planning and structural engineering. In this paper, we review the literature on extremal bivariate return curves, a risk measure that is the natural bivariate extension to a return level, and propose new estimation methods based on multivariate extreme value models that can account for both asymptotic dependence and asymptotic independence. We identify gaps in the existing literature and propose novel tools for testing and validating return curves and comparing estimates from a range of multivariate models. These tools are then used to compare a selection of models through simulation and case studies. We conclude with a discussion and list some of the challenges.

【14】 Blind source separation for non-stationary random fields 标题:非平稳随机场的盲源分离

作者:Christoph Muehlmann,François Bachoc,Klaus Nordhausen 机构:Institute of Statistics & Mathematical Methods in Economics, Vienna University of Technology, Austria, Institut de Mathématiques de Toulouse, Université Paul Sabatier, France, Department of Mathematics and Statistics, University of Jyväskylä, Finland 链接:https://arxiv.org/abs/2107.01916 摘要:区域数据分析涉及到对测量数据的分析和建模,这些测量数据在空间上是分开的,特别是对这些数据的典型特征进行说明。也就是说,近距离的测量往往比进一步分离的测量更相似。当考虑多变量空间数据时,这可能也适用于交叉依赖。通常,科学家们对这些数据的线性变换很感兴趣,因为线性变换很容易解释,并且可以用作降维。最近,空间盲源分离(SBSS)被引入,它假设观测数据是由不相关的、弱平稳随机场的线性混合形成的。然而,在实际应用中,众所周知,当空间域的大小增加时,在二阶依赖性在域上变化的意义上,可能违反弱平稳性假设,从而导致非平稳分析。在我们的工作中,我们扩展了SBSS模型来调整这些平稳性违规,提出了三种新的估计,并建立了定义这些估计的分解矩阵泛函的可辨识性和仿射等变性。在一个广泛的模拟研究中,我们研究了我们的估计器的性能,并展示了它们在GEMAS地球化学填图项目的地球化学数据集分析中的应用。 摘要:Regional data analysis is concerned with the analysis and modeling of measurements that are spatially separated by specifically accounting for typical features of such data. Namely, measurements in close proximity tend to be more similar than the ones further separated. This might hold also true for cross-dependencies when multivariate spatial data is considered. Often, scientists are interested in linear transformations of such data which are easy to interpret and might be used as dimension reduction. Recently, for that purpose spatial blind source separation (SBSS) was introduced which assumes that the observed data are formed by a linear mixture of uncorrelated, weakly stationary random fields. However, in practical applications, it is well-known that when the spatial domain increases in size the weak stationarity assumptions can be violated in the sense that the second order dependency is varying over the domain which leads to non-stationary analysis. In our work we extend the SBSS model to adjust for these stationarity violations, present three novel estimators and establish the identifiability and affine equivariance property of the unmixing matrix functionals defining these estimators. In an extensive simulation study, we investigate the performance of our estimators and also show their use in the analysis of a geochemical dataset which is derived from the GEMAS geochemical mapping project.

【15】 Randomized multilevel Monte Carlo for embarrassingly parallel inference 标题:尴尬并行推理的随机化多水平蒙特卡罗方法

作者:Ajay Jasra,Kody J. H. Law,Fangyuan Yu 机构: Computer, Electrical and Mathematical Sciences and Engineering Division, King, Abdullah University of Science and Technology, Thuwal, KSA., Department of Mathematics, University of Manchester, Manchester, M,PL, UK. 链接:https://arxiv.org/abs/2107.01913 摘要:本文总结了近年来在以数据为中心的科学和工程应用背景下开展的一个以推理为中心的研究项目,并对其未来十年的发展趋势进行了预测。在这种情况下,人们往往努力学习复杂的系统,以便在不确定的情况下做出更明智的预测和高风险的决策。在这种情况下必须面对的一些关键挑战是健壮性、可推广性和可解释性。贝叶斯框架解决了这三个挑战,同时还带来了第四个不受欢迎的特性:它通常比确定性框架昂贵得多。在21世纪,并且在过去的十年中,越来越多的方法出现了,这些方法允许人们利用廉价的低保真度模型来预处理算法,以便使用更昂贵的模型进行推理,并使贝叶斯推理在高维和昂贵的模型的上下文中易于处理。值得注意的例子是多层蒙特卡罗(MLMC)、多指标蒙特卡罗(MIMC)和它们的随机化对应物(rMLMC),它们能够证明相对于均方误差(MSE)$1/$MSE实现与维数无关(包括$\infty-$dimension)的规范复杂率。在推理上下文中,通常会丢失一些并行性,但最近通过新的双随机化方法在很大程度上恢复了这一点。这种方法提供了有关无限分辨率目标分布无偏的感兴趣量的i.i.d.样本。在未来的十年里,这一系列算法有可能通过扩展和扩展完全贝叶斯推理来改变以数据为中心的科学和工程,以及经典的机器学习应用,如深度学习。 摘要:This position paper summarizes a recently developed research program focused on inference in the context of data centric science and engineering applications, and forecasts its trajectory forward over the next decade. Often one endeavours in this context to learn complex systems in order to make more informed predictions and high stakes decisions under uncertainty. Some key challenges which must be met in this context are robustness, generalizability, and interpretability. The Bayesian framework addresses these three challenges, while bringing with it a fourth, undesirable feature: it is typically far more expensive than its deterministic counterparts. In the 21st century, and increasingly over the past decade, a growing number of methods have emerged which allow one to leverage cheap low-fidelity models in order to precondition algorithms for performing inference with more expensive models and make Bayesian inference tractable in the context of high-dimensional and expensive models. Notable examples are multilevel Monte Carlo (MLMC), multi-index Monte Carlo (MIMC), and their randomized counterparts (rMLMC), which are able to provably achieve a dimension-independent (including $\infty-$dimension) canonical complexity rate with respect to mean squared error (MSE) of $1/$MSE. Some parallelizability is typically lost in an inference context, but recently this has been largely recovered via novel double randomization approaches. Such an approach delivers i.i.d. samples of quantities of interest which are unbiased with respect to the infinite resolution target distribution. Over the coming decade, this family of algorithms has the potential to transform data centric science and engineering, as well as classical machine learning applications such as deep learning, by scaling up and scaling out fully Bayesian inference.

【16】 Causally Invariant Predictor with Shift-Robustness 标题:具有移位稳健性的因果不变预报器

作者:Xiangyu Zheng,Xinwei Sun,Wei Chen,Tie-Yan Liu 机构: Microsoft Research Asia, Beijing, Peking University, Beijing 链接:https://arxiv.org/abs/2107.01876 摘要:本文提出了一种不变的因果预测方法,该方法对域间的分布转移具有鲁棒性,并最大限度地保留了可转移的不变信息。基于一个分离的因果分解,我们将分布转移描述为系统中的软干预,它涵盖了广泛的分布转移案例,因为我们没有事先说明因果结构或干预变量。本文提出了一种基于do算子的介入条件期望的预测方法,并证明了它是跨域不变的,而不是通过正则化来约束预测的不变性。更重要的是,我们证明了所提出的预测器是一个鲁棒预测器,它使所有域的分布中最坏情况的二次损失最小化。对于经验学习,我们提出了一种基于数据再生的直观灵活的估计方法,并提出了一个局部因果发现过程来指导再生步骤。其关键思想是重新生成数据,使得重新生成的分布与中间图相兼容,这使得我们可以将标准的监督学习方法与重新生成的数据相结合。在合成数据和真实数据上的实验结果表明,我们的预测方法在提高跨领域预测的准确性和鲁棒性方面是有效的。 摘要:This paper proposes an invariant causal predictor that is robust to distribution shift across domains and maximally reserves the transferable invariant information. Based on a disentangled causal factorization, we formulate the distribution shift as soft interventions in the system, which covers a wide range of cases for distribution shift as we do not make prior specifications on the causal structure or the intervened variables. Instead of imposing regularizations to constrain the invariance of the predictor, we propose to predict by the intervened conditional expectation based on the do-operator and then prove that it is invariant across domains. More importantly, we prove that the proposed predictor is the robust predictor that minimizes the worst-case quadratic loss among the distributions of all domains. For empirical learning, we propose an intuitive and flexible estimating method based on data regeneration and present a local causal discovery procedure to guide the regeneration step. The key idea is to regenerate data such that the regenerated distribution is compatible with the intervened graph, which allows us to incorporate standard supervised learning methods with the regenerated data. Experimental results on both synthetic and real data demonstrate the efficacy of our predictor in improving the predictive accuracy and robustness across domains.

【17】 Variational Bayesian Inference for the Polytomous-Attribute Saturated Diagnostic Classification Model with Parallel Computing 标题:并行计算多属性饱和诊断分类模型的变分贝叶斯推理

作者:Motonori Oka,Shun Saso,Kensuke Okada 机构:Graduate School of Education, The University of Tokyo, Author Note 链接:https://arxiv.org/abs/2107.01865 摘要:诊断分类模型(DCMs)作为一种辅助教育环境下形成性评估的统计工具,越来越多地被用来提供关于考生属性的诊断信息。DCMs通常采用掌握和不掌握属性的两分法来表示属性的掌握状态。然而,许多实际设置涉及不同级别的掌握状态,而不是单一属性中的简单二分法。虽然多属性离散余弦模型可以满足这一实际需求,但由于多属性离散余弦模型的数量比二元属性离散余弦模型的数量多,使得其在马尔可夫链蒙特卡罗估计中的计算量大,阻碍了其大规模应用。本文研究了一种可扩展的多属性离散余弦模型的贝叶斯估计方法,并在已有的二值属性离散余弦模型和多属性离散余弦模型的文献基础上,提出了一种多属性饱和离散余弦模型的变分贝叶斯算法。此外,我们还针对所提出的VB算法提出了并行计算的配置,以达到更好的计算效率。montecarlo模拟结果表明,该方法在各种条件下都具有良好的参数恢复性能。通过一个实例说明了该方法的有效性。 摘要:As a statistical tool to assist formative assessments in educational settings, diagnostic classification models (DCMs) have been increasingly used to provide diagnostic information regarding examinees' attributes. DCMs often adopt dichotomous division such as mastery and non-mastery of attributes to express mastery states of attributes. However, many practical settings involve different levels of mastery states rather than a simple dichotomy in a single attribute. Although this practical demand can be addressed by polytomous-attribute DCMs, their computational cost in a Markov chain Monte Carlo estimation impedes their large-scale applications due to the larger number of polytomous-attribute mastery patterns than that of binary-attribute ones. This study considers a scalable Bayesian estimation method for polytomous-attribute DCMs and developed a variational Bayesian (VB) algorithm for a polytomous-attribute saturated DCM -- a generalization of polytomous-attribute DCMs -- by building on the existing literature in VB for binary-attribute DCMs and polytomous-attribute DCMs. Furthermore, we proposed the configuration of parallel computing for the proposed VB algorithm to achieve better computational efficiency. Monte Carlo simulations revealed that our method exhibited the high performance in parameter recovery under a wide range of conditions. An empirical example is used to demonstrate the utility of our method.

【18】 Matching a Desired Causal State via Shift Interventions 标题:通过移位干预匹配期望的因果状态

作者:Jiaqi Zhang,Chandler Squires,Caroline Uhler 机构:LIDS, EECS, and IDSS, MIT, Cambridge, MA , LIDS EECS, and IDSS, MIT 链接:https://arxiv.org/abs/2107.01850 摘要:将因果系统从一个给定的初始状态转化为一个期望的目标状态是一个贯穿控制理论、生物学和材料科学等多个领域的重要课题。在因果模型中,这种转换可以通过执行一组干预来实现。在本文中,我们考虑的问题,确定一个转移干预,匹配所需的平均值系统通过主动学习。我们定义了可从移位干预中识别的Markov等价类,并提出了两种保证精确匹配期望均值的主动学习策略。然后,我们推导了所需干预次数的最坏情况下界,并证明了这些策略对于某些类别的图是最优的。特别是,我们表明,我们的策略可能需要的干预比以前考虑的方法,优化结构学习的基础因果图指数少。与我们的理论结果一致,我们也在实验上证明,我们提出的主动学习策略需要较少的干预相比,几个基线。 摘要:Transforming a causal system from a given initial state to a desired target state is an important task permeating multiple fields including control theory, biology, and materials science. In causal models, such transformations can be achieved by performing a set of interventions. In this paper, we consider the problem of identifying a shift intervention that matches the desired mean of a system through active learning. We define the Markov equivalence class that is identifiable from shift interventions and propose two active learning strategies that are guaranteed to exactly match a desired mean. We then derive a worst-case lower bound for the number of interventions required and show that these strategies are optimal for certain classes of graphs. In particular, we show that our strategies may require exponentially fewer interventions than the previously considered approaches, which optimize for structure learning in the underlying causal graph. In line with our theoretical results, we also demonstrate experimentally that our proposed active learning strategies require fewer interventions compared to several baselines.

【19】 Zero-modified Count Time Series with Markovian Intensities 标题:具有马尔可夫强度的零修正计数时间序列

作者:N. Balakrishna,Muhammed Anvar,Bovas Abraham 机构: Cochin University of Science and Technology, University of Waterloo 备注:31 pages including Tables and Figures 链接:https://arxiv.org/abs/2107.01813 摘要:本文提出了一种分析零膨胀或零收缩计数时间序列的方法。特别地,研究了由非负Markov序列生成的强度为零修正Poisson和零修正负二项式序列。模型参数的估计采用估计方程的方法,该方法将模型表示为广义状态空间形式。利用广义Kalman滤波器提取估计所需的潜在强度。利用模拟数据和实际数据说明了该模型及其估计方法的应用。 摘要:This paper proposes a method for analyzing count time series with inflation or deflation of zeros. In particular, zero-modified Poisson and zero-modified negative binomial series with intensities generated by non-negative Markov sequences are studied in detail. Parameters of the model are estimated by the method of estimating equations which is facilitated by expressing the model in a generalized state space form. The latent intensities required for estimation are extracted using generalized Kalman filter. The applications of proposed model and its estimation methods are illustrated using simulated and real data sets.

【20】 Statistical Theory for Imbalanced Binary Classification 标题:非平衡二分分类的统计理论

作者:Shashank Singh,Justin Khim 机构:Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany, Amazon, New York, NY 备注:Parts of this paper have been revised from arXiv:2004.04715v2 [math.ST] 链接:https://arxiv.org/abs/2107.01777 摘要:在为二元分类发展的大量统计理论中,不平衡分类几乎没有什么有意义的结果,不平衡分类的数据主要来自两类中的一类。现有的理论至少面临两大挑战。首先,有意义的结果必须考虑比分类精度更复杂的性能度量。为了解决这一问题,我们将贝叶斯最优分类器的一个新的推广刻画为从混淆矩阵计算出的任何性能度量,并用它来说明在均匀($\mathcal{L}\infty$)损失下,如何根据估计类概率函数的误差来获得相对性能保证。其次,如我们所示,最佳分类性能取决于类不平衡的某些性质,这些性质以前没有被形式化。具体来说,我们提出了一种新的类失衡的子类型,我们称之为均匀类失衡。我们分析了均匀类不平衡是如何影响最优分类器性能的,并且表明它需要不同于其他类型的类不平衡的分类器行为。我们进一步说明这两个贡献的情况下,k$-最近邻分类,我们开发了新的保证。这些结果为非平衡二元分类提供了第一个有意义的有限样本统计理论。 摘要:Within the vast body of statistical theory developed for binary classification, few meaningful results exist for imbalanced classification, in which data are dominated by samples from one of the two classes. Existing theory faces at least two main challenges. First, meaningful results must consider more complex performance measures than classification accuracy. To address this, we characterize a novel generalization of the Bayes-optimal classifier to any performance metric computed from the confusion matrix, and we use this to show how relative performance guarantees can be obtained in terms of the error of estimating the class probability function under uniform ($\mathcal{L}_\infty$) loss. Second, as we show, optimal classification performance depends on certain properties of class imbalance that have not previously been formalized. Specifically, we propose a novel sub-type of class imbalance, which we call Uniform Class Imbalance. We analyze how Uniform Class Imbalance influences optimal classifier performance and show that it necessitates different classifier behavior than other types of class imbalance. We further illustrate these two contributions in the case of $k$-nearest neighbor classification, for which we develop novel guarantees. Together, these results provide some of the first meaningful finite-sample statistical theory for imbalanced binary classification.

【21】 Extending Latent Basis Growth Model to Explore Joint Development in the Framework of Individual Measurement Occasions 标题:拓展潜在基数增长模型探索个体测量场合框架下的联动发展

作者:Jin Liu 备注:Draft version 1.1, 07/04/2021. This paper has not been peer reviewed. Please do not copy or cite without author's permission 链接:https://arxiv.org/abs/2107.01773 摘要:多个领域中的纵向过程通常被理论化为非线性,这给统计带来了独特的挑战。经验研究者通常通过权衡模型对于非线性性质的具体程度、模型是否提供有效的估计和可解释的系数来选择非线性纵向模型。潜在基增长模型(LBGMs)是一种可以绕过这些权衡的方法:它不需要任何函数形式的规范;此外,它的估计过程是迅速的,估计是直截了当的解释。我们提出了一种新的LBGMs规范,允许(1)不等间距的研究波和(2)每个波周围的单独测量场合。然后我们扩展LBGMs来探索多个重复的结果,因为纵向过程很少单独展开。通过仿真研究和实际数据分析,提出了该模型。仿真研究表明,该模型能提供95%置信区间的无偏准确估计。通过使用纵向阅读和数学成绩的真实世界分析,我们证明了所提出的平行LBGM能够捕获这两种能力的潜在发展模式,并且LBGMs的新规范有助于纵向过程具有不同时间结构的联合发展。我们还为所提出的模型提供了相应的代码。 摘要:Longitudinal processes in multiple domains are often theorized to be nonlinear, which poses unique statistical challenges. Empirical researchers often select a nonlinear longitudinal model by weighing how specific the model must be regarding the nature of the nonlinearity, whether the model provides efficient estimation and interpretable coefficients. Latent basis growth models (LBGMs) are one method that can get around these tradeoffs: it does not require specification of any functional form; additionally, its estimation process is expeditious, and estimates are straightforward to interpret. We propose a novel specification for LBGMs that allows for (1) unequally-spaced study waves and (2) individual measurement occasions around each wave. We then extend LBGMs to explore multiple repeated outcomes because longitudinal processes rarely unfold in isolation. We present the proposed model by simulation studies and real-world data analyses. Our simulation studies demonstrate that the proposed model can provide unbiased and accurate estimates with target coverage probabilities of a 95% confidence interval. With the real-world analyses using longitudinal reading and mathematics scores, we demonstrate that the proposed parallel LBGM can capture the underlying developmental patterns of these two abilities and that the novel specification of LBGMs is helpful in joint development where longitudinal processes have different time structures. We also provide the corresponding code for the proposed model.

【22】 Nonparametric Detection of Multiple Location-Scale Change Points via Wild Binary Segmentation 标题:基于野二值分割的多位置尺度变化点非参数检测

作者:Gordon J. Ross 机构:University of Edinburgh 链接:https://arxiv.org/abs/2107.01742 摘要:虽然参数多变点检测已经得到了广泛的研究,但对于在观测序列的分布未知的情况下检测多个变点的非参数任务却很少有人关注。关于这一主题的大多数现有工作要么基于可能遭受误报检测的惩罚代价函数,要么基于可能无法检测变化点的某些配置的二进制分割。我们介绍了一种新的变化点检测方法,该方法将最近提出的野生二进制分割(WBS)过程应用于非参数设置。我们的方法是基于秩检验统计的使用,它在检测位置和/或规模变化方面特别强大。仿真结果表明,与现有方法相比,本文提出的非参数WBS方法具有良好的性能,特别是在检测尺度变化时。我们应用我们的程序来研究一个涉及作者写作风格变化点的笔法问题,并在一个相关的R包中提供了我们算法的完整实现。 摘要:While parametric multiple change point detection has been widely studied, less attention has been given to the nonparametric task of detecting multiple change points in a sequence of observations when their distribution is unknown. Most existing work on this topic is either based on penalized cost functions which can suffer from false positive detections, or on binary segmentation which can fail to detect certain configurations of change points. We introduce a new approach to change point detection which adapts the recently proposed Wild Binary Segmentation (WBS) procedure to a nonparametric setting. Our approach is based on the use of rank based test statistics which are especially powerful at detecting changes in location and/or scale. We show via simulation that the resulting nonparametric WBS procedure has favorable performance compared to existing methods, particularly when it comes to detecting changes in scale. We apply our procedure to study a problem in stylometry involving change points in an author's writing style, and provide a full implementation of our algorithm in an associated R package.

【23】 Latent structure blockmodels for Bayesian spectral graph clustering 标题:贝叶斯谱图聚类的潜在结构块模型

作者:Francesco Sanna Passino,Nicholas A. Heard 机构:Department of Mathematics, Imperial College London, Queen’s Gate, SW,AZ, London (United Kingdom) 链接:https://arxiv.org/abs/2107.01734 摘要:网络邻接矩阵的谱嵌入通常产生近似于低维子流形结构的节点表示。特别地,当从潜在位置模型生成图时,期望出现隐藏子结构。此外,网络中社区的存在可能会在嵌入中产生特定于社区的子流形结构,但这在大多数网络统计模型中没有明确说明。在本文中,提出了一类称为潜在结构块模型(LSBM)的模型来解决这种情况,允许在存在特定于社区的一维流形结构时进行图聚类。LSBMs关注于一类特定的潜在空间模型,即随机点积图(RDPG),并将一个潜在子流形分配给每个群体的潜在位置。讨论了一种基于LSBMs的贝叶斯嵌入模型,并通过仿真和实际网络数据验证了该模型的有效性。该模型能够正确地恢复生活在一维流形中的底层群落,即使底层曲线的参数形式未知,也能在各种真实数据上取得显著效果。 摘要:Spectral embedding of network adjacency matrices often produces node representations living approximately around low-dimensional submanifold structures. In particular, hidden substructure is expected to arise when the graph is generated from a latent position model. Furthermore, the presence of communities within the network might generate community-specific submanifold structures in the embedding, but this is not explicitly accounted for in most statistical models for networks. In this article, a class of models called latent structure block models (LSBM) is proposed to address such scenarios, allowing for graph clustering when community-specific one dimensional manifold structure is present. LSBMs focus on a specific class of latent space model, the random dot product graph (RDPG), and assign a latent submanifold to the latent positions of each community. A Bayesian model for the embeddings arising from LSBMs is discussed, and shown to have a good performance on simulated and real world network data. The model is able to correctly recover the underlying communities living in a one-dimensional manifold, even when the parametric form of the underlying curves is unknown, achieving remarkable results on a variety of real data.

【24】 Rates of Estimation of Optimal Transport Maps using Plug-in Estimators via Barycentric Projections 标题:通过重心投影使用插件估计器估计最优交通地图的速度

作者:Nabarun Deb,Promit Ghosal,Bodhisattva Sen 机构: Amsterdam Avenue, New York, NY , Massachusetts Ave, Cambridge, MA 备注:40 pages 链接:https://arxiv.org/abs/2107.01718 摘要:在$\mathbb{R}^d$上,两个概率分布$\mu$和$\nu$之间的最优传输映射在机器学习和统计学中都有广泛的应用。实际上,这些地图需要根据$\mu$和$\nu$采样的数据进行估计。插件估计器可能是计算最优传输领域中最常用的传输图估计方法。在本文中,我们提供了一个通过重心投影定义的一般插件估计的收敛速度的综合分析。我们的主要贡献是一个新的重心投影稳定性估计,它在最小光滑性假设下进行,可以用来分析一般的插件估计。首先给出了最优传输映射的自然离散和半离散估计的收敛速度,说明了这种稳定性估计的有效性。然后,我们使用相同的稳定性估计来说明,在Besov型或Sobolev型的额外光滑性假设下,基于小波或核平滑的插件估计分别加快了收敛速度,并显著减轻了自然离散/半离散估计所遭受的维数灾难。作为我们分析的副产品,我们还获得了在上述平滑度假设下,$\mu$和$\nu$之间的Wasserstein距离的插入式估值器的更快收敛速度,从而补充了Chizat等人(2020)的最新结果。最后,我们说明了我们的结果在获得两个概率分布之间的Wasserstein重心的收敛速度和获得最近一些基于最优传输的独立性测试的渐近检测阈值方面的适用性。 摘要:Optimal transport maps between two probability distributions $\mu$ and $\nu$ on $\mathbb{R}^d$ have found extensive applications in both machine learning and statistics. In practice, these maps need to be estimated from data sampled according to $\mu$ and $\nu$. Plug-in estimators are perhaps most popular in estimating transport maps in the field of computational optimal transport. In this paper, we provide a comprehensive analysis of the rates of convergences for general plug-in estimators defined via barycentric projections. Our main contribution is a new stability estimate for barycentric projections which proceeds under minimal smoothness assumptions and can be used to analyze general plug-in estimators. We illustrate the usefulness of this stability estimate by first providing rates of convergence for the natural discrete-discrete and semi-discrete estimators of optimal transport maps. We then use the same stability estimate to show that, under additional smoothness assumptions of Besov type or Sobolev type, wavelet based or kernel smoothed plug-in estimators respectively speed up the rates of convergence and significantly mitigate the curse of dimensionality suffered by the natural discrete-discrete/semi-discrete estimators. As a by-product of our analysis, we also obtain faster rates of convergence for plug-in estimators of $W_2(\mu,\nu)$, the Wasserstein distance between $\mu$ and $\nu$, under the aforementioned smoothness assumptions, thereby complementing recent results in Chizat et al. (2020). Finally, we illustrate the applicability of our results in obtaining rates of convergence for Wasserstein barycenters between two probability distributions and obtaining asymptotic detection thresholds for some recent optimal-transport based tests of independence.

【25】 Calibrating generalized predictive distributions 标题:校准广义预测分布

作者:Pei-Shien Wu,Ryan Martin 机构:and 备注:33 pages, 5 figures, 6 tables 链接:https://arxiv.org/abs/2107.01688 摘要:在预测问题中,通常对数据生成过程建模,然后使用基于模型的过程(如贝叶斯预测分布)来量化下一次观测的不确定性。然而,如果假定的模型是错误的,那么它的预测可能不会被校准——也就是说,预测分布的分位数可能不是名义上的频率预测上限,甚至是渐近的。我们提出了一种策略,在该策略中,数据本身有助于确定假设的基于模型的解决方案是否应进行调整,以考虑模型的错误指定,而不是放弃基于模型的公式对更复杂的非基于模型的方法的舒适性。这是通过一个广义Bayes公式来实现的,在这个公式中,通过所提出的广义预测校正(GPrC)算法来调整学习率参数,使得即使在模型错误的情况下也能校正预测分布。通过大量的数值实验,证明了该算法的有效性、有效性和鲁棒性。 摘要:In prediction problems, it is common to model the data-generating process and then use a model-based procedure, such as a Bayesian predictive distribution, to quantify uncertainty about the next observation. However, if the posited model is misspecified, then its predictions may not be calibrated -- that is, the predictive distribution's quantiles may not be nominal frequentist prediction upper limits, even asymptotically. Rather than abandoning the comfort of a model-based formulation for a more complicated non-model-based approach, here we propose a strategy in which the data itself helps determine if the assumed model-based solution should be adjusted to account for model misspecification. This is achieved through a generalized Bayes formulation where a learning rate parameter is tuned, via the proposed generalized predictive calibration (GPrC) algorithm, to make the predictive distribution calibrated, even under model misspecification. Extensive numerical experiments are presented, under a variety of settings, demonstrating the proposed GPrC algorithm's validity, efficiency, and robustness.

【26】 Time Series Graphical Lasso and Sparse VAR Estimation 标题:时间序列图套索与稀疏VAR估计

作者:Aramayis Dallakyan,Rakheon Kim,Mohsen Pourahmadi 机构:Texas A&M University 链接:https://arxiv.org/abs/2107.01659 摘要:我们改进了Davis等人(2016)的两阶段稀疏向量自回归(sVAR)方法,提出了一种替代的两阶段改进sVAR方法,该方法在第一阶段依赖时间序列图形套索来估计稀疏逆谱密度,第二阶段使用错误发现率(FDR)过程来细化AR系数矩阵的非零项。该方法的优点是避免了谱密度矩阵的求逆,但必须处理具有复数项的Hermitian矩阵的优化问题。它在预测性能损失不大的情况下显著地提高了计算时间。我们研究了我们提出的方法的性质,并用模拟的和真实的宏观经济数据集比较了这两种方法的性能。我们的模拟结果显示,当目标是学习AR系数矩阵的结构时,所提出的修正或msVAR是首选,而当最终任务是预测时,sVAR优于msVAR。 摘要:We improve upon the two-stage sparse vector autoregression (sVAR) method in Davis et al. (2016) by proposing an alternative two-stage modified sVAR method which relies on time series graphical lasso to estimate sparse inverse spectral density in the first stage, and the second stage refines non-zero entries of the AR coefficient matrices using a false discovery rate (FDR) procedure. Our method has the advantage of avoiding the inversion of the spectral density matrix but has to deal with optimization over Hermitian matrices with complex-valued entries. It significantly improves the computational time with a little loss in forecasting performance. We study the properties of our proposed method and compare the performance of the two methods using simulated and a real macro-economic dataset. Our simulation results show that the proposed modification or msVAR is a preferred choice when the goal is to learn the structure of the AR coefficient matrices while sVAR outperforms msVAR when the ultimate task is forecasting.

【27】 Learning Bayesian Networks through Birkhoff Polytope: A Relaxation Method 标题:通过Birkhoff多面体学习贝叶斯网络:一种松弛方法

作者:Aramayis Dallakyan,Mohsen Pourahmadi 机构: the ordering space (or the space oftopological ordering) has been exploited for score-basedDepartment of Statistics, Texas A&M University 链接:https://arxiv.org/abs/2107.01658 摘要:我们建立了一个学习有向无环图(DAG)的新框架,当数据由高斯线性结构方程模型生成时。它由两部分组成:(1)在正则化高斯对数似然中引入置换矩阵作为新的参数来表示变量的排序;(2)给定序,通过逆协方差矩阵的稀疏Cholesky因子估计DAG结构。对于置换矩阵估计,我们提出了一种松弛技术,避免了阶估计的NP难组合问题。在给定排序的情况下,采用逐行解耦的循环坐标下降算法估计稀疏Cholesky因子。我们的框架可以恢复DAG,而无需对非循环约束进行昂贵的验证或枚举可能的父集。在已知变量阶数的情况下,证明了算法的数值收敛性和Cholesky因子估计的一致性。通过几个模拟和宏观经济数据集,我们研究了该方法的适用范围和性能。 摘要:We establish a novel framework for learning a directed acyclic graph (DAG) when data are generated from a Gaussian, linear structural equation model. It consists of two parts: (1) introduce a permutation matrix as a new parameter within a regularized Gaussian log-likelihood to represent variable ordering; and (2) given the ordering, estimate the DAG structure through sparse Cholesky factor of the inverse covariance matrix. For permutation matrix estimation, we propose a relaxation technique that avoids the NP-hard combinatorial problem of order estimation. Given an ordering, a sparse Cholesky factor is estimated using a cyclic coordinatewise descent algorithm which decouples row-wise. Our framework recovers DAGs without the need for an expensive verification of the acyclicity constraint or enumeration of possible parent sets. We establish numerical convergence of the algorithm, and consistency of the Cholesky factor estimator when the order of variables is known. Through several simulated and macro-economic datasets, we study the scope and performance of the proposed methodology.

【28】 Discussion of the manuscript: Spatial+ a novel approach to spatial confounding 标题:《手稿:空间+空间混淆的一种新方法》评介

作者:Georgia Papadogeorgou 机构:Department of Statistics, University of Florida, I congratulate Dupont, method for estimation in the presence of spatial confounding, and for addressing some of the compli- 链接:https://arxiv.org/abs/2107.01644 摘要:我祝贺杜邦、伍德和奥古斯丁(DWA hereon)提供了一种易于实现的方法,用于在存在空间混淆的情况下进行估计,并解决了该主题的一些复杂方面。该方法基于空间基函数对感兴趣的协变量进行回归,并在结果回归中使用该模型的残差。作者表明,如果协变量不完全是空间的,这种方法导致一致的估计条件之间的联系暴露和结果。下面我将讨论在空间环境中推理的基本概念和操作问题:(I)目标量及其可解释性,(ii)协变量的非空间方面及其相对空间尺度,以及(iii)空间平滑的影响。虽然DWA提供了一些关于这些问题的见解,但我相信听众可能会从更深入的讨论中受益。在接下来的内容中,我将重点放在研究人员对解释给定协变量和结果之间的关系感兴趣的环境中。我将感兴趣的协变量称为风险敞口,以区别于其他风险敞口。 摘要:I congratulate Dupont, Wood and Augustin (DWA hereon) for providing an easy-to-implement method for estimation in the presence of spatial confounding, and for addressing some of the complicated aspects on the topic. The method regresses the covariate of interest on spatial basis functions and uses the residuals of this model in an outcome regression. The authors show that, if the covariate is not completely spatial, this approach leads to consistent estimation of the conditional association between the exposure and the outcome. Below I discuss conceptual and operational issues that are fundamental to inference in spatial settings: (i) the target quantity and its interpretability, (ii) the non-spatial aspect of covariates and their relative spatial scales, and (iii) the impact of spatial smoothing. While DWA provide some insights on these issues, I believe that the audience might benefit from a deeper discussion. In what follows, I focus on the setting where a researcher is interested in interpreting the relationship between a given covariate and an outcome. I refer to the covariate of interest as the exposure to differentiate it from the rest.

【29】 The Role of "Live" in Livestreaming Markets: Evidence Using Orthogonal Random Forest

作者:Ziwei Cong,Jia Liu,Puneet Manchanda 链接:https://arxiv.org/abs/2107.01629 摘要:对于livestreaming这种不断增长的媒体,人们普遍认为它的价值在于它的“直播”组件。在本文中,我们利用一个大型livestreaming平台的数据来检验这个观点。我们能够做到这一点,因为这个平台还允许观众购买录制的livestream版本。我们通过估计需求在livestream之前、当天和之后对价格的反应来总结livestreaming内容的价值。为此,我们提出了一个广义正交随机森林框架。这个框架允许我们在存在高维混杂因素的情况下估计异质治疗效果,这些混杂因素与治疗政策(即价格)的关系复杂但部分已知。我们发现,在价格弹性的需求在时间上的距离,以安排直播一天和一天之后显着的动态。具体来说,随着时间的推移,需求对价格的敏感度逐渐降低到直播日,而在直播日则没有弹性。在后livestream时期,需求仍然对价格敏感,但远低于前livestream时期。这表明livestreaming的价值在live组件之外持续存在。最后,我们为驱动我们结果的可能机制提供了提示性证据。这些是livestream前后模式的质量不确定性降低,以及在livestream的当天与创建者进行实时交互的可能性。 摘要:The common belief about the growing medium of livestreaming is that its value lies in its "live" component. In this paper, we leverage data from a large livestreaming platform to examine this belief. We are able to do this as this platform also allows viewers to purchase the recorded version of the livestream. We summarize the value of livestreaming content by estimating how demand responds to price before, on the day of, and after the livestream. We do this by proposing a generalized Orthogonal Random Forest framework. This framework allows us to estimate heterogeneous treatment effects in the presence of high-dimensional confounders whose relationships with the treatment policy (i.e., price) are complex but partially known. We find significant dynamics in the price elasticity of demand over the temporal distance to the scheduled livestreaming day and after. Specifically, demand gradually becomes less price sensitive over time to the livestreaming day and is inelastic on the livestreaming day. Over the post-livestream period, demand is still sensitive to price, but much less than the pre-livestream period. This indicates that the vlaue of livestreaming persists beyond the live component. Finally, we provide suggestive evidence for the likely mechanisms driving our results. These are quality uncertainty reduction for the patterns pre- and post-livestream and the potential of real-time interaction with the creator on the day of the livestream.

【30】 Accounting for Uncertainty When Estimating Counts Through an Average Rounded to the Nearest Integer 标题:通过四舍五入到最接近的整数的平均值估计计数时的不确定性

作者:Roberto Rivera,Axel Cortes-Cubero,Roberto Reyes-Carranza,Wolfgang Rolke 机构: University of Puerto Rico, †Department of Mathematical Sciences, ‡Department of Mathematical Sciences, §Department of Mathematical Sciences 链接:https://arxiv.org/abs/2107.01618 摘要:在实践中,舍入的使用是无处不在的。虽然研究人员已经研究了舍入连续随机变量的含义,舍入也可以应用于离散随机变量的函数。例如,为了推断两个时间段的自杀差异,当局可以提供每个时间段的四舍五入平均死亡人数。自杀率在世界范围内往往相对较低,这种舍入可能严重影响对自杀率变化的推断。本文研究了用四舍五入到最近整数平均估计非负离散随机变量的情形。具体来说,我们的兴趣是从Y的pmf中推断出一个参数,当我们得到U=n[Y/n]作为Y的代理时。U、E(U)和Var(U)的概率母函数反映了Y的支撑粗化效应。对一些特殊情况,讨论了分布参数的矩和估计。在某些条件下,舍入的影响很小。然而,我们也发现四舍五入可以显著影响统计推断的情形,如两个应用所示。我们提出的简单方法能够部分抵消舍入误差的影响。 摘要:In practice, the use of rounding is ubiquitous. Although researchers have looked at the implications of rounding continuous random variables, rounding may be applied to functions of discrete random variables as well. For example, to infer on suicide difference between two time periods, authorities may provide a rounded average of deaths for each period. Suicide rates tend to be relatively low around the world and such rounding may seriously affect inference on the change of suicide rate. In this paper, we study the scenario when a rounded to nearest integer average is used to estimate a non-negative discrete random variable. Specifically, our interest is in drawing inference on a parameter from the pmf of Y, when we get U=n[Y/n]as a proxy for Y. The probability generating function of U, E(U), and Var(U) capture the effect of the coarsening of the support of Y. Also, moments and estimators of distribution parameters are explored for some special cases. Under certain conditions, there is little impact from rounding. However, we also find scenarios where rounding can significantly affect statistical inference as demonstrated in two applications. The simple methods we propose are able to partially counter rounding error effects.

【31】 Deep Gaussian Process Emulation using Stochastic Imputation 标题:基于随机插补的深高斯过程仿真

作者:Deyu Ming,Daniel Williamson,Serge Guillas 机构:Department of Statistical Science, University College London, UK, College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK 链接:https://arxiv.org/abs/2107.01590 摘要:提出了一种新的基于随机插补的深高斯过程(DGP)推理方法。通过随机输入潜在层,该方法将DGP转换为链接GP,这是一种通过连接前馈耦合GPs系统形成的最先进的代理模型。这种变换使得DGP训练过程简单而有效,只需要对传统的静止GPs进行优化。此外,分析可处理的平均数和方差的链接GP允许一个实现预测从DGP模拟器在一个快速和准确的方式。我们通过一系列综合实例和实际应用证明了该方法的有效性,并与变分推理和全贝叶斯方法进行了比较,证明了该方法是一种具有竞争力的DGP代理建模方法。实现该方法的$\texttt{Python}$包$\texttt{dgpsi}$也在https://github.com/mingdeyu/DGP. 摘要:We propose a novel deep Gaussian process (DGP) inference method for computer model emulation using stochastic imputation. By stochastically imputing the latent layers, the approach transforms the DGP into the linked GP, a state-of-the-art surrogate model formed by linking a system of feed-forward coupled GPs. This transformation renders a simple while efficient DGP training procedure that only involves optimizations of conventional stationary GPs. In addition, the analytically tractable mean and variance of the linked GP allows one to implement predictions from DGP emulators in a fast and accurate manner. We demonstrate the method in a series of synthetic examples and real-world applications, and show that it is a competitive candidate for efficient DGP surrogate modeling in comparison to the variational inference and the fully-Bayesian approach. A $\texttt{Python}$ package $\texttt{dgpsi}$ implementing the method is also produced and available at https://github.com/mingdeyu/DGP.

【32】 One-step TMLE to target cause-specific absolute risks and survival curves 标题:针对特定原因的绝对风险和生存曲线的一步TMLE

作者:Helene C. W. Rytgaard,Mark J. van der Laan 机构:Section of Biostatistics, University of Copenhagen, Denmark, Division of Biostatistics, University of California, Berkeley 备注:21 pages (including appendix), 2 figures, 5 tables 链接:https://arxiv.org/abs/2107.01537 摘要:本文研究一般竞争风险和生存分析设置下的一步目标极大似然估计方法,其中事件时间取值于正实线$R_+$且受右删失。我们总体上关注基线治疗决策(静态、动态或随机)的效应,它们可能被治疗前协变量混杂。我们指出本工作的两项总体贡献。首先,我们的方法可用于在竞争风险设置中获得所有绝对风险的同时推断。第二,我们给出了一个实用的结果:通过在足够精细的点网格上进行靶向,实现对整条生存曲线或整条绝对风险曲线随时间的推断。一步程序基于每个原因别风险(hazard)的一维通用最不利子模型,可沿相应的通用最不利子模型以递归步骤实施。我们给出了一个定理,刻画无穷维目标参数的估计量达到弱收敛的条件。我们的实证研究演示了这些方法的使用。 摘要:This paper considers one-step targeted maximum likelihood estimation method for general competing risks and survival analysis settings where event times take place on the positive real line R+ and are subject to right-censoring. Our interest is overall in the effects of baseline treatment decisions, static, dynamic or stochastic, possibly confounded by pre-treatment covariates. We point out two overall contributions of our work. First, our method can be used to obtain simultaneous inference across all absolute risks in competing risks settings. Second, we present a practical result for achieving inference for the full survival curve, or a full absolute risk curve, across time by targeting over a fine enough grid of points. The one-step procedure is based on a one-dimensional universal least favorable submodel for each cause-specific hazard that can be implemented in recursive steps along a corresponding universal least favorable submodel. We present a theorem for conditions to achieve weak convergence of the estimator for an infinite-dimensional target parameter. Our empirical study demonstrates the use of the methods.

【33】 Selection of invalid instruments can improve estimation in Mendelian randomization 标题:孟德尔随机化中无效工具的选择可以改善估计

作者:Ashish Patel,Francis J. Ditraglia,Verena Zuber,Stephen Burgess 机构:a MRC Biostatistics Unit, University of Cambridge, b Department of Economics, University of Oxford, c Department of Epidemiology and Biostatistics, Imperial College London, d Cardiovascular Epidemiology Unit, University of Cambridge 备注:56 pages, 8 figures 链接:https://arxiv.org/abs/2107.01513 摘要:孟德尔随机化(MR)是一种广泛用于识别风险因素与疾病之间因果联系的方法。MR分析的一个基本部分是选择合适的遗传变异作为工具变量。目前的做法通常是只选择那些被认为满足某些排除限制的遗传变异,以期消除未观测混杂带来的偏倚。由于未知的多效性效应(即不经由暴露而对结局产生的直接影响),更多的遗传变异可能违反这些排除限制,但纳入它们可以提高因果效应估计的精度,代价是允许一定的偏倚。我们通过从许多弱且局部无效的工具中仔细选择,探讨如何最优地处理这种偏倚-方差权衡。具体而言,我们研究了一种针对公开的两样本遗传关联汇总数据的聚焦工具选择方法,即根据遗传变异如何影响因果效应估计的渐近均方误差来选择遗传变异。我们展示了对多效性效应性质的不同限制如何对选择后推断的质量产生重要影响。特别是,系统性多效性下的聚焦选择方法允许相合的模型选择,但在实践中容易受到赢家诅咒偏倚的影响;而更一般的特异性多效性形式只允许保守的模型选择,但提供一致有效的置信区间。我们提出了一种新方法,通过对多效性施加支撑限制来收紧诚实置信区间。我们将结果应用于几个真实数据例子,它们表明工具的最优选择不仅包含有生物学依据的有效工具,还额外包含数百个潜在的多效性变异。 摘要:Mendelian randomization (MR) is a widely-used method to identify causal links between a risk factor and disease. A fundamental part of any MR analysis is to choose appropriate genetic variants as instrumental variables. Current practice usually involves selecting only those genetic variants that are deemed to satisfy certain exclusion restrictions, in a bid to remove bias from unobserved confounding. Many more genetic variants may violate these exclusion restrictions due to unknown pleiotropic effects (i.e. direct effects on the outcome not via the exposure), but their inclusion could increase the precision of causal effect estimates at the cost of allowing some bias. We explore how to optimally tackle this bias-variance trade-off by carefully choosing from many weak and locally invalid instruments. Specifically, we study a focused instrument selection approach for publicly available two-sample summary data on genetic associations, whereby genetic variants are selected on the basis of how they impact the asymptotic mean square error of causal effect estimates. We show how different restrictions on the nature of pleiotropic effects have important implications for the quality of post-selection inferences. In particular, a focused selection approach under systematic pleiotropy allows for consistent model selection, but in practice can be susceptible to winner's curse biases. Whereas a more general form of idiosyncratic pleiotropy allows only conservative model selection, but offers uniformly valid confidence intervals. We propose a novel method to tighten honest confidence intervals through support restrictions on pleiotropy. We apply our results to several real data examples which suggest that the optimal selection of instruments does not only involve biologically-justified valid instruments, but additionally hundreds of potentially pleiotropic variants.

【34】 Novel Semi-parametric Tobit Additive Regression Models 标题:一种新的半参数Tobit加性回归模型

作者:Hailin Huang 机构:Department of Statistics, George Washington University, Washington DC 备注:ICDATA 2021 链接:https://arxiv.org/abs/2107.01497 摘要:回归方法被广泛应用于研究因变量和自变量之间的关系。在实践中,经常会出现删失和缺失数据等问题。当响应变量为(固定)删失时,Tobit回归模型被广泛用于研究响应变量与协变量之间的关系。本文将传统的参数Tobit模型推广为一种新的半参数回归模型:把Tobit模型中的线性分量替换为非参数可加分量,称之为Tobit可加模型,并提出了一种基于似然的Tobit可加模型估计方法。该方法计算效率高,易于实现。我们通过数值实验评估了有限样本性能。即使样本量相对较小,该估计方法在有限样本实验中也表现良好。 摘要:Regression method has been widely used to explore relationship between dependent and independent variables. In practice, data issues such as censoring and missing data often exist. When the response variable is (fixed) censored, Tobit regression models have been widely employed to explore the relationship between the response variable and covariates. In this paper, we extend conventional parametric Tobit models to a novel semi-parametric regression model by replacing the linear components in Tobit models with nonparametric additive components, which we refer as Tobit additive models, and propose a likelihood based estimation method for Tobit additive models. The proposed estimation method is computationally efficient and easy to implement. Numerical experiments are conducted to evaluate the finite sample performance. The estimation method works well in finite sample experiments, even when sample size is relatively small.
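下面是一个线性参数Tobit模型(左删失于0)的极大似然估计骨架,用以说明摘要中“(固定)删失响应”的似然构造;原文的Tobit可加模型是把其中的线性项 X @ beta 换成非参数可加分量。数据生成与参数均为本文为演示而设的假设。

```python
# 线性 Tobit(左删失于 0)的负对数似然与 MLE 示意(数据与参数为假设)
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true, sigma_true = np.array([1.0, -0.5]), 1.0
y_star = X @ beta_true + sigma_true * rng.normal(size=n)
y = np.maximum(y_star, 0.0)            # 固定删失:只观测到 max(y*, 0)

def neg_loglik(theta):
    beta, log_sigma = theta[:p], theta[p]
    sigma = np.exp(log_sigma)          # 对数参数化保证 sigma > 0
    mu = X @ beta
    ll = np.where(
        y <= 0,
        stats.norm.logcdf(-mu / sigma),              # 删失观测:P(y* <= 0)
        stats.norm.logpdf(y, loc=mu, scale=sigma),   # 未删失观测:正态密度
    )
    return -ll.sum()

res = optimize.minimize(neg_loglik, x0=np.zeros(p + 1), method="BFGS")
print("beta_hat =", res.x[:p], " sigma_hat =", np.exp(res.x[p]))
```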

【35】 Assessing contribution of treatment phases through tipping point analyses via counterfactual elicitation using rank preserving structural failure time models 标题:利用保秩结构失效时间模型、经由反事实推断的临界点分析评估治疗阶段的贡献

作者:Sudipta Bhattacharya,Jyotirmoy Dey 机构:Statistics and Quantitative Sciences, Takeda, Cambridge, MA, USA; Haematology, Regeneron, NY, USA 备注:38 pages, 6 figures, 3 tables. arXiv admin note: text overlap with arXiv:2011.09070 链接:https://arxiv.org/abs/2107.01480 摘要:本文提供了一种新方法,通过使用保秩结构失效时间(RPSFT)模型对时间-事件终点进行临界点分析(TPA),来评估治疗方案中特定治疗阶段的重要性。在肿瘤学临床研究中,实验性治疗常常在多个治疗阶段被添加到标准治疗中,以改善患者预后。当由此产生的新方案相比标准治疗带来有意义的获益时,洞察每个治疗阶段的贡献对于正确指导临床实践就变得十分重要。由于传统方法不足以回答这些问题,因此需要新的统计方法。RPSFT建模是一种因果推断方法,通常用于在以时间-事件为终点的随机临床试验中调整治疗转换。临界点分析通常用于以下情况:在统计上显著的治疗效果被怀疑是缺失或未观测数据的产物,而非真正的治疗差异。本文提出的方法是这两种思想的结合,用以研究包含多个治疗阶段的方案中特定组成部分的贡献。我们提供了该方法的不同变体,并构建了治疗阶段对方案整体获益的贡献指数,以便于解释结果。我们用最近结束的一项真实3期癌症临床试验的结果来说明所提出的方法。最后,我们对这一新方法的实际实施提出了若干考虑和建议。 摘要:This article provides a novel approach to assess the importance of specific treatment phases within a treatment regimen through tipping point analyses (TPA) of a time-to-event endpoint using rank-preserving-structural-failure-time (RPSFT) modelling. In oncology clinical research, an experimental treatment is often added to the standard of care therapy in multiple treatment phases to improve patient outcomes. When the resulting new regimen provides a meaningful benefit over standard of care, gaining insights into the contribution of each treatment phase becomes important to properly guide clinical practice. New statistical approaches are needed since traditional methods are inadequate in answering such questions. RPSFT modelling is an approach for causal inference, typically used to adjust for treatment switching in randomized clinical trials with time-to-event endpoints. A tipping-point analysis is commonly used in situations where a statistically significant treatment effect is suspected to be an artifact of missing or unobserved data rather than a real treatment difference. The methodology proposed in this article is an amalgamation of these two ideas to investigate the contribution of a specific component of a regimen comprising multiple treatment phases. We provide different variants of the method and construct indices of contribution of a treatment phase to the overall benefit of a regimen that facilitates interpretation of results. The proposed approaches are illustrated with findings from a recently concluded, real-life phase 3 cancer clinical trial. We conclude with several considerations and recommendations for practical implementation of this new methodology.

【36】 Slope and generalization properties of neural networks 标题:神经网络的斜率和泛化性质

作者:Anton Johansson,Niklas Engsner,Claes Strannegård,Petter Mostad 机构:Chalmers University of Technology 链接:https://arxiv.org/abs/2107.01473 摘要:神经网络是非常成功的工具,例如用于高级分类任务。从统计学的角度来看,拟合神经网络可以看作一种回归:我们寻找一个从输入空间到分类概率空间的函数,它遵循数据的“总体”形状,同时通过避免记忆个别数据点来避免过拟合。在统计学中,这可以通过控制回归函数的几何复杂性来实现。我们建议在拟合神经网络时通过控制网络的斜率来做类似的事情。在定义了斜率并讨论了它的一些理论性质之后,我们用ReLU网络的例子从经验上证明:经过良好训练的神经网络分类器的斜率分布通常与全连接网络中的层宽度无关,并且总体上,分布的均值对模型结构的依赖性很弱。在整个相关体积内,斜率大小相近且变化平滑,其行为也与重缩放示例中的预测一致。我们讨论了斜率概念的可能应用,例如在网络训练过程中将其作为损失函数或停止准则的一部分,或用于按复杂性对数据集进行排序。 摘要:Neural networks are very successful tools in for example advanced classification. From a statistical point of view, fitting a neural network may be seen as a kind of regression, where we seek a function from the input space to a space of classification probabilities that follows the "general" shape of the data, but avoids overfitting by avoiding memorization of individual data points. In statistics, this can be done by controlling the geometric complexity of the regression function. We propose to do something similar when fitting neural networks by controlling the slope of the network. After defining the slope and discussing some of its theoretical properties, we go on to show empirically in examples, using ReLU networks, that the distribution of the slope of a well-trained neural network classifier is generally independent of the width of the layers in a fully connected network, and that the mean of the distribution only has a weak dependence on the model architecture in general. The slope is of similar size throughout the relevant volume, and varies smoothly. It also behaves as predicted in rescaling examples. We discuss possible applications of the slope concept, such as using it as a part of the loss function or stopping criterion during network training, or ranking data sets in terms of their complexity.

【37】 Scale Mixtures of Neural Network Gaussian Processes 标题:神经网络高斯过程的尺度混合

作者:Hyungi Lee,Eunggu Yun,Hongseok Yang,Juho Lee 机构:Graduate School of Artificial Intelligence, KAIST, South Korea, School of Computing, KAIST, South Korea, AITRICS, Seoul, South Korea 链接:https://arxiv.org/abs/2107.01408 摘要:最近的工作表明,任意结构的无限宽前馈或递归神经网络对应于被称为$\mathrm{NNGP}$的高斯过程。虽然这些工作大大扩展了收敛到高斯过程的神经网络类,但很少有人关注如何扩展这类神经网络所收敛到的随机过程类。在这项工作中,受高斯随机变量尺度混合的启发,我们提出了$\mathrm{NNGP}$的尺度混合,为此我们在最后一层参数的尺度上引入先验分布。我们证明,只要在最后一层参数上引入一个尺度先验,就可以将任意结构的无限宽神经网络转化为更丰富的一类随机过程。特别地,在某些尺度先验下,我们得到了重尾随机过程,并在逆Gamma先验下恢复了Student-$t$过程。我们进一步分析了用我们的先验设置初始化并用梯度下降训练的神经网络的分布,得到了与$\mathrm{NNGP}$类似的结果。我们提出了一个实用的$\mathrm{NNGP}$尺度混合后验推断算法,并通过实验证明了它在回归和分类任务中的有效性。 摘要:Recent works have revealed that infinitely-wide feed-forward or recurrent neural networks of any architecture correspond to Gaussian processes referred to as $\mathrm{NNGP}$. While these works have extended the class of neural networks converging to Gaussian processes significantly, however, there has been little focus on broadening the class of stochastic processes that such neural networks converge to. In this work, inspired by the scale mixture of Gaussian random variables, we propose the scale mixture of $\mathrm{NNGP}$ for which we introduce a prior distribution on the scale of the last-layer parameters. We show that simply introducing a scale prior on the last-layer parameters can turn infinitely-wide neural networks of any architecture into a richer class of stochastic processes. Especially, with certain scale priors, we obtain heavy-tailed stochastic processes, and we recover Student's $t$ processes in the case of inverse gamma priors. We further analyze the distributions of the neural networks initialized with our prior setting and trained with gradient descents and obtain similar results as for $\mathrm{NNGP}$. We present a practical posterior-inference algorithm for the scale mixture of $\mathrm{NNGP}$ and empirically demonstrate its usefulness on regression and classification tasks.
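下面用一维数值示意摘要的核心结论之一:对(最后一层参数的)尺度放置逆Gamma先验时,高斯的尺度混合给出重尾的Student-$t$边际。具体自由度与样本量为本文假设,仅作演示。

```python
# 示意:sigma^2 ~ InvGamma(nu/2, nu/2), x | sigma^2 ~ N(0, sigma^2) => x ~ t_nu
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
nu, m = 5.0, 200_000
sigma2 = stats.invgamma(a=nu / 2, scale=nu / 2).rvs(size=m, random_state=rng)
x = rng.normal(size=m) * np.sqrt(sigma2)     # 高斯的尺度混合

qs = [0.9, 0.99, 0.999]                      # 与 t_nu 的理论分位数对比,验证重尾
print("混合样本分位数:", np.quantile(x, qs))
print("t_nu 理论分位数:", stats.t(df=nu).ppf(qs))
```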

【38】 Proportional mean model for panel count data with multiple modes of recurrence 标题:具有多种递推方式的面板计数数据的比例均值模型

作者:Sreedevi E. P.,Sankaran P. G. 机构: SNGS College, Pattambi., Cochin University of Science and Technology, Cochin. 备注:25 pages, 2 figures 链接:https://arxiv.org/abs/2107.01388 摘要:当受试者暴露于仅在离散时间点被观测到的反复事件时,面板计数数据是常见的。在本文中,我们考虑具有多种复发模式的面板计数数据的回归分析。我们提出了一个比例均值模型,用以估计协变量对由不同复发模式引起的潜在计数过程的影响。详细研究了$(k>1)$种复发模式的基线累积均值函数和回归参数的同时估计问题,并建立了所提估计量的渐近性质。通过Monte Carlo模拟研究验证了所提估计量的有限样本表现。该方法被应用于来自皮肤癌化学预防试验的真实数据。 摘要:Panel count data is common when the study subjects are exposed to recurrent events, observed only at discrete time points. In this article, we consider the regression analysis of panel count data with multiple modes of recurrence. We propose a proportional mean model to estimate the effect of covariates on the underlying counting process due to different modes of recurrence. The simultaneous estimation of baseline cumulative mean functions and regression parameters of $(k>1)$ recurrence modes are studied in detail. Asymptotic properties of the proposed estimators are also established. A Monte Carlo simulation study is carried out to validate the finite sample behaviour of the proposed estimators. The methods are applied to a real data arising from skin cancer chemoprevention trial.

【39】 Sibling Regression for Generalized Linear Models 标题:广义线性模型的同胞回归

作者:Shiv Shankar,Daniel Sheldon 机构: University of Massachusetts, Amherst, MA , USA, Mount Holyoke College, South Hadley, MA , USA 链接:https://arxiv.org/abs/2107.01338 摘要:野外观测是许多科学研究的基础,特别是在生态和社会科学方面。尽管努力以标准化的方式进行此类调查,但观测结果可能容易出现系统性的测量误差。如果可能的话,消除观测过程中引入的系统变异性,可以大大增加这些数据的价值。现有的非参数校正技术采用线性加性噪声模型。当应用于广义线性模型(GLM)时,这会导致有偏估计。我们提出了一种基于残差函数的方法来解决这个问题。然后,我们在合成数据上证明了它的有效性,并表明它降低了蛾类调查中的系统检测变异性。 摘要:Field observations form the basis of many scientific studies, especially in ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. The removal of systematic variability introduced by the observation process, if possible, can greatly increase the value of this data. Existing non-parametric techniques for correcting such errors assume linear additive noise models. This leads to biased estimates when applied to generalized linear models (GLM). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show it reduces systematic detection variability in moth surveys.

【40】 A Uniformly Consistent Estimator of non-Gaussian Causal Effects Under the k-Triangle-Faithfulness Assumption 标题:k-三角忠实性假设下非高斯因果效应的一致相合估计量

作者:Shuyan Wang,Peter Spirtes 链接:https://arxiv.org/abs/2107.01333 摘要:Kalisch和Bühlmann(2007)表明,对于线性高斯模型,在因果马尔可夫假设、强因果忠实性假设和因果充分性假设下,PC算法是真因果DAG的马尔可夫等价类的一致相合估计量;由此可知,对于Markov等价类中可识别的因果效应,也存在一致相合的因果效应估计量。$k$-三角忠实性假设是一个严格更弱的假设,它避免了强因果忠实性假设的一些不合理推论,并且允许(在较弱的意义上)一致相合地估计马尔可夫等价类以及可识别的因果效应。然而,这两个假设都局限于线性高斯模型。我们提出了可应用于任何光滑分布的广义$k$-三角忠实性假设。此外,在该假设下,我们描述了一种边估计算法(Edge Estimation Algorithm),它在某些情况下提供一致相合的因果效应估计(在其他情况下输出“无法判断”),以及\textit{Very Conservative }$SGS$算法,后者(在稍弱的意义上)是真DAG的Markov等价类的一致相合估计量。 摘要:Kalisch and B\"{u}hlmann (2007) showed that for linear Gaussian models, under the Causal Markov Assumption, the Strong Causal Faithfulness Assumption, and the assumption of causal sufficiency, the PC algorithm is a uniformly consistent estimator of the Markov Equivalence Class of the true causal DAG for linear Gaussian models; it follows from this that for the identifiable causal effects in the Markov Equivalence Class, there are uniformly consistent estimators of causal effects as well. The $k$-Triangle-Faithfulness Assumption is a strictly weaker assumption that avoids some implausible implications of the Strong Causal Faithfulness Assumption and also allows for uniformly consistent estimates of Markov Equivalence Classes (in a weakened sense), and of identifiable causal effects. However, both of these assumptions are restricted to linear Gaussian models. We propose the Generalized $k$-Triangle Faithfulness, which can be applied to any smooth distribution. In addition, under the Generalized $k$-Triangle Faithfulness Assumption, we describe the Edge Estimation Algorithm that provides uniformly consistent estimates of causal effects in some cases (and otherwise outputs "can't tell"), and the \textit{Very Conservative }$SGS$ Algorithm that (in a slightly weaker sense) is a uniformly consistent estimator of the Markov equivalence class of the true DAG.

【41】 Minimum Wasserstein Distance Estimator under Finite Location-scale Mixtures 标题:有限位置-尺度混合下的最小Wasserstein距离估计

作者:Qiong Zhang,Jiahua Chen 机构:Department of Statistics, University of British Columbia 链接:https://arxiv.org/abs/2107.01323 摘要:当一个种群表现出异质性时,我们通常通过有限混合来建模:将其分解为若干不同但各自同质的亚种群。当代实践倾向于通过最大化似然来学习混合模型以获得统计效率,并借助便捷的EM算法进行数值计算。然而,特别是对于应用最广泛的有限正态混合,以及一般的有限位置-尺度混合,极大似然估计(MLE)并没有良好定义。因此,我们研究MLE的可行替代方案,如最小距离估计。最近,Wasserstein距离在机器学习界引起了越来越多的关注。它有直观的几何解释,并已成功用于许多新应用。通过最小Wasserstein距离估计器(MWDE)学习有限位置-尺度混合,我们能得到什么?本文从几个方面探讨了这种可能性。我们发现MWDE是相合的,并在有限位置-尺度混合下给出了数值求解方法。我们研究了它对离群值和轻度模型误设的鲁棒性。我们中等规模的模拟研究表明,MWDE相对于带惩罚的MLE通常有一定的效率损失,而鲁棒性并无明显提高。我们重申了基于似然的学习策略的普遍优越性,即使对于非正则的有限位置-尺度混合也是如此。 摘要:When a population exhibits heterogeneity, we often model it via a finite mixture: decompose it into several different but homogeneous subpopulations. Contemporary practice favors learning the mixtures by maximizing the likelihood for statistical efficiency and the convenient EM-algorithm for numerical computation. Yet the maximum likelihood estimate (MLE) is not well defined for the most widely used finite normal mixture in particular and for finite location-scale mixture in general. We hence investigate feasible alternatives to MLE such as minimum distance estimators. Recently, the Wasserstein distance has drawn increased attention in the machine learning community. It has intuitive geometric interpretation and is successfully employed in many new applications. Do we gain anything by learning finite location-scale mixtures via a minimum Wasserstein distance estimator (MWDE)? This paper investigates this possibility in several respects. We find that the MWDE is consistent and derive a numerical solution under finite location-scale mixtures. We study its robustness against outliers and mild model mis-specifications. Our moderate scaled simulation study shows the MWDE suffers some efficiency loss against a penalized version of MLE in general without noticeable gain in robustness. We reaffirm the general superiority of the likelihood based learning strategies even for the non-regular finite location-scale mixtures.
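下面给出MWDE思想的一个最小示意:利用一维$W_1$距离的分位数表示,并用公共随机数生成模型样本来近似模型分位数,再用Nelder-Mead做数值最小化。两分量正态混合及全部实现细节均为本文的示意性假设,并非原文代码。

```python
# MWDE 示意:min_theta W1(F_n, F_theta),W1 用排序样本与模型分位数近似
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 2000
comp = rng.random(n) < 0.3                 # 真实混合:0.3*N(-2,1) + 0.7*N(2,0.5^2)
x = np.where(comp, rng.normal(-2, 1, n), rng.normal(2, 0.5, n))
x_sorted = np.sort(x)

m = 20000
u = rng.random(m)                          # 公共随机数:固定后目标函数对参数近似光滑
z = rng.normal(size=m)

def w1(theta):
    w = 1 / (1 + np.exp(-theta[0]))        # logit 参数化保证权重在 (0,1)
    mu1, mu2 = theta[1], theta[2]
    s1, s2 = np.exp(theta[3]), np.exp(theta[4])
    sample = np.where(u < w, mu1 + s1 * z, mu2 + s2 * z)   # 模型样本
    q_model = np.quantile(sample, (np.arange(n) + 0.5) / n)
    return np.abs(x_sorted - q_model).mean()               # 一维 W1 的分位数近似

res = minimize(w1, x0=[0.0, -1.0, 1.0, 0.0, 0.0], method="Nelder-Mead",
               options={"maxiter": 5000})
w_hat = 1 / (1 + np.exp(-res.x[0]))
print("w_hat =", w_hat, " mu =", res.x[1:3], " sigma =", np.exp(res.x[3:5]))
```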

【42】 The Effect of the Prior and the Experimental Design on the Inference of the Precision Matrix in Gaussian Chain Graph Models 标题:先验和试验设计对高斯链图模型精度矩阵推断的影响

作者:Yunyi Shen,Claudia Solis-Lemus 机构:Department of Statistics, University of Wisconsin-Madison, Madison, WI , Claudia Solís-Lemus∗, Wisconsin Institute for Discovery 链接:https://arxiv.org/abs/2107.01306 摘要:在这里,我们研究实验设计是否(以及如何)有助于高斯链图模型中精度矩阵的估计,特别是设计、实验效应和关于效应的先验知识之间的相互作用。我们利用Laplace近似,在不同先验下近似精度矩阵的边缘后验精度:平坦先验、共轭的Normal-Wishart先验、无混杂的Normal-矩阵广义逆高斯(Normal-MGIG)先验和一般独立先验。我们证明,对于Normal-Wishart先验和平坦先验,近似后验精度不是设计矩阵的函数,而对于Normal-MGIG先验和一般独立先验,它是设计矩阵的函数。然而,对于Normal-MGIG先验和一般独立先验,我们发现近似后验精度存在一个不涉及设计矩阵的尖锐上界,这相当于对给定实验所能提取信息量的一个界。我们通过模拟研究证实了理论结果,比较了随机实验与无实验(设计矩阵为零)之间的Stein损失差异。我们的发现为领域科学家提供了实用建议,帮助他们设计实验以解码多维响应与一组预测因子之间的关系。 摘要:Here, we investigate whether (and how) experimental design could aid in the estimation of the precision matrix in a Gaussian chain graph model, especially the interplay between the design, the effect of the experiment and prior knowledge about the effect. We approximate the marginal posterior precision of the precision matrix via Laplace approximation under different priors: a flat prior, the conjugate prior Normal-Wishart, the unconfounded prior Normal-Matrix Generalized Inverse Gaussian (MGIG) and a general independent prior. We show that the approximated posterior precision is not a function of the design matrix for the cases of the Normal-Wishart and flat prior, but it is for the cases of the Normal-MGIG and the general independent prior. However, for the Normal-MGIG and the general independent prior, we find a sharp upper bound on the approximated posterior precision that does not involve the design matrix which translates into a bound on the information that could be extracted from a given experiment. We confirm the theoretical findings via a simulation study comparing the Stein's loss difference between random versus no experiment (design matrix equal to zero). Our findings provide practical advice for domain scientists conducting experiments to decode the relationships between a multidimensional response and a set of predictors.

【43】 Maximum likelihood for high-noise group orbit estimation and single-particle cryo-EM 标题:高噪声群轨道估计的极大似然与单粒子冷冻电镜(cryo-EM)

作者:Zhou Fan,Roy R. Lederman,Yi Sun,Tianhao Wang,Sheng Xu 机构:Department of Statistics and Data Science, Yale University; University of Chicago 链接:https://arxiv.org/abs/2107.01305 摘要:受单粒子低温电子显微术(cryo-EM)应用的启发,我们研究了低信噪比条件下函数估计的几个问题,其中样本是在函数定义域的随机旋转下观测到的。在带线性投影的群轨道估计一般框架下,我们根据不变代数中的超越度序列描述了Fisher信息特征值的分层,并将对数似然景观的临界点与一系列矩方法优化问题联系起来。这扩展了此前关于无投影离散旋转群的结果。然后,我们针对$SO(2)$和$SO(3)$旋转下函数估计的几个例子——包括Bandeira、Blum-Smith、Kileel、Perry、Weed和Wein提出的cryo-EM简化模型——计算了这些超越度以及这些矩优化问题的形式。对于其中几个例子,我们肯定地解决了如下数值猜想:三阶矩足以将一般信号局部识别到其旋转轨道。对于两个小蛋白质分子电势图的低维近似,我们在无投影的$SO(3)$旋转模型中,经验地验证了Fisher信息特征值的噪声标度在一定信噪比范围内符合这些理论预测。 摘要:Motivated by applications to single-particle cryo-electron microscopy (cryo-EM), we study several problems of function estimation in a low SNR regime, where samples are observed under random rotations of the function domain. In a general framework of group orbit estimation with linear projection, we describe a stratification of the Fisher information eigenvalues according to a sequence of transcendence degrees in the invariant algebra, and relate critical points of the log-likelihood landscape to a sequence of method-of-moments optimization problems. This extends previous results for a discrete rotation group without projection. We then compute these transcendence degrees and the forms of these moment optimization problems for several examples of function estimation under $SO(2)$ and $SO(3)$ rotations, including a simplified model of cryo-EM as introduced by Bandeira, Blum-Smith, Kileel, Perry, Weed, and Wein. For several of these examples, we affirmatively resolve numerical conjectures that $3^\text{rd}$-order moments are sufficient to locally identify a generic signal up to its rotational orbit. For low-dimensional approximations of the electric potential maps of two small protein molecules, we empirically verify that the noise-scalings of the Fisher information eigenvalues conform with these theoretical predictions over a range of SNR, in a model of $SO(3)$ rotations without projection.

【44】 Optimizing ROC Curves with a Sort-Based Surrogate Loss Function for Binary Classification and Changepoint Detection 标题:基于排序的代理损失函数优化ROC曲线,用于二值分类与变点检测

作者:Jonathan Hillman,Toby Dylan Hocking 链接:https://arxiv.org/abs/2107.01285 摘要:受试者操作特征(ROC)曲线是真阳性率与假阳性率的曲线图,这对评估二元分类模型很有用,但由于曲线下面积(AUC)是非凸的,很难用于学习。ROC曲线也可用于其他具有假阳性率和真阳性率的问题,如变化点检测。我们证明,在这种更一般的情况下,ROC曲线可以有循环,具有高度次优错误率的点,并且AUC大于1。这一观察结果激发了一个新的优化目标:我们不希望AUC最大化,而是希望AUC=1的单调ROC曲线避免Min(FP,FN)值较大的点。我们提出了一个凸松弛的目标,结果在一个新的替代损失函数称为AUM,简称面积低于Min(FP,FN)。以前的损失函数是基于对所有标记的例子或对的求和,而AUM需要对ROC曲线上的点序列进行排序和求和。结果表明,在梯度下降学习算法中,可以有效地计算AUM的方向导数。在我们的有监督二值分类和变化点检测问题的实证研究中,我们证明了我们新的AUM最小化学习算法相对于以前的基线在AUC和可比较的速度上都有改进。 摘要:Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is non-convex. ROC curves can also be used in other problems that have false positive and true positive rates such as changepoint detection. We show that in this more general context, the ROC curve can have loops, points with highly sub-optimal error rates, and AUC greater than one. This observation motivates a new optimization objective: rather than maximizing the AUC, we would like a monotonic ROC curve with AUC=1 that avoids points with large values for Min(FP,FN). We propose a convex relaxation of this objective that results in a new surrogate loss function called the AUM, short for Area Under Min(FP, FN). Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show that AUM directional derivatives can be efficiently computed and used in a gradient descent learning algorithm. In our empirical study of supervised binary classification and changepoint detection problems, we show that our new AUM minimization learning algorithm results in improved AUC and comparable speed relative to previous baselines.
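依据摘要的定义可以写出AUM的一个离散示意实现:当阈值在相邻的排序预测值之间移动时,min(FP, FN)保持不变,因此AUM等于各区间上min(FP, FN)与区间长度乘积之和;这也体现了“一次排序加一次求和”的计算方式。实现细节为本文的假设,并非原文代码。

```python
# AUM(Area Under Min(FP, FN))的离散示意实现
import numpy as np

def aum(scores, labels):
    """阈值落在相邻排序预测值之间时 min(FP, FN) 为常数,
    AUM 即 min(FP, FN) 与相邻预测值间隔长度乘积之和。"""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=int)
    order = np.argsort(s)                      # 核心操作:一次排序
    s, y = s[order], y[order]
    fn = np.cumsum(y)[:-1]                     # 阈值之下被漏判的正例数
    fp = np.cumsum((1 - y)[::-1])[::-1][1:]    # 阈值之上被误报的负例数
    return float(np.sum(np.minimum(fp, fn) * np.diff(s)))

scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65])
labels = np.array([0, 0, 1, 1, 1])
print("AUM =", aum(scores, labels))            # 此例中只有一段阈值区间排错序
```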

【45】 Bayesian two-interval test 标题:贝叶斯双区间检验

作者:Nicolas Meyer,Erik-André Sauleau 机构:CHU de Strasbourg, GMRC Pôle de Santé Publique, F-, Strasbourg, France, Université de Strasbourg, CNRS, iCUBE UMR , F-, Strasbourg, France, Corresponding author 备注:49 pages, 7 figures, 4 tables 链接:https://arxiv.org/abs/2107.01271 摘要:零假设检验(NHT)被广泛用于验证科学假设,但实际上受到了高度批评。虽然贝叶斯检验克服了其中一些批评,但仍存在一些局限。我们提出了一种贝叶斯双区间检验(2IT),其中关于效应存在或不存在的两个假设被表示为预先指定的相连或不相交区间,并计算它们的后验概率。同样的形式也适用于优效性、非劣效性或等效性检验。我们在三个真实示例和三组模拟研究(一个比例与参考值的比较、一个均值与参考值的比较以及两个比例的比较)中考察了2IT。我们创建了若干场景(具有不同的样本量)并进行模拟,以计算在数据由其中一个假设生成的条件下,感兴趣的参数落入与任一假设对应区间内的概率。后验估计使用共轭的低信息先验获得,偏差也得到了估计。当某一假设为真时,接受该假设的概率随样本量增大而逐渐提高并趋向于1,而接受另一假设的概率始终很低(小于5%)并趋向于0。收敛速度随假设之间的间隔及其宽度而变化。在均值的情形下,偏差很小并很快变得可以忽略。我们提出的贝叶斯检验遵循科学上合理的流程,其中两个区间假设被显式地使用和检验。所提出的检验几乎不具有NHT的各种局限,并带来新的特性,例如为偶然发现(serendipity)提供依据,或为“数据趋势”提供正当理由。2IT的概念框架还允许计算样本量,并在许多情境中使用序贯方法。 摘要:The null hypothesis test (NHT) is widely used for validating scientific hypotheses but is actually highly criticized. Although Bayesian tests overcome several criticisms, some limits remain. We propose a Bayesian two-interval test (2IT) in which two hypotheses on an effect being present or absent are expressed as prespecified joint or disjoint intervals and their posterior probabilities are computed. The same formalism can be applied for superiority, non-inferiority, or equivalence tests. The 2IT was studied for three real examples and three sets of simulations (comparison of a proportion and a mean to a reference and comparison of two proportions). Several scenarios were created (with different sample sizes), and simulations were conducted to compute the probabilities of the parameter of interest being in the interval corresponding to either hypothesis given the data generated under one of the hypotheses. Posterior estimates were obtained using conjugacy with a low-informative prior. Bias was also estimated. The probability of accepting a hypothesis when that hypothesis is true progressively increases with the sample size, tending towards 1, while the probability of accepting the other hypothesis is always very low (less than 5%) and tends towards 0. The speed of convergence varies with the gap between the hypotheses and with their width. In the case of a mean, the bias is low and rapidly becomes negligible. We propose a Bayesian test that follows a scientifically sound process, in which two interval hypotheses are explicitly used and tested. The proposed test has almost none of the limitations of the NHT and suggests new features, such as a rationale for serendipity or a justification for a "trend in data". The conceptual framework of the 2-IT also allows the calculation of a sample size and the use of sequential methods in numerous contexts.
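下面是2IT在“单个比例与参考值比较”情形下的一个最小示意:取共轭的低信息先验Beta(1,1),分别计算参数落入两个预先指定区间的后验概率。区间端点与数据均为本文假设的示例值。

```python
# 贝叶斯双区间检验(2IT)示意:比例的共轭 Beta 后验落入两个预设区间的概率
from scipy import stats

k, n = 18, 60                         # 观察到 60 例中 18 例事件(假设数据)
H0 = (0.00, 0.25)                     # “效应不存在”区间(假设端点)
H1 = (0.35, 1.00)                     # “效应存在”区间,两区间不相交
post = stats.beta(1 + k, 1 + n - k)   # Beta(1,1) 先验下的共轭后验

p0 = post.cdf(H0[1]) - post.cdf(H0[0])
p1 = post.cdf(H1[1]) - post.cdf(H1[0])
print(f"P(p in H0 | data) = {p0:.3f}")
print(f"P(p in H1 | data) = {p1:.3f}")
print(f"剩余概率(落入两区间之间的不确定区)= {1 - p0 - p1:.3f}")
```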

【46】 Asymptotic Statistical Analysis of Sparse Group LASSO via Approximate Message Passing Algorithm 标题:基于近似消息传递算法的稀疏群套索渐近统计分析

作者:Kan Chen,Zhiqi Bu,Shiyun Xu 机构:Graduate Group of Applied Mathematics and Computational Science, School of Arts, and Sciences, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A. 链接:https://arxiv.org/abs/2107.01266 摘要:稀疏群LASSO(SGL)是一种求解具有分组协变量的高维线性回归问题的正则化模型。SGL分别对个体预测因子和组预测因子施加$l_1$和$l_2$惩罚,以保证在组间和组内两个层面上的稀疏效应。本文应用近似消息传递(AMP)算法有效地求解高斯随机设计下的SGL问题。我们进一步利用最近发展起来的AMP状态演化分析,得到了SGL解的渐近精确刻画。这使我们能够对SGL进行多项细粒度的统计分析,借此研究组信息和$\gamma$($\ell_1$惩罚所占比例)的影响。通过多种性能度量,我们表明,小$\gamma$的SGL能显著受益于组信息,并且在信号恢复率、错误发现率和均方误差方面能优于其他SGL(包括LASSO)或不利用组信息的正则化模型。 摘要:Sparse Group LASSO (SGL) is a regularized model for high-dimensional linear regression problems with grouped covariates. SGL applies $l_1$ and $l_2$ penalties on the individual predictors and group predictors, respectively, to guarantee sparse effects both on the inter-group and within-group levels. In this paper, we apply the approximate message passing (AMP) algorithm to efficiently solve the SGL problem under Gaussian random designs. We further use the recently developed state evolution analysis of AMP to derive an asymptotically exact characterization of SGL solution. This allows us to conduct multiple fine-grained statistical analyses of SGL, through which we investigate the effects of the group information and $\gamma$ (proportion of $\ell_1$ penalty). With the lens of various performance measures, we show that SGL with small $\gamma$ benefits significantly from the group information and can outperform other SGL (including LASSO) or regularized models which do not exploit the group information, in terms of the recovery rate of signal, false discovery rate and mean squared error.
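作为说明,下面给出稀疏群LASSO惩罚$\lambda_1\|\cdot\|_1+\lambda_2\|\cdot\|_2$(按组)的标准近端算子:先做逐元素软阈值,再做组级收缩。它可充当AMP或近端梯度迭代中的去噪步骤;示例中的分组与阈值均为假设。

```python
# 稀疏群 LASSO 近端算子:逐元素 l1 软阈值 + 组级 l2 收缩
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgl(v, groups, lam1, lam2):
    out = np.zeros_like(v)
    for g in groups:                                   # g 为该组的下标数组
        u = soft(v[g], lam1)                           # 组内逐元素软阈值
        norm = np.linalg.norm(u)
        if norm > 0:
            out[g] = max(0.0, 1.0 - lam2 / norm) * u   # 组级收缩(可整组置零)
    return out

v = np.array([0.2, -1.5, 3.0, 0.1, -0.05, 0.15])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
print(prox_sgl(v, groups, lam1=0.3, lam2=0.5))         # 第二组整体被收缩为零
```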

【47】 Uncertainty in Lung Cancer Stage for Outcome Estimation via Set-Valued Classification 标题:集值分类在肺癌分期预后估计中的不确定性

作者:Savannah Bergquist,Gabriel Brooks,Mary Beth Landrum,Nancy Keating,Sherri Rose 机构:Haas School of Business, University of, California, Berkeley, CA, United States, The Dartmouth Institute for Health Policy, and Clinical Practice, Geisel School of, Medicine, NH, United States, Department of Health Care Policy, Harvard 备注:Code available at: this https URL 链接:https://arxiv.org/abs/2107.01251 摘要:在医疗索赔数据中很难确定癌症分期,这限制了肿瘤医疗质量和健康结局的研究。我们利用索赔数据拟合了将肺癌分期划分为三类(I/II期、III期和IV期)的预测算法,然后展示了一种将分类不确定性纳入结局估计的方法。利用集值分类和分裂保形推断(split conformal inference),我们展示了在一个数据队列中开发的固定算法如何部署到另一个队列,同时严格刻画初始分类步骤带来的不确定性。我们使用与医疗保险索赔数据关联的SEER癌症登记数据演示了这一流程。 摘要:Difficulty in identifying cancer stage in health care claims data has limited oncology quality of care and health outcomes research. We fit prediction algorithms for classifying lung cancer stage into three classes (stages I/II, stage III, and stage IV) using claims data, and then demonstrate a method for incorporating the classification uncertainty in outcomes estimation. Leveraging set-valued classification and split conformal inference, we show how a fixed algorithm developed in one cohort of data may be deployed in another, while rigorously accounting for uncertainty from the initial classification step. We demonstrate this process using SEER cancer registry data linked with Medicare claims data.

【48】 Truncated Marginal Neural Ratio Estimation 标题:截断边缘神经比估计

作者:Benjamin Kurt Miller,Alex Cole,Patrick Forré,Gilles Louppe,Christoph Weniger 机构:University of Amsterdam, University of Liège 备注:9 pages. 23 pages with references and supplemental material. Code available at this http URL Underlying library this http URL 链接:https://arxiv.org/abs/2107.01214 摘要:参数随机模拟器在科学中普遍存在,通常具有高维输入参数和/或难以处理的似然。在这种情况下进行贝叶斯参数推断可能是一个挑战。我们提出了一种基于神经网络、面向模拟器的推断算法,它同时具备模拟高效与可快速检验经验后验两个优点,这在现代算法中是独一无二的。我们的方法通过同时估计低维边缘后验(而非联合后验),并借助被指示函数适当截断的先验来提出针对目标观测的模拟,从而实现模拟上的高效。此外,通过估计一个局部摊销的后验,我们的算法能够对推断结果的稳健性进行高效的经验检验。此类检验对于在缺乏已知真值的真实应用中对推断做合理性检查非常重要。我们在一个边缘化版本的基于模拟的推断基准以及两个复杂而狭窄的后验上进行了实验,突出了我们算法的模拟效率以及所估计边缘后验的质量。实现已发布在GitHub上。 摘要:Parametric stochastic simulators are ubiquitous in science, often featuring high-dimensional input parameters and/or an intractable likelihood. Performing Bayesian parameter inference in this context can be challenging. We present a neural simulator-based inference algorithm which simultaneously offers simulation efficiency and fast empirical posterior testability, which is unique among modern algorithms. Our approach is simulation efficient by simultaneously estimating low-dimensional marginal posteriors instead of the joint posterior and by proposing simulations targeted to an observation of interest via a prior suitably truncated by an indicator function. Furthermore, by estimating a locally amortized posterior our algorithm enables efficient empirical tests of the robustness of the inference results. Such tests are important for sanity-checking inference in real-world applications, which do not feature a known ground truth. We perform experiments on a marginalized version of the simulation-based inference benchmark and two complex and narrow posteriors, highlighting the simulator efficiency of our algorithm as well as the quality of the estimated marginal posteriors. Implementation on GitHub.

【49】 Feature Cross Search via Submodular Optimization 标题:基于子模块优化的特征交叉搜索

作者:Lin Chen,Hossein Esfandiari,Gang Fu,Vahab S. Mirrokni,Qian Yu 备注:Accepted to ESA 2021. Authors are ordered alphabetically 链接:https://arxiv.org/abs/2107.02139 摘要:本文将特征交叉搜索作为特征工程中的一个基本原语进行研究。特征交叉搜索的重要性,特别是对线性模型的重要性,早已为人所知,并有一些著名的教科书例子。在这个问题中,目标是选择一小部分特征,通过考虑它们的笛卡尔积将其组合成一个新特征(称为交叉特征),并找到能学习出\emph{精确}模型的特征交叉。特别地,我们研究了在交叉特征列上训练的线性模型的归一化曲线下面积(AUC)最大化问题。首先,我们证明了除非指数时间假设不成立,否则不可能为这个问题提供一个$n^{1/\log\log n}$-近似算法。这个结果也排除了在多项式时间内解决该问题的可能性,除非$\mathsf{P}=\mathsf{NP}$。在积极的方面,在朴素(naive)假设下,我们证明了这个问题存在一个简单的贪心$(1-1/e)$-近似算法。这一结果是通过将AUC与两个概率测度的交换子的总变差联系起来,并证明该交换子的总变差是单调且子模的而建立的。为了证明这一点,我们把该函数的子模性与相应核矩阵的半正定性联系起来;然后,通过证明其逆Fourier变换处处非负,利用Bochner定理证明了半正定性。我们的技术和结构性结果可能具有独立的价值。 摘要:In this paper, we study feature cross search as a fundamental primitive in feature engineering. The importance of feature cross search especially for the linear model has been known for a while, with well-known textbook examples. In this problem, the goal is to select a small subset of features, combine them to form a new feature (called the crossed feature) by considering their Cartesian product, and find feature crosses to learn an \emph{accurate} model. In particular, we study the problem of maximizing a normalized Area Under the Curve (AUC) of the linear model trained on the crossed feature column. First, we show that it is not possible to provide an $n^{1/\log\log n}$-approximation algorithm for this problem unless the exponential time hypothesis fails. This result also rules out the possibility of solving this problem in polynomial time unless $\mathsf{P}=\mathsf{NP}$. On the positive side, by assuming the na\"ive assumption, we show that there exists a simple greedy $(1-1/e)$-approximation algorithm for this problem. This result is established by relating the AUC to the total variation of the commutator of two probability measures and showing that the total variation of the commutator is monotone and submodular. To show this, we relate the submodularity of this function to the positive semi-definiteness of a corresponding kernel matrix. Then, we use Bochner's theorem to prove the positive semi-definiteness by showing that its inverse Fourier transform is non-negative everywhere. Our techniques and structural results might be of independent interest.

【50】 Subset Privacy: Draw from an Obfuscated Urn 标题:子集隐私:从经过模糊处理的瓮中抽取

作者:Ganghua Wang,Jie Ding 机构: Ding are with the School of Statistics, University of Minnesota 链接:https://arxiv.org/abs/2107.02013 摘要:随着个人数据收集和分析能力的迅速提高,数据隐私成为一个新的关注点。在这项工作中,我们发展了一种新的局部隐私统计概念,以保护将被不可信实体收集的每一条分类数据。所提出的解决方案称为子集隐私,它通过将原始数据值替换为一个包含该值的随机子集来实现私有化。我们发展了基于子集私有化数据的分布函数估计方法与独立性检验方法,并提供了理论保证。我们还研究了实现子集隐私的不同机制,以及在实践中量化隐私量的评估指标。在模拟和真实数据集上的实验结果都表明,所提出的概念和方法具有令人鼓舞的性能。 摘要:With the rapidly increasing ability to collect and analyze personal data, data privacy becomes an emerging concern. In this work, we develop a new statistical notion of local privacy to protect each categorical data that will be collected by untrusted entities. The proposed solution, named subset privacy, privatizes the original data value by replacing it with a random subset containing that value. We develop methods for the estimation of distribution functions and independence testing from subset-private data with theoretical guarantees. We also study different mechanisms to realize the subset privacy and evaluation metrics to quantify the amount of privacy in practice. Experimental results on both simulated and real-world datasets demonstrate the encouraging performance of the developed concepts and methods.
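下面用几行代码示意子集隐私机制的基本形态:把类别取值$x$替换为一个必然包含$x$的随机子集。这里采用“固定大小$k$的均匀随机子集”这一最简单的实现;原文考虑的机制族更一般,$k$等细节均为本文假设。

```python
# 子集隐私机制示意:上报一个包含真值的大小为 k 的随机子集(k 为假设参数)
import numpy as np

def subset_privatize(x, categories, k, rng):
    others = [c for c in categories if c != x]
    idx = rng.choice(len(others), size=k - 1, replace=False)   # 随机补齐 k-1 个其他类别
    return frozenset([x] + [others[i] for i in idx])

rng = np.random.default_rng(4)
categories = ["A", "B", "AB", "O"]
reports = [subset_privatize("O", categories, k=2, rng=rng) for _ in range(5)]
print(reports)   # 每次上报的都是包含真值 "O" 的大小为 2 的子集
```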

【51】 Fast and Scalable Optimal Transport for Brain Tractograms 标题:快速可扩展的脑电地形图优化传输

作者:Jean Feydy,Pierre Roussillon,Alain Trouvé,Pietro Gori 机构: CMLA, ENS Paris-Saclay, France, DMA, ´Ecole Normale Sup´erieure, Paris, France, LTCI, T´el´ecom ParisTech, Institut Mines T´el´ecom, Paris, France 备注:None 链接:https://arxiv.org/abs/2107.02010 摘要:我们提出了一种新的多尺度算法,在GPU上以线性内存占用求解正则化最优传输问题。该方法依托凸、光滑且正定的Sinkhorn散度损失函数,能够在几分钟内计算数百万个点之间的传输方案。我们在建模为纤维束或径迹密度图的脑纤维束图上展示了该方法的有效性。我们使用得到的平滑分配来执行基于图谱(atlas)的纤维束图分割的标签迁移。我们方法的参数——blur与reach——具有明确含义,分别定义了两条纤维相互比较的最小与最大距离,可根据解剖学知识进行设置。此外,我们还提出以Wasserstein重心的形式估计径迹密度图总体的概率图谱。我们的CUDA实现配有用户友好的PyTorch接口,可在PyPI仓库(pip install geomloss)和 www.kernel-operations.io/geomloss 免费获取。 摘要:We present a new multiscale algorithm for solving regularized Optimal Transport problems on the GPU, with a linear memory footprint. Relying on Sinkhorn divergences which are convex, smooth and positive definite loss functions, this method enables the computation of transport plans between millions of points in a matter of minutes. We show the effectiveness of this approach on brain tractograms modeled either as bundles of fibers or as track density maps. We use the resulting smooth assignments to perform label transfer for atlas-based segmentation of fiber tractograms. The parameters -- blur and reach -- of our method are meaningful, defining the minimum and maximum distance at which two fibers are compared with each other. They can be set according to anatomical knowledge. Furthermore, we also propose to estimate a probabilistic atlas of a population of track density maps as a Wasserstein barycenter. Our CUDA implementation is endowed with a user-friendly PyTorch interface, freely available on the PyPi repository (pip install geomloss) and at www.kernel-operations.io/geomloss.
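按摘要给出的用户接口,下面是一个最小的使用示意(pip install geomloss,依赖PyTorch)。blur与reach分别对应比较两条纤维时的最小/最大距离这一解剖学含义;调用方式依据GeomLoss的公开接口书写,具体参数取值为本文假设。

```python
# GeomLoss 的 Sinkhorn 散度最小使用示意(点云与参数取值均为假设)
import torch
from geomloss import SamplesLoss

x = torch.randn(2000, 3)            # 两组 3D 点云,示意纤维/径迹坐标
y = torch.randn(3000, 3) + 0.5

# blur:比较的最小尺度;reach:比较的最大距离(非平衡 OT)
loss = SamplesLoss(loss="sinkhorn", p=2, blur=0.05, reach=2.0)
print("Sinkhorn divergence:", loss(x, y).item())
```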

【52】 Machine Learning for Fraud Detection in E-Commerce: A Research Agenda 标题:用于电子商务欺诈检测的机器学习:研究议程

作者:Niek Tax,Kees Jan de Vries,Mathijs de Jong,Nikoleta Dosoula,Bram van den Akker,Jon Smith,Olivier Thuong,Lucas Bernardi 备注:Accepted and to appear in the proceedings of the KDD 2021 co-located workshop: the 2nd International Workshop on Deployable Machine Learning for Security Defense (MLHat) 链接:https://arxiv.org/abs/2107.01979 摘要:欺诈检测和预防对于确保任何电子商务业务的持续运营都起着重要的作用。机器学习(ML)在这些反欺诈操作中经常扮演着重要的角色,但是这些机器学习模型所处的组织环境是不容忽视的。在本文中,我们以组织为中心的观点,通过建立电子商务组织中反欺诈部门的运作模型来探讨欺诈检测的问题。我们从这个操作模型中导出了欺诈检测的6个研究主题和12个实际挑战。我们总结了每个研究主题的文献状况,讨论了实际挑战的潜在解决方案,并确定了22个开放的研究挑战。 摘要:Fraud detection and prevention play an important part in ensuring the sustained operation of any e-commerce business. Machine learning (ML) often plays an important role in these anti-fraud operations, but the organizational context in which these ML models operate cannot be ignored. In this paper, we take an organization-centric view on the topic of fraud detection by formulating an operational model of the anti-fraud departments in e-commerce organizations. We derive 6 research topics and 12 practical challenges for fraud detection from this operational model. We summarize the state of the literature for each research topic, discuss potential solutions to the practical challenges, and identify 22 open research challenges.

【53】 Universal Approximation of Functions on Sets 标题:集合上函数的泛逼近

作者:Edward Wagstaff,Fabian B. Fuchs,Martin Engelcke,Michael A. Osborne,Ingmar Posner 机构:Department of Engineering Science, University of Oxford, Oxford, UK 备注:54 pages, 13 figures 链接:https://arxiv.org/abs/2107.01959 摘要:建模集合上的函数,或等价地,置换不变函数,是机器学习中一个长期存在的挑战。深集(Deep Sets)是一种流行的方法,已知它是连续集函数的普适逼近器。我们对深集进行了理论分析,表明只有当模型的潜在空间维度足够高时,这种普适逼近性质才有保证。如果潜在空间哪怕比所需低一个维度,就存在分段仿射函数,按最坏情况误差衡量,深集在这些函数上的表现并不优于一个朴素常数基线。深集可以被看作是Janossy池化范式最高效的体现。我们指出该范式涵盖了当前大多数流行的集合学习方法。基于这一联系,我们更广泛地讨论了我们的结果对集合学习的影响,并提出了一些关于Janossy池化普适性的开放问题。 摘要:Modelling functions of sets, or equivalently, permutation-invariant functions, is a long-standing challenge in machine learning. Deep Sets is a popular method which is known to be a universal approximator for continuous set functions. We provide a theoretical analysis of Deep Sets which shows that this universal approximation property is only guaranteed if the model's latent space is sufficiently high-dimensional. If the latent space is even one dimension lower than necessary, there exist piecewise-affine functions for which Deep Sets performs no better than a na\"ive constant baseline, as judged by worst-case error. Deep Sets may be viewed as the most efficient incarnation of the Janossy pooling paradigm. We identify this paradigm as encompassing most currently popular set-learning methods. Based on this connection, we discuss the implications of our results for set learning more broadly, and identify some open questions on the universality of Janossy pooling in general.
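下面用几行NumPy示意Deep Sets的结构$f(X)=\rho(\sum_{x\in X}\phi(x))$及其置换不变性;其中$\phi$输出的维度即摘要所讨论的“潜在空间”维度。网络结构与维度均为示例假设。

```python
# Deep Sets 最小示意:f(X) = rho( sum_x phi(x) ),对集合元素的置换不变
import numpy as np

rng = np.random.default_rng(6)
latent_dim = 8                        # “潜在空间”维度,即 phi 的输出维度
W1 = rng.normal(size=(1, latent_dim))
W2 = rng.normal(size=(latent_dim, 1))

def phi(x_set):                       # 逐元素编码器,输出 (set_size, latent_dim)
    return np.tanh(x_set[:, None] * W1)

def deep_sets(x_set):
    pooled = phi(x_set).sum(axis=0)   # 置换不变的求和池化
    return (np.tanh(pooled) @ W2).item()

x = np.array([0.3, -1.2, 2.0])
print(deep_sets(x), deep_sets(x[::-1]))   # 重排集合元素,输出相同
```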

【54】 Partition and Code: learning how to compress graphs 标题:分区和代码:学习如何压缩图形

作者:Giorgos Bouritsas,Andreas Loukas,Nikolaos Karalias,Michael M. Bronstein 机构:Imperial College London, UK,EPFL,Twitter, UK 链接:https://arxiv.org/abs/2107.01952 摘要:我们能用机器学习来压缩图数据吗?图缺乏固有顺序,这对传统压缩算法提出了巨大挑战,限制了其可获得的增益以及发现相关模式的能力。另一方面,大多数图压缩方法依赖于与领域相关的手工表示,不能适应不同的底层图分布。本文旨在建立一种接近熵存储下界的无损图压缩方法所应遵循的必要原则。我们没有对图的分布做刚性假设,而是将压缩器表述为一个可以从数据中学习并推广到未见实例的概率模型。我们的“划分与编码”(Partition and Code)框架包含三个步骤:首先,一个划分算法将图分解为基本结构;然后,将这些结构映射到一个小字典的元素上,并在该字典上学习概率分布;最后,熵编码器将该表示转换为比特。这三个步骤都是参数化的,可以用梯度下降法进行训练。我们从理论上比较了几种图编码的压缩质量,并在温和条件下证明了它们期望描述长度之间的一个全序关系。此外,我们还证明,在相同条件下,PnC相对于基线可获得随顶点数线性或二次增长的压缩增益。我们的算法在多种真实网络上进行了定量评估,相对于不同的非参数和参数图压缩器家族取得了显著的性能改进。 摘要:Can we use machine learning to compress graph data? The absence of ordering in graphs poses a significant challenge to conventional compression algorithms, limiting their attainable gains as well as their ability to discover relevant patterns. On the other hand, most graph compression approaches rely on domain-dependent handcrafted representations and cannot adapt to different underlying graph distributions. This work aims to establish the necessary principles a lossless graph compression method should follow to approach the entropy storage lower bound. Instead of making rigid assumptions about the graph distribution, we formulate the compressor as a probabilistic model that can be learned from data and generalise to unseen instances. Our "Partition and Code" framework entails three steps: first, a partitioning algorithm decomposes the graph into elementary structures, then these are mapped to the elements of a small dictionary on which we learn a probability distribution, and finally, an entropy encoder translates the representation into bits. All three steps are parametric and can be trained with gradient descent. We theoretically compare the compression quality of several graph encodings and prove, under mild conditions, a total ordering of their expected description lengths. Moreover, we show that, under the same conditions, PnC achieves compression gains w.r.t. the baselines that grow either linearly or quadratically with the number of vertices. Our algorithms are quantitatively evaluated on diverse real-world networks obtaining significant performance improvements with respect to different families of non-parametric and parametric graph compressors.

【55】 Differentially Private Sliced Wasserstein Distance 标题:差分隐私切片Wasserstein距离

作者:Alain Rakotomamonjy,Liva Ralaivola 备注:None 链接:https://arxiv.org/abs/2107.01848 摘要:开发保护隐私的机器学习方法是当今研究的中心课题,具有巨大的实际影响。在众多解决隐私保护学习的方法中,我们从在差分隐私(DP)框架下计算分布间散度的角度入手——能够计算分布之间的散度对许多机器学习问题至关重要,例如学习生成模型或领域自适应问题。我们不是求助于流行的基于梯度的DP净化方法,而是通过关注切片Wasserstein距离并使其无缝地满足差分隐私,从根本上解决问题。我们的主要贡献如下:我们分析了在切片Wasserstein距离的内在随机机制中加入高斯扰动的性质,并建立了由此产生的差分隐私机制的灵敏度。我们的一个重要发现是,这种DP机制将切片Wasserstein距离转换为另一个距离,我们称之为平滑切片Wasserstein距离。这种新的差分隐私分布距离可以以透明的方式插入到生成模型和领域自适应算法中。我们的经验表明,与文献中基于梯度的DP方法相比,它具有很强的竞争力,并且在我们考虑的领域自适应问题上几乎没有精度损失。 摘要:Developing machine learning methods that are privacy preserving is today a central topic of research, with huge practical impacts. Among the numerous ways to address privacy-preserving learning, we here take the perspective of computing the divergences between distributions under the Differential Privacy (DP) framework -- being able to compute divergences between distributions is pivotal for many machine learning problems, such as learning generative models or domain adaptation problems. Instead of resorting to the popular gradient-based sanitization method for DP, we tackle the problem at its roots by focusing on the Sliced Wasserstein Distance and seamlessly making it differentially private. Our main contribution is as follows: we analyze the property of adding a Gaussian perturbation to the intrinsic randomized mechanism of the Sliced Wasserstein Distance, and we establish the sensitivity of the resulting differentially private mechanism. One of our important findings is that this DP mechanism transforms the Sliced Wasserstein distance into another distance, that we call the Smoothed Sliced Wasserstein Distance. This new differentially private distribution distance can be plugged into generative models and domain adaptation algorithms in a transparent way, and we empirically show that it yields highly competitive performance compared with gradient-based DP approaches from the literature, with almost no loss in accuracy for the domain adaptation problems that we consider.
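下面给出“平滑切片Wasserstein距离”思想的一个示意实现:对每个随机投影方向计算一维$W_2$距离,并在切片机制内部加入高斯扰动。噪声标定$\sigma$应由原文的灵敏度分析给出,这里的取值以及“在投影值上加噪”这一具体加噪位置均为本文的示意性假设。

```python
# 示意:带高斯扰动的切片 Wasserstein 距离(sigma 与加噪位置均为假设)
import numpy as np

def smoothed_sliced_w2(x, y, n_proj=200, sigma=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    d, total = x.shape[1], 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)                 # 随机单位投影方向
        px = np.sort(x @ theta + sigma * rng.normal(size=len(x)))  # 投影后加噪
        py = np.sort(y @ theta + sigma * rng.normal(size=len(y)))
        total += np.mean((px - py) ** 2)               # 等样本量下的一维 W2^2
    return total / n_proj

rng = np.random.default_rng(5)
x = rng.normal(size=(500, 2))
y = rng.normal(loc=0.5, size=(500, 2))
print("smoothed SW2^2 ≈", smoothed_sliced_w2(x, y, rng=rng))
```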

【56】 Fast Rate Learning in Stochastic First Price Bidding 标题:随机第一价格竞价中的快速速率学习

作者:Juliette Achddou,Olivier Cappé,Aurélien Garivier 机构: DI ENS, Inria, ENS Paris, Université PSL, Numberly, DI ENS, Inria, ENS Paris, Université PSL,CNRS, UMPA,CNRS, Inria, ENS Lyon 链接:https://arxiv.org/abs/2107.01835 摘要:在程序化广告中,第一价格拍卖已在很大程度上取代了基于Vickrey拍卖的传统竞价方式。就学习而言,第一价格拍卖更具挑战性,因为最优竞价策略不仅取决于物品的价值,还需要了解其他参与者的出价。它们已在序贯学习领域催生了许多工作,其中不少工作考虑的是买方价值或对手最高出价以对抗方式选取的模型。即使在最简单的设置中,这也导致算法的遗憾相对于时间范围$T$以$\sqrt{T}$的速度增长。针对买方在平稳随机环境下博弈的情况,我们展示了如何获得显著更低的遗憾:当对手的最高出价分布已知时,我们给出了一个遗憾可低至$\log^2(T)$的算法;在必须序贯学习该分布的情况下,对于任何$\epsilon>0$,该算法的一个推广可以达到$T^{1/3+\epsilon}$的遗憾。为了得到这些结果,我们引入了两个本身即具独立价值的新想法。首先,通过移植在标价(posted price)设置中得到的结果,我们给出了第一价格竞价效用在其最优值附近局部二次的条件。其次,我们利用这样的观察:在小的子区间上,经验分布函数变化的集中度可以比使用经典的Dvoretzky-Kiefer-Wolfowitz不等式控制得更精确。数值模拟证实,对于各种出价分布——包括在真实程序化广告平台上收集的出价——我们的算法比文献中提出的替代方法收敛快得多。 摘要:First-price auctions have largely replaced traditional bidding approaches based on Vickrey auctions in programmatic advertising. As far as learning is concerned, first-price auctions are more challenging because the optimal bidding strategy does not only depend on the value of the item but also requires some knowledge of the other bids. They have already given rise to several works in sequential learning, many of which consider models for which the value of the buyer or the opponents' maximal bid is chosen in an adversarial manner. Even in the simplest settings, this gives rise to algorithms whose regret grows as $\sqrt{T}$ with respect to the time horizon $T$. Focusing on the case where the buyer plays against a stationary stochastic environment, we show how to achieve significantly lower regret: when the opponents' maximal bid distribution is known we provide an algorithm whose regret can be as low as $\log^2(T)$; in the case where the distribution must be learnt sequentially, a generalization of this algorithm can achieve $T^{1/3+ \epsilon}$ regret, for any $\epsilon>0$. To obtain these results, we introduce two novel ideas that can be of interest in their own right. First, by transposing results obtained in the posted price setting, we provide conditions under which the first-price biding utility is locally quadratic around its optimum. Second, we leverage the observation that, on small sub-intervals, the concentration of the variations of the empirical distribution function may be controlled more accurately than by using the classical Dvoretzky-Kiefer-Wolfowitz inequality. Numerical simulations confirm that our algorithms converge much faster than alternatives proposed in the literature for various bid distributions, including for bids collected on an actual programmatic advertising platform.

【57】 An Explainable AI System for the Diagnosis of High Dimensional Biomedical Data 标题:一种用于高维生物医学数据诊断的可解释人工智能系统

作者:Alfred Ultsch,Jörg Hoffmann,Maximilian Röhnert,Malte Von Bonin,Uta Oelschlägel,Cornelia Brendel,Michael C. Thrun 机构:) Databionics, Mathematics and Computer Science, Philipps-Universität Marburg, Hans-Meerwein-Straße , D-, Marburg., ) J Department of Hematology, Oncology and Immunology, Philipps-University, Baldinger Str., D-, Marburg. 备注:22 pages, 1 figure, 5 tables 链接:https://arxiv.org/abs/2107.01820 摘要:最先进的典型流式细胞术数据样本包含对超过100,000个细胞、每个细胞10个或更多特征的测量。人工智能系统能够以与人类专家几乎相同的精确度诊断此类数据。然而,这类系统面临一个核心挑战:它们的决定对人们的健康和生活具有深远影响,因此人工智能系统的决定需要能够被人类理解和论证。在这项工作中,我们提出了一种新的可解释人工智能方法,称为ALPODS,它能够基于高维数据中的聚类(即子群体)对病例进行分类(诊断)。ALPODS能够以人类专家可以理解的形式解释它的决定。对于识别出的子群体,会生成以领域专家典型语言表达的模糊推理规则。基于这些规则的可视化方法使人类专家能够理解人工智能系统所用的推理。与一系列最先进的可解释人工智能系统的比较表明,ALPODS在已知基准数据和日常常规病例数据上都能高效运行。 摘要:Typical state of the art flow cytometry data samples consists of measures of more than 100.000 cells in 10 or more features. AI systems are able to diagnose such data with almost the same accuracy as human experts. However, there is one central challenge in such systems: their decisions have far-reaching consequences for the health and life of people, and therefore, the decisions of AI systems need to be understandable and justifiable by humans. In this work, we present a novel explainable AI method, called ALPODS, which is able to classify (diagnose) cases based on clusters, i.e., subpopulations, in the high-dimensional data. ALPODS is able to explain its decisions in a form that is understandable for human experts. For the identified subpopulations, fuzzy reasoning rules expressed in the typical language of domain experts are generated. A visualization method based on these rules allows human experts to understand the reasoning used by the AI system. A comparison to a selection of state of the art explainable AI systems shows that ALPODS operates efficiently on known benchmark data and also on everyday routine case data.

【58】 Learning a Model for Inferring a Spatial Road Lane Network Graph using Self-Supervision 标题:基于自监督的空间车道网络图推理模型的学习

作者:Robin Karlsson,David Robert Wong,Simon Thompson,Kazuya Takeda 机构: 2Institute of Innovation for Future Society, Nagoya University, 3Graduate School of Informatics 备注:Accepted for IEEE ITSC 2021 链接:https://arxiv.org/abs/2107.01784 摘要:互联道路车道是城市道路导航的核心概念。目前,由于算法模型的设计比较困难,大多数自主车辆都依赖于预先构建的车道图。然而,这种地图的生成和维护成本高昂,阻碍了自主车辆技术的大规模采用。提出了一种基于车载传感器生成的道路场景密集分段表示的自监督学习方法,用于训练一个模型来推断空间接地车道级道路网络图。提出了一种形式化的道路-车道网络模型,证明了在保留交叉区域概念的前提下,任何结构化道路场景都可以用最深3的有向无环图来表示,这是最压缩的表示。形式化模型是由一个混合神经和搜索为基础的模型,利用一个新的屏障功能损失公式鲁棒学习部分标签。对所有常见的道路交叉口布局进行了试验。结果表明,该模型可以推广到新的道路布局,不同于以往的方法,显示了其潜在的实际应用作为一个实际的学习为基础的车道级地图生成器。 摘要:Interconnected road lanes are a central concept for navigating urban roads. Currently, most autonomous vehicles rely on preconstructed lane maps as designing an algorithmic model is difficult. However, the generation and maintenance of such maps is costly and hinders large-scale adoption of autonomous vehicle technology. This paper presents the first self-supervised learning method to train a model to infer a spatially grounded lane-level road network graph based on a dense segmented representation of the road scene generated from onboard sensors. A formal road lane network model is presented and proves that any structured road scene can be represented by a directed acyclic graph of at most depth three while retaining the notion of intersection regions, and that this is the most compressed representation. The formal model is implemented by a hybrid neural and search-based model, utilizing a novel barrier function loss formulation for robust learning from partial labels. Experiments are conducted for all common road intersection layouts. Results show that the model can generalize to new road layouts, unlike previous approaches, demonstrating its potential for real-world application as a practical learning-based lane-level map generator.

【59】 A Comparison of the Delta Method and the Bootstrap in Deep Learning Classification 标题:深度学习分类中Delta方法与Bootstrap方法的比较

作者:Geir K. Nilsen,Antonella Z. Munthe-Kaas,Hans J. Skaug,Morten Brun 机构:Department of Mathematics, University of Bergen 链接:https://arxiv.org/abs/2107.01606 摘要:通过与经典Bootstrap方法的比较,验证了最近提出的深度学习分类自适应Delta方法的有效性。结果表明,在使用MNIST和CIFAR-10数据集的两个基于LeNet的神经网络分类器上,两种方法得到的定量预测认知不确定性水平之间存在很强的线性关系。此外,我们证明了Delta方法比Bootstrap方法减少了五倍的计算时间。 摘要:We validate the recently introduced deep learning classification adapted Delta method by a comparison with the classical Bootstrap. We show that there is a strong linear relationship between the quantified predictive epistemic uncertainty levels obtained from the two methods when applied on two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets. Furthermore, we demonstrate that the Delta method offers a five times computation time reduction compared to the Bootstrap.

【60】 Random Neural Networks in the Infinite Width Limit as Gaussian Processes 标题:无限宽度极限下的随机神经网络为高斯过程

作者:Boris Hanin 机构:Department of Operations Research and Financial Engineering, Princeton University 备注:26p 链接:https://arxiv.org/abs/2107.01562 摘要:本文给出了一个新的证明:在输入维、输出维和深度保持不变,而隐层宽度趋于无穷大的情况下,具有随机权值和偏置的全连接神经网络收敛于高斯过程。与以往工作不同,这里的收敛性仅假设权重分布满足矩条件,并允许相当一般的非线性。 摘要:This article gives a new proof that fully connected neural networks with random weights and biases converge to Gaussian processes in the regime where the input dimension, output dimension, and depth are kept fixed, while the hidden layer widths tend to infinity. Unlike prior work, convergence is shown assuming only moment conditions for the distribution of weights and for quite general non-linearities.
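下面用一个小实验数值示意该结论:固定输入/输出维与深度,随着隐层宽度增大,随机初始化网络在固定输入处的输出分布趋于高斯(过度峰度趋于0)。权重按$1/\sqrt{\text{宽度}}$缩放;非线性与权重分布的选择为本文假设。

```python
# 数值示意:隐层宽度增大时,随机网络在固定输入处的输出趋于高斯
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.array([0.5, -1.0, 2.0])        # 固定一个输入

def random_net_output(width):
    W1 = rng.normal(scale=1 / np.sqrt(len(x)), size=(len(x), width))
    b1 = rng.normal(size=width)
    W2 = rng.normal(scale=1 / np.sqrt(width), size=(width, 1))
    h = np.tanh(x @ W1 + b1)          # 一个隐层,宽度可变
    return float(h @ W2)

for width in [4, 64, 1024]:
    samples = np.array([random_net_output(width) for _ in range(4000)])
    print(f"width={width:5d}  过度峰度 = {stats.kurtosis(samples):+.3f}")  # 趋于 0
```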

【61】 Certifiably Robust Interpretation via Renyi Differential Privacy 标题:基于Renyi差分隐私的可证明鲁棒解释

作者:Ao Liu,Xiaoyu Chen,Sijia Liu,Lirong Xia,Chuang Gan 机构:Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China, Department of Computer Science and Engineering, Michigan State University 备注:19 page main text + appendix 链接:https://arxiv.org/abs/2107.01561 摘要:基于最近发现CNN的解释图很容易被针对网络可解释性的对抗性攻击所操纵,我们从Renyi差分隐私(RDP)这一新视角研究解释鲁棒性问题。我们的Renyi-Robust-Smooth(基于RDP的解释方法)具有三方面的优点。首先,它可以提供可证明且可认证的top-$k$鲁棒性。也就是说,解释图的top-$k$重要归因在任何$\ell_d$-范数有界的输入扰动下(对于任何$d\geq 1$,包括$d=\infty$)都是可证明鲁棒的。第二,在top-$k$归因方面,我们提出的方法比现有方法的实验鲁棒性约好$10\%$。值得注意的是,Renyi-Robust-Smooth的精度也优于现有方法。第三,我们的方法可以在鲁棒性和计算效率之间提供平滑的折衷。实验表明,当计算资源高度受限时,其top-$k$归因的鲁棒性是现有方法的{\em 两倍}。 摘要:Motivated by the recent discovery that the interpretation maps of CNNs could easily be manipulated by adversarial attacks against network interpretability, we study the problem of interpretation robustness from a new perspective of R\'enyi differential privacy (RDP). The advantages of our Renyi-Robust-Smooth (RDP-based interpretation method) are threefold. First, it can offer provable and certifiable top-$k$ robustness. That is, the top-$k$ important attributions of the interpretation map are provably robust under any input perturbation with bounded $\ell_d$-norm (for any $d\geq 1$, including $d = \infty$). Second, our proposed method offers $\sim10\%$ better experimental robustness than existing approaches in terms of the top-$k$ attributions. Remarkably, the accuracy of Renyi-Robust-Smooth also outperforms existing approaches. Third, our method can provide a smooth tradeoff between robustness and computational efficiency. Experimentally, its top-$k$ attributions are {\em twice} more robust than existing approaches when the computational resources are highly constrained.

【62】 Bayesian decision-making under misspecified priors with applications to meta-learning 标题:错误指定先验下的贝叶斯决策及其在元学习中的应用

作者:Max Simchowitz,Christopher Tosh,Akshay Krishnamurthy,Daniel Hsu,Thodoris Lykouris,Miroslav Dudík,Robert E. Schapire 链接:https://arxiv.org/abs/2107.01509 摘要:汤普森采样和其他贝叶斯序贯决策算法是在(上下文)bandit问题中处理探索/利用权衡的最流行方法。在这些算法中,先验的选择提供了编码领域知识的灵活性,但在误设时也会导致性能不佳。在本文中,我们证明了性能随先验误设而优雅地退化。我们证明了具有误设先验的汤普森采样(TS)所累积的期望报酬,与先验正确指定的TS最多相差$\tilde{\mathcal{O}}(H^2\epsilon)$,其中$\epsilon$是先验之间的总变差距离,$H$是学习视界。我们的界不要求先验具有任何参数形式。对于具有有界支撑的先验,我们的界与动作空间的基数或结构无关,并且我们证明在最坏情况下该界在普适常数意义下是紧的。在灵敏度分析的基础上,我们为最近研究的贝叶斯元学习设置中的算法建立了通用的PAC保证,并针对各种先验族导出了推论。我们的结果沿两个方向推广:(1)它们适用于更广泛的贝叶斯决策算法家族,包括知识梯度算法(KG)的蒙特卡罗实现;(2)它们适用于贝叶斯POMDP——最一般的贝叶斯决策设置,作为特例包含上下文bandit。通过数值模拟,我们说明了在具有结构化和相关先验的多臂与上下文bandit中,先验误设以及一步前瞻(如KG)的使用如何影响元学习的收敛性。 摘要:Thompson sampling and other Bayesian sequential decision-making algorithms are among the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The choice of prior in these algorithms offers flexibility to encode domain knowledge but can also lead to poor performance when misspecified. In this paper, we demonstrate that performance degrades gracefully with misspecification. We prove that the expected reward accrued by Thompson sampling (TS) with a misspecified prior differs by at most $\tilde{\mathcal{O}}(H^2 \epsilon)$ from TS with a well specified prior, where $\epsilon$ is the total-variation distance between priors and $H$ is the learning horizon. Our bound does not require the prior to have any parametric form. For priors with bounded support, our bound is independent of the cardinality or structure of the action space, and we show that it is tight up to universal constants in the worst case. Building on our sensitivity analysis, we establish generic PAC guarantees for algorithms in the recently studied Bayesian meta-learning setting and derive corollaries for various families of priors. Our results generalize along two axes: (1) they apply to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient algorithm (KG), and (2) they apply to Bayesian POMDPs, the most general Bayesian decision-making setting, encompassing contextual bandits as a special case. Through numerical simulations, we illustrate how prior misspecification and the deployment of one-step look-ahead (as in KG) can impact the convergence of meta-learning in multi-armed and contextual bandits with structured and correlated priors.
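下面在Bernoulli两臂bandit中示意摘要的定性结论:用误设的Beta先验运行汤普森采样,并与弱信息先验比较累计奖励,可以观察到性能随先验误设而退化但不至崩溃。臂参数与先验均为本文假设的示例。

```python
# Thompson 采样示意:比较弱信息先验与误设先验下的累计奖励(参数为假设)
import numpy as np

def thompson(prior_a, prior_b, mu=(0.5, 0.6), T=5000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    a, b = np.array(prior_a, float), np.array(prior_b, float)
    reward = 0.0
    for _ in range(T):
        arm = int(np.argmax(rng.beta(a, b)))   # 从各臂后验采样并选最大者
        r = float(rng.random() < mu[arm])      # Bernoulli 奖励
        a[arm] += r
        b[arm] += 1 - r                        # 共轭更新 Beta 后验
        reward += r
    return reward

rng = np.random.default_rng(8)
print("弱信息先验 Beta(1,1):  ", thompson((1, 1), (1, 1), rng=rng))
print("误设先验(错误偏向劣臂):", thompson((8, 1), (1, 8), rng=rng))
```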

【63】 Cleaning large-dimensional covariance matrices for correlated samples

Authors: Zdzislaw Burda, Andrzej Jarosz
Affiliations: AGH University of Science and Technology, Applied Computer Science, al. Mickiewicza, Krakow, Poland
Note: 5 pages, 3 figures
Link: https://arxiv.org/abs/2107.01352
Abstract: A non-linear shrinkage estimator of large-dimensional covariance matrices is derived in a setting of auto-correlated samples, thus generalizing the recent formula by Ledoit-P\'{e}ch\'{e}. The calculation is facilitated by random matrix theory. The result is turned into an efficient algorithm, and an associated Python library, shrinkage, with the help of the Ledoit-Wolf kernel estimation technique. An example of exponentially decaying auto-correlations is presented.
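For context, here is a hedged baseline in Python: classical Ledoit-Wolf linear shrinkage from scikit-learn applied to simulated AR(1)-autocorrelated samples. This is only the i.i.d.-design baseline that the paper generalizes; the authors' non-linear estimator for correlated samples lives in their shrinkage library, whose API is not reproduced here.

import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
T, N, rho = 500, 100, 0.6        # samples, dimension, AR(1) autocorrelation

# Simulate autocorrelated rows: x_t = rho * x_{t-1} + sqrt(1 - rho^2) * eps_t,
# so each coordinate is a stationary AR(1) process with unit variance.
X = np.empty((T, N))
X[0] = rng.standard_normal(N)
for t in range(1, T):
    X[t] = rho * X[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(N)

sample_cov = np.cov(X, rowvar=False)        # noisy when N is comparable to T
lw_cov = LedoitWolf().fit(X).covariance_    # linear shrinkage toward identity

The true covariance here is the identity, so one can check that the shrunk eigenvalue spectrum is far less dispersed than that of the raw sample covariance.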

【64】 Principled, practical, flexible, fast: a new approach to phylogenetic factor analysis

Authors: Gabriel W. Hassler, Brigida Gallone, Leandro Aristide, William L. Allen, Max R. Tolkoff, Andrew J. Holbrook, Guy Baele, Philippe Lemey, Marc A. Suchard
Affiliations: Department of Computational Medicine, David Geffen School of Medicine at UCLA, University of California, Los Angeles, United States; VIB–KU Leuven Center for Microbiology, Leuven, Belgium
Note: 27 pages, 7 figures, 1 table
Link: https://arxiv.org/abs/2107.01246
Abstract: Biological phenotypes are products of complex evolutionary processes in which selective forces influence multiple biological trait measurements in unknown ways. Phylogenetic factor analysis disentangles these relationships across the evolutionary history of a group of organisms. Scientists seeking to employ this modeling framework confront numerous modeling and implementation decisions whose details pose computational and replicability challenges. General and impactful community adoption requires a data-scientific analysis plan that balances flexibility, speed and ease of use, while minimizing model and algorithm tuning. Even in the presence of non-trivial phylogenetic model constraints, we show that one may analytically address latent factor uncertainty in a way that (a) aids model flexibility, (b) accelerates computation (by as much as 500-fold) and (c) decreases required tuning. We further present practical guidance on inference and modeling decisions, as well as on diagnosing and solving common problems in these analyses. We codify this analysis plan in an automated pipeline that distills the potentially overwhelming array of modeling decisions into a small handful of (typically binary) choices. We demonstrate the utility of these methods and this analysis plan on four real-world problems of varying scales.
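To fix ideas, here is a minimal simulation sketch of the generative model behind phylogenetic factor analysis, assuming a small, hypothetical tree covariance matrix: latent factors evolve along the tree (each factor column is Gaussian with the tree covariance), and observed traits are loadings times factors plus noise. The paper's inference machinery (analytic treatment of latent factor uncertainty, the automated pipeline) is not shown.

import numpy as np

rng = np.random.default_rng(1)
n_taxa, n_factors, n_traits = 4, 2, 6

# Hypothetical tree covariance Psi: entry (i, j) is the shared branch
# length of taxa i and j from the root under a Brownian-motion model.
Psi = np.array([[1.0, 0.6, 0.2, 0.2],
                [0.6, 1.0, 0.2, 0.2],
                [0.2, 0.2, 1.0, 0.7],
                [0.2, 0.2, 0.7, 1.0]])

# Each factor evolves independently along the tree: columns of F ~ N(0, Psi).
F = np.linalg.cholesky(Psi) @ rng.standard_normal((n_taxa, n_factors))
L = rng.standard_normal((n_factors, n_traits))        # trait loadings
Y = F @ L + 0.1 * rng.standard_normal((n_taxa, n_traits))  # observed traits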
