
Statistics arXiv Daily Digest [9.8]

Author: 公众号-arXiv每日学术速递 (WeChat official account "arXiv Daily Academic Digest")
Published 2021-09-16 16:37:16

Update! The H5 page now supports collapsible abstracts for a better reading experience! Click "Read the original" to visit arxivdaily.com, which covers CS | Physics | Math | Economics | Statistics | Finance | Biology | Electrical Engineering, with search, favorites, and more!

stat (Statistics), 28 papers in total

【1】 Estimating effects within nonlinear autoregressive models: a case study on the impact of child access prevention laws on firearm mortality
Link: https://arxiv.org/abs/2109.03225

Authors: Matthew Cefalu, Terry Schell, Beth Ann Griffin, Rosanna Smart, Andrew Morral
Abstract: Autoregressive models are widely used for the analysis of time-series data, but they remain underutilized when estimating effects of interventions. This is in part due to endogeneity of the lagged outcome with any intervention of interest, which creates difficulty interpreting model coefficients. These problems are only exacerbated in nonlinear or nonadditive models that are common when studying crime, mortality, or disease. In this paper, we explore the use of negative binomial autoregressive models when estimating the effects of interventions on count data. We derive a simple approximation that facilitates direct interpretation of model parameters under any order autoregressive model. We illustrate the approach using an empirical simulation study using 36 years of state-level firearm mortality data from the United States and use the approach to estimate the effect of child access prevention laws on firearm mortality.
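A minimal sketch of the interpretive issue this paper addresses, using a toy deterministic linear AR(1) rather than the authors' negative binomial model (function names and parameter values are illustrative assumptions): because the lagged outcome feeds back, the coefficient on the intervention gives only its immediate effect, while the long-run effect is amplified to beta / (1 - phi).

```python
# Toy linear AR(1): y_t = c + phi * y_{t-1} + beta * I_t (noise-free sketch).
# Immediate effect of switching the intervention on is beta; the long-run
# (steady-state) effect is beta / (1 - phi), which is what naive readings
# of the intervention coefficient miss.

def ar1_path(c, phi, beta, y0, horizon, policy_on_at):
    y, path = y0, []
    for t in range(horizon):
        i_t = 1.0 if t >= policy_on_at else 0.0
        y = c + phi * y + beta * i_t
        path.append(y)
    return path

c, phi, beta = 1.0, 0.6, -0.5
baseline = c / (1 - phi)                    # steady state without the policy
path = ar1_path(c, phi, beta, baseline, 200, policy_on_at=0)

immediate_effect = path[0] - baseline       # = beta = -0.5
long_run_effect = path[-1] - baseline       # ≈ beta / (1 - phi) = -1.25
```

The gap between the two quantities grows as phi approaches 1, which is why an approximation for direct interpretation, as the paper derives for the nonlinear count-data case, is useful.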

【2】 Adaptive variational Bayes: Optimality, computation and applications
Link: https://arxiv.org/abs/2109.03204

Authors: Ilsang Ohn, Lizhen Lin
Affiliations: Department of Applied and Computational Mathematics and Statistics, The University of Notre Dame
Abstract: In this paper, we explore adaptive inference based on variational Bayes. Although a number of studies have been conducted to analyze contraction properties of variational posteriors, there is still a lack of a general and computationally tractable variational Bayes method that can achieve adaptive optimal contraction of the variational posterior. We propose a novel variational Bayes framework, called adaptive variational Bayes, which can operate on a collection of models with varying dimensions and structures. The proposed framework combines variational posteriors over individual models with certain weights to obtain a variational posterior over the entire model. It turns out that this combined variational posterior minimizes the Kullback-Leibler divergence to the original posterior distribution. We show that the proposed variational posterior achieves optimal contraction rates adaptively under very general conditions and attains model selection consistency when the true model structure exists. We apply the general results obtained for the adaptive variational Bayes to several examples including deep learning models and derive some new and adaptive inference results. Moreover, we consider the use of quasi-likelihood in our framework. We formulate conditions on the quasi-likelihood to ensure the adaptive optimality and discuss specific applications to stochastic block models and nonparametric regression with sub-Gaussian errors.
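A hedged sketch of the kind of model-combination step the abstract describes, not the paper's exact construction: a standard way to weight per-model variational posteriors is proportionally to the model prior times exp(ELBO), computed stably with the log-sum-exp trick (the function name and numbers below are illustrative).

```python
import math

# Combine variational posteriors fitted on models k = 1..K into weights over
# the model collection.  Weight_k ∝ prior_k * exp(ELBO_k); log-sum-exp keeps
# the exponentials from underflowing for large negative ELBOs.

def model_weights(elbos, log_priors):
    logs = [e + lp for e, lp in zip(elbos, log_priors)]
    m = max(logs)                            # stabilize before exponentiating
    unnorm = [math.exp(l - m) for l in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]

elbos = [-1523.4, -1519.8, -1540.2]          # per-model evidence lower bounds
log_priors = [math.log(1 / 3)] * 3           # uniform prior over models
w = model_weights(elbos, log_priors)         # concentrates on the second model
```

With a uniform model prior the weights reduce to a softmax of the ELBOs, so the combined posterior concentrates on well-fitting models, which is the mechanism behind the adaptivity claims.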

【3】 On unified framework for nonlinear grey system models: an integro-differential equation perspective
Link: https://arxiv.org/abs/2109.03101

Authors: Lu Yang, Naiming Xie, Baolei Wei, Xiaolei Wang
Abstract: Nonlinear grey system models, serving to time series forecasting, are extensively used in diverse areas of science and engineering. However, most research concerns improving classical models and developing novel models, while relatively limited attention has been paid to the relationship among diverse models and the modelling mechanism. The current paper proposes a unified framework and reconstructs the unified model from an integro-differential equation perspective. First, we propose a methodological framework that subsumes various nonlinear grey system models as special cases, providing a cumulative sum series-orientated modelling paradigm. Then, by introducing an integral operator, the unified model is reduced to an equivalent integro-differential equation; on this basis, the structural parameters and initial value are estimated simultaneously via the integral matching approach. The modelling procedure comparison further indicates that the integral matching-based integro-differential equation provides a direct modelling paradigm. Next, large-scale Monte Carlo simulations are conducted to compare the finite sample performance, and the results show that the reduced model has higher accuracy and robustness to noise. Applications of forecasting the municipal sewage discharge and water consumption in the Yangtze River Delta of China further illustrate the effectiveness of the reconstructed nonlinear grey models.

【4】 Estimating the case fatality rate of a disease during the course of an epidemic with an application to COVID-19 in Argentina
Link: https://arxiv.org/abs/2109.03087

Authors: Agustín Alvarez, Marina Fragalá, Marina Valdora
Affiliations: Universidad de General Sarmiento; Departamento de Matemática
Comments: 17 pages, 6 figures
Abstract: We present an accurate estimator of the case fatality rate that can be computed during the course of an epidemic outbreak and we prove its asymptotic properties. We apply it to the real case of COVID-19 in Argentina during 2020 and to simulated data and show its very good performance.
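To make the estimation problem concrete, here is a hedged sketch of why mid-outbreak case fatality rates are hard to estimate, using one classical adjustment rather than the estimator proposed in the paper (the abstract does not specify its form; function names and numbers are illustrative).

```python
# During an ongoing outbreak the naive CFR = deaths / confirmed is biased
# downward, because many confirmed cases have not yet resolved.  A classical
# alternative conditions on cases with a known outcome.

def naive_cfr(deaths, confirmed):
    return deaths / confirmed

def resolved_cfr(deaths, recovered):
    # restrict the denominator to resolved cases (death or recovery)
    return deaths / (deaths + recovered)

deaths, recovered, confirmed = 120, 2280, 6000
cfr_naive = naive_cfr(deaths, confirmed)       # 0.02: biased low mid-outbreak
cfr_resolved = resolved_cfr(deaths, recovered) # 0.05: resolved cases only
```

The two estimates converge once the outbreak ends and all cases resolve; the paper's contribution is an estimator with provable asymptotic properties that already works during the outbreak.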

【5】 Robust adaptive Lasso in high-dimensional logistic regression with an application to genomic classification of cancer patients
Link: https://arxiv.org/abs/2109.03028

Authors: Ayanendranath Basu, Abhik Ghosh, María Jaenada, Leandro Pardo
Comments: 25 pages
Abstract: Penalized logistic regression is extremely useful for binary classification with a large number of covariates (significantly higher than the sample size), having several real life applications, including genomic disease classification. However, the existing methods based on the likelihood-based loss function are sensitive to data contamination and other noise and, hence, robust methods are needed for stable and more accurate inference. In this paper, we propose a family of robust estimators for sparse logistic models utilizing the popular density power divergence based loss function and the general adaptively weighted LASSO penalties. We study the local robustness of the proposed estimators through its influence function and also derive its oracle properties and asymptotic distribution. With extensive empirical illustrations, we clearly demonstrate the significantly improved performance of our proposed estimators over the existing ones with particular gain in robustness. Our proposal is finally applied to analyse four different real datasets for cancer classification, obtaining robust and accurate models that simultaneously perform gene selection and patient classification.

【6】 Statistical analysis of locally parameterized shapes
Link: https://arxiv.org/abs/2109.03027

Authors: Mohsen Taheri, Jörn Schulz
Affiliations: Department of Mathematics & Physics, University of Stavanger
Comments: 25 pages, 20 figures
Abstract: The alignment of shapes has been a crucial step in statistical shape analysis, for example, in calculating mean shape, detecting locational differences between two shape populations, and classification. Procrustes alignment is the most commonly used method and state of the art. In this work, we uncover that alignment might seriously affect the statistical analysis. For example, alignment can induce false shape differences and lead to misleading results and interpretations. We propose a novel hierarchical shape parameterization based on local coordinate systems. The local parameterized shapes are translation and rotation invariant. Thus, the inherent alignment problems from the commonly used global coordinate system for shape representation can be avoided using this parameterization. The new parameterization is also superior for shape deformation and simulation. The method's power is demonstrated on the hypothesis testing of simulated data as well as the left hippocampi of patients with Parkinson's disease and controls.

【7】 Estimation of the covariate conditional tail expectation: a depth-based level set approach
Link: https://arxiv.org/abs/2109.03017

Authors: Elisabeth Armaut, Roland Diel, Thomas Laloë
Affiliations: Université de Nice Sophia-Antipolis
Abstract: The aim of this paper is to study the asymptotic behavior of a particular multivariate risk measure, the Covariate-Conditional-Tail-Expectation (CCTE), based on a multivariate statistical depth function. Depth functions have become increasingly powerful tools in nonparametric inference for multivariate data, as they measure a degree of centrality of a point with respect to a distribution. A multivariate risk scenario is then represented by a depth-based lower level set of the risk factors, meaning that we consider a non-compact setting. More precisely, given a multivariate depth function D associated to a fixed probability measure, we are interested in the lower level set based on D. First, we present a plug-in approach in order to estimate the depth-based level set. In a second part, we provide a consistent estimator of our CCTE for a general depth function with a rate of convergence, and we consider the particular case of the Mahalanobis depth. A simulation study illustrates the performance of our estimator.

【8】 Tree-based boosting with functional data
Link: https://arxiv.org/abs/2109.02989

Authors: Xiaomeng Ju, Matías Salibián-Barrera
Abstract: In this article we propose a boosting algorithm for regression with functional explanatory variables and scalar responses. The algorithm uses decision trees constructed with multiple projections as the "base-learners", which we call "functional multi-index trees". We establish identifiability conditions for these trees and introduce two algorithms to compute them: one finds optimal projections over the entire tree, while the other one searches for a single optimal projection at each split. We use numerical experiments to investigate the performance of our method and compare it with several linear and nonlinear regression estimators, including recently proposed nonparametric and semiparametric functional additive estimators. Simulation studies show that the proposed method is consistently among the top performers, whereas the performance of any competitor relative to others can vary substantially across different settings. In a real example, we apply our method to predict electricity demand using price curves and show that our estimator provides better predictions compared to its competitors, especially when one adjusts for seasonality.

【9】 Instance-dependent Label-noise Learning under a Structural Causal Model
Link: https://arxiv.org/abs/2109.02986

Authors: Yu Yao, Tongliang Liu, Mingming Gong, Bo Han, Gang Niu, Kun Zhang
Affiliations: University of Sydney; University of Melbourne; Hong Kong Baptist University; RIKEN AIP; Carnegie Mellon University
Abstract: Label noise will degenerate the performance of deep learning algorithms because deep neural networks easily overfit label errors. Let X and Y denote the instance and clean label, respectively. When Y is a cause of X, according to which many datasets have been constructed, e.g., SVHN and CIFAR, the distributions of P(X) and P(Y|X) are entangled. This means that the unsupervised instances are helpful to learn the classifier and thus reduce the side effect of label noise. However, it remains elusive how to exploit the causal information to handle the label noise problem. In this paper, by leveraging a structural causal model, we propose a novel generative approach for instance-dependent label-noise learning. In particular, we show that properly modeling the instances will contribute to the identifiability of the label noise transition matrix and thus lead to a better classifier. Empirically, our method outperforms all state-of-the-art methods on both synthetic and real-world label-noise datasets.

【10】 Dynamic Network Regression
Link: https://arxiv.org/abs/2109.02981

Authors: Yidong Zhou, Hans-Georg Müller
Affiliations: Department of Statistics, University of California, Davis, CA, USA
Comments: 49 pages, 10 figures
Abstract: Network data are increasingly available in various research fields, motivating statistical analysis for populations of networks where a network as a whole is viewed as a data point. Due to the non-Euclidean nature of networks, basic statistical tools available for scalar and vector data are no longer applicable when one aims to relate networks as outcomes to Euclidean covariates, while the study of how a network changes in dependence on covariates is often of paramount interest. This motivates extending the notion of regression to the case of responses that are network data. Here we propose to adopt conditional Fréchet means implemented with both global least squares regression and local weighted least squares smoothing, extending the Fréchet regression concept to networks that are quantified by their graph Laplacians. The challenge is to characterize the space of graph Laplacians so as to justify the application of Fréchet regression. This characterization then leads to asymptotic rates of convergence for the corresponding M-estimators by applying empirical process methods. We demonstrate the usefulness and good practical performance of the proposed framework with simulations and with network data arising from NYC taxi records, as well as resting-state fMRI in neuroimaging.
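A hedged sketch of global Fréchet regression in the special case where the response metric is Euclidean (e.g. graph Laplacians compared in the Frobenius norm), so the weighted Fréchet mean has a closed form. The weight formula follows the standard global Fréchet regression construction; the paper's characterization of the Laplacian space is omitted, and the function name and toy responses are assumptions.

```python
import numpy as np

# Global Fréchet regression predicts the response at covariate value x0 as a
# weighted Fréchet mean with weights s_i(x0) = 1 + (x0 - xbar)' Sigma^{-1}
# (X_i - xbar).  Under a Euclidean (Frobenius) metric this is just a weighted
# average of the matrix-valued responses.

def global_frechet_predict(X, Y, x0):
    # X: (n, p) covariates; Y: (n, d, d) matrix responses; x0: (p,)
    n = X.shape[0]
    xbar = X.mean(axis=0)
    Sigma = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    s = 1.0 + (x0 - xbar) @ np.linalg.solve(Sigma, (X - xbar).T)  # (n,)
    return np.tensordot(s, Y, axes=1) / n      # weighted Euclidean mean

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
# toy "Laplacian-like" responses that depend linearly on the covariate
Y = np.stack([np.eye(2) * (2.0 + x[0]) for x in X])
pred = global_frechet_predict(X, Y, np.array([0.0]))   # ≈ 2 * I
```

Because the toy responses are exactly linear in the covariate, the weighted mean recovers the true response at x0 = 0; real graph Laplacians would additionally require projecting back onto the Laplacian space, which is part of what the paper formalizes.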

【11】 Fast approximations of pseudo-observations in the context of right-censoring and interval-censoring
Link: https://arxiv.org/abs/2109.02959

Authors: Olivier Bouaziz
Affiliations: MAP (UMR CNRS), Université de Paris
Abstract: In the context of right-censored and interval-censored data we develop asymptotic formulas to compute pseudo-observations for the survival function and the Restricted Mean Survival Time (RMST). Those formulas are based on the original estimators and do not involve computation of the jackknife estimators. For right-censored data, Von Mises expansions of the Kaplan-Meier estimator are used to derive the pseudo-observations. For interval-censored data, a general class of parametric models for the survival function is studied. An asymptotic representation of the pseudo-observations is derived involving the Hessian matrix and the score vector of the density. The formula is illustrated on the piecewise-constant hazard model for the RMST. The proposed approximations are extremely accurate, even for small sample sizes, as illustrated on Monte-Carlo simulations and real data. We also study the gain in terms of computation time, as compared to the original jackknife method, which can be substantial for large datasets.
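For context, a minimal sketch of the classical jackknife pseudo-observations whose recomputation cost the paper's asymptotic formulas avoid, shown for a simple statistic rather than the survival function or RMST (function names are illustrative).

```python
# The i-th jackknife pseudo-observation for an estimator theta_hat is
#   PO_i = n * theta_hat(all data) - (n - 1) * theta_hat(data without i),
# which requires n leave-one-out refits -- the cost the paper's asymptotic
# approximations remove.

def pseudo_observations(data, estimator):
    n = len(data)
    full = estimator(data)
    return [n * full - (n - 1) * estimator(data[:i] + data[i + 1:])
            for i in range(n)]

mean = lambda xs: sum(xs) / len(xs)
x = [3.0, 1.0, 4.0, 1.0, 5.0]
po = pseudo_observations(x, mean)   # for the mean, PO_i equals x_i exactly
```

The sample mean makes the construction transparent (its pseudo-observations are the observations themselves); for the Kaplan-Meier estimator each leave-one-out refit is expensive, which motivates the Von Mises expansion approach of the paper.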

【12】 Convergence rate of a collapsed Gibbs sampler for crossed random effects models
Link: https://arxiv.org/abs/2109.02849

Authors: Swarnadip Ghosh, Chenyang Zhong
Affiliations: Stanford University
Abstract: In this paper, we analyze the convergence rate of a collapsed Gibbs sampler for crossed random effects models. Our results apply to a substantially larger range of models than previous works, including models that incorporate missingness mechanism and unbalanced level data. The theoretical tools involved in our analysis include a connection between relaxation time and autoregression matrix, concentration inequalities, and random matrix theory.

【13】 Besov Function Approximation and Binary Classification on Low-Dimensional Manifolds Using Convolutional Residual Networks
Link: https://arxiv.org/abs/2109.02832

Authors: Hao Liu, Minshuo Chen, Tuo Zhao, Wenjing Liao
Affiliations: Hao Liu is with the Department of Mathematics at Hong Kong Baptist University; Wenjing Liao is with the School of Mathematics at Georgia Tech; Minshuo Chen and Tuo Zhao are with the ISyE department at Georgia Tech
Abstract: Most existing statistical theories on deep neural networks have sample complexities cursed by the data dimension and therefore cannot well explain the empirical success of deep learning on high-dimensional data. To bridge this gap, we propose to exploit low-dimensional geometric structures of the real world data sets. We establish theoretical guarantees of convolutional residual networks (ConvResNet) in terms of function approximation and statistical estimation for binary classification. Specifically, given the data lying on a $d$-dimensional manifold isometrically embedded in $\mathbb{R}^D$, we prove that if the network architecture is properly chosen, ConvResNets can (1) approximate Besov functions on manifolds with arbitrary accuracy, and (2) learn a classifier by minimizing the empirical logistic risk, which gives an excess risk in the order of $n^{-\frac{s}{2s+2(s\vee d)}}$, where $s$ is a smoothness parameter. This implies that the sample complexity depends on the intrinsic dimension $d$, instead of the data dimension $D$. Our results demonstrate that ConvResNets are adaptive to low-dimensional structures of data sets.

【14】 Exact and Asymptotic Tests for Sufficient Followup in Censored Survival Data
Link: https://arxiv.org/abs/2109.02817

Authors: Ross Maller, Sidney Resnick, Soudabeh Shemehsavar
Abstract: The existence of immune or cured individuals in a population, and whether there is sufficient followup in a sample of censored observations on their lifetimes to be confident of their presence, are questions of major importance in medical survival analysis. So far only a few candidates have been put forward as possible test statistics for the existence of sufficient followup in a sample. Here we investigate one such statistic and give a detailed analysis, obtaining exact finite-sample as well as asymptotic distributions for it, and use these to calculate the power of the test as a function of the followup in the sample.

【15】 Finite Element Representations of Gaussian Processes: Balancing Numerical and Statistical Accuracy
Link: https://arxiv.org/abs/2109.02777

Authors: Daniel Sanz-Alonso, Ruiyi Yang
Affiliations: University of Chicago
Abstract: The stochastic partial differential equation approach to Gaussian processes (GPs) represents Matérn GP priors in terms of $n$ finite element basis functions and Gaussian coefficients with sparse precision matrix. Such representations enhance the scalability of GP regression and classification to datasets of large size $N$ by setting $n\approx N$ and exploiting sparsity. In this paper we reconsider the standard choice $n \approx N$ through an analysis of the estimation performance. Our theory implies that, under certain smoothness assumptions, one can reduce the computation and memory cost without hindering the estimation accuracy by setting $n \ll N$ in the large $N$ asymptotics. Numerical experiments illustrate the applicability of our theory and the effect of the prior lengthscale in the pre-asymptotic regime.

【16】 Ignorable and non-ignorable missing data in hidden Markov models
Link: https://arxiv.org/abs/2109.02770

Authors: Maarten Speekenbrink, Ingmar Visser
Affiliations: Department of Experimental Psychology, University College London, and Department of Developmental Psychology, University of Amsterdam
Comments: 29 pages, 4 figures, 8 tables
Abstract: We consider missing data in the context of hidden Markov models with a focus on situations where data is missing not at random (MNAR) and missingness depends on the identity of the hidden states. In simulations, we show that including a submodel for state-dependent missingness reduces bias when data is MNAR and state-dependent, whilst not reducing accuracy when data is missing at random (MAR). When missingness depends on time but not the hidden states, a model which only allows for state-dependent missingness is biased, whilst a model that allows for both state- and time-dependent missingness is not. Overall, these results show that modelling missingness as state-dependent, and including other relevant covariates, is a useful strategy in applications of hidden Markov models to time-series with missing data. We conclude with an application of the state- and time-dependent MNAR hidden Markov model to a real dataset, involving severity of schizophrenic symptoms in a clinical trial.
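A hedged sketch of the ignorable (MAR) case: in the HMM forward recursion a missing observation simply contributes no emission term (equivalently, a factor of 1). The state-dependent MNAR case the paper studies would instead multiply in a state-dependent probability of being missing; the function and toy parameters below are illustrative.

```python
import math

# Scaled forward algorithm with missing observations (None).  Under MAR,
# a missing y_t is handled by skipping the emission update, so only the
# transition dynamics constrain the state at that time.

def forward_loglik(obs, init, trans, emit):
    n_states = len(init)
    alpha = list(init)
    loglik = 0.0
    for t, y in enumerate(obs):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n_states))
                     for j in range(n_states)]
        if y is not None:                      # MAR: skip emission if missing
            alpha = [alpha[j] * emit[j][y] for j in range(n_states)]
        norm = sum(alpha)
        loglik += math.log(norm)
        alpha = [a / norm for a in alpha]      # rescale for stability
    return loglik

init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [{"a": 0.8, "b": 0.2}, {"a": 0.3, "b": 0.7}]
ll = forward_loglik(["a", None, "b"], init, trans, emit)
```

A useful sanity check: an all-missing sequence has log-likelihood exactly 0 under MAR, since no observation constrains the states, which is exactly the invariance a state-dependent MNAR submodel breaks.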

【17】 Screening the Discrepancy Function of a Computer Model
Link: https://arxiv.org/abs/2109.02726

Authors: Pierre Barbillon, Anabel Forte, Rui Paulo
Abstract: Screening traditionally refers to the problem of detecting active inputs in the computer model. In this paper, we develop methodology that applies to screening, but the main focus is on detecting active inputs not in the computer model itself but rather on the discrepancy function that is introduced to account for model inadequacy when linking the computer model with field observations. We contend this is an important problem as it informs the modeler which are the inputs that are potentially being mishandled in the model, but also along which directions it may be less recommendable to use the model for prediction. The methodology is Bayesian and is inspired by the continuous spike and slab prior popularized by the literature on Bayesian variable selection. In our approach, and in contrast with previous proposals, a single MCMC sample from the full model allows us to compute the posterior probabilities of all the competing models, resulting in a methodology that is computationally very fast. The approach hinges on the ability to obtain posterior inclusion probabilities of the inputs, which are very intuitive and easy to interpret quantities, as the basis for selecting active inputs. For that reason, we name the methodology PIPS -- posterior inclusion probability screening.

【18】 Bayesian data selection
Link: https://arxiv.org/abs/2109.02712

Authors: Eli N. Weinstein, Jeffrey W. Miller
Abstract: Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic - such as a subset of variables - that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, statistically and computationally. We propose a novel score for performing both data selection and model selection, the "Stein volume criterion", that takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. The Stein volume criterion does not require one to fit or even specify a nonparametric background model, making it straightforward to compute - in many cases it is as simple as fitting the parametric model of interest with an alternative objective function. We prove that the Stein volume criterion is consistent for both data selection and model selection, and we establish consistency and asymptotic normality (Bernstein-von Mises) of the corresponding generalized posterior on parameters. We validate our method in simulation and apply it to the analysis of single-cell RNA sequencing datasets using probabilistic principal components analysis and a spin glass model of gene regulation.
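A hedged sketch of the ingredient the Stein volume criterion swaps in for the KL divergence: a generic textbook V-statistic estimate of the squared kernelized Stein discrepancy in 1D with an RBF kernel, not the paper's full criterion (the target, kernel, bandwidth, and function name are illustrative assumptions).

```python
import math

# Squared KSD between a sample and a target p, using only the target's score
# function s(x) = d/dx log p(x) -- here the standard normal, s(x) = -x.
# Stein kernel: u_p(x,y) = s(x)s(y)k + s(x) dk/dy + s(y) dk/dx + d2k/dxdy.

def ksd_squared(xs, h=1.0):
    n = len(xs)
    total = 0.0
    for x in xs:
        for y in xs:
            d = x - y
            k = math.exp(-d * d / (2 * h * h))       # RBF kernel
            dk_dx = -(d / (h * h)) * k
            dk_dy = (d / (h * h)) * k
            d2k = (1 / (h * h) - d * d / h ** 4) * k
            sx, sy = -x, -y                          # score of N(0, 1)
            total += sx * sy * k + sx * dk_dy + sy * dk_dx + d2k
    return total / (n * n)

good = [-1.5, -0.5, 0.0, 0.5, 1.5]       # roughly N(0, 1)-like points
bad = [x + 3.0 for x in good]            # same shape, shifted away from 0
```

Points far from the target's mass yield a larger discrepancy, and crucially the computation never touches a background model, which is the practical advantage the paper emphasizes.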

【19】 Crash Report Data Analysis for Creating Scenario-Wise, Spatio-Temporal Attention Guidance to Support Computer Vision-based Perception of Fatal Crash Risks
Link: https://arxiv.org/abs/2109.02710

Authors: Yu Li, Muhammad Monjurul Karim, Ruwen Qin, Zeyi Sun
Abstract: Reducing traffic fatalities and serious injuries is a top priority of the US Department of Transportation. The computer vision (CV)-based crash anticipation in the near-crash phase is receiving growing attention. The ability to perceive fatal crash risks earlier is also critical because it will improve the reliability of crash anticipation. Yet, annotated image data for training a reliable AI model for the early visual perception of crash risks are not abundant. The Fatality Analysis Reporting System contains big data of fatal crashes. It is a reliable data source for learning the relationship between driving scene characteristics and fatal crashes to compensate for the limitation of CV. Therefore, this paper develops a data analytics model, named scenario-wise, spatio-temporal attention guidance, from fatal crash report data, which can estimate the relevance of detected objects to fatal crashes from their environment and context information. First, the paper identifies five sparse variables that allow for decomposing the 5-year fatal crash dataset to develop scenario-wise attention guidance. Then, exploratory analysis of location- and time-related variables of the crash report data suggests reducing fatal crashes to spatially defined groups. The group's temporal pattern is an indicator of the similarity of fatal crashes in the group. Hierarchical clustering and K-means clustering merge the spatially defined groups into six clusters according to the similarity of their temporal patterns. After that, association rule mining discovers the statistical relationship between the temporal information of driving scenes with crash features, for each cluster. The paper shows how the developed attention guidance supports the design and implementation of a preliminary CV model that can identify objects of a possibility to involve in fatal crashes from their environment and context information.

【20】 Estimating nuisance parameters often reduces the variance (with consistent variance estimation) 标题:估计有害参数通常会降低方差(具有一致的方差估计) 链接:https://arxiv.org/abs/2109.02690

Authors: Judith J. Lok
Affiliations: Department of Mathematics and Statistics, Boston University
Note: 24 pages, and supplementary material
Abstract: In many applications, to estimate a parameter or quantity of interest psi, a finite-dimensional nuisance parameter theta is estimated first. For example, many estimators in causal inference depend on the propensity score: the probability of (possibly time-dependent) treatment given the past. theta is often estimated in a first step, which can affect the variance of the estimator for psi. theta is often estimated by maximum (partial) likelihood. Inverse Probability Weighting, Marginal Structural Models and Structural Nested Models are well-known causal inference examples, where one often posits a (pooled) logistic regression model for the treatment (initiation) and/or censoring probabilities, and estimates these with standard software, so by maximum partial likelihood. Inverse Probability Weighting, Marginal Structural Models and Structural Nested Models have something else in common: they can all be shown to be based on unbiased estimating equations. This paper has four main results for estimators psi-hat based on unbiased estimating equations including theta. First, it shows that the true limiting variance of psi-hat is smaller or remains the same when theta is estimated by solving (partial) score equations, compared to if theta were known and plugged in. Second, it shows that if estimating theta using (partial) score equations is ignored, the resulting sandwich estimator for the variance of psi-hat is conservative. Third, it provides a variance correction. Fourth, it shows that if the estimator psi-hat with the true theta plugged in is efficient, the true limiting variance of psi-hat does not depend on whether or not theta is estimated, and the sandwich estimator for the variance of psi-hat ignoring estimation of theta is consistent. These findings hold in semiparametric and parametric settings where the parameters of interest psi are estimated based on unbiased estimating equations.
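The sandwich (robust) variance estimator that the paper's second and fourth results concern can be shown in a minimal, hypothetical one-parameter sketch: a single estimating equation g(psi; y) = y − psi with no nuisance parameter, so the sandwich A⁻¹BA⁻ᵀ/n reduces to the usual variance of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=5000)

# Unbiased estimating equation g(psi; y) = y - psi  =>  psi-hat solves sum_i g = 0.
psi_hat = y.mean()

# Sandwich variance: Var(psi-hat) ~= A^{-1} B A^{-T} / n with
# A = E[-dg/dpsi] = 1 and B = E[g^2], both estimated empirically.
g = y - psi_hat
A = 1.0
B = np.mean(g ** 2)
sandwich_var = B / (A ** 2 * y.size)

print(psi_hat, sandwich_var)
```

In the settings the abstract discusses, g would be stacked with the (partial) score equations for theta, and A and B become matrices.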

【21】 Characterizing interdisciplinarity in drug research: a translational science perspective Link: https://arxiv.org/abs/2109.03038

Authors: Xin Li, Xuli Tang
Affiliations: School of Medicine and Health Management, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China; School of Information, The University of Texas, Austin, Texas, U.S.A.
Abstract: Despite the significant advances in life science, it still takes decades to translate a basic drug discovery into a cure for human disease. To accelerate the process from bench to bedside, interdisciplinary research (especially research involving both basic research and clinical research) has been strongly recommended by many previous studies. However, the patterns and the roles of the interdisciplinary characteristics in drug research have not been deeply examined in extant studies. The purpose of this study was to characterize interdisciplinary characteristics in drug research from the perspective of translational science, and to examine the role of different kinds of interdisciplinary characteristics in translational research for drugs.

【22】 Semiparametric Bayesian Networks Link: https://arxiv.org/abs/2109.03008

Authors: David Atienza, Concha Bielza, Pedro Larrañaga
Affiliations: Universidad Politécnica de Madrid, Departamento de Inteligencia Artificial, Boadilla del Monte, Spain
Note: 44 pages, 13 figures, 4 tables, submitted to Information Sciences
Abstract: We introduce semiparametric Bayesian networks that combine parametric and nonparametric conditional probability distributions. Their aim is to incorporate the advantages of both components: the bounded complexity of parametric models and the flexibility of nonparametric ones. We demonstrate that semiparametric Bayesian networks generalize two well-known types of Bayesian networks: Gaussian Bayesian networks and kernel density estimation Bayesian networks. For this purpose, we consider two different conditional probability distributions required in a semiparametric Bayesian network. In addition, we present modifications of two well-known algorithms (greedy hill-climbing and PC) to learn the structure of a semiparametric Bayesian network from data. To realize this, we employ a score function based on cross-validation. In addition, using a validation dataset, we apply an early-stopping criterion to avoid overfitting. To evaluate the applicability of the proposed algorithm, we conduct an exhaustive experiment on synthetic data sampled by mixing linear and nonlinear functions, multivariate normal data sampled from Gaussian Bayesian networks, real data from the UCI repository, and bearings degradation data. As a result of this experiment, we conclude that the proposed algorithm accurately learns the combination of parametric and nonparametric components, while achieving a performance comparable with those provided by state-of-the-art methods.
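The parametric-versus-nonparametric trade-off behind these networks can be illustrated with a held-out log-likelihood score of the kind a cross-validated structure search would use. The bimodal data, the fixed KDE bandwidth, and the single train/test split below are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def gauss_loglik(train, test):
    """Held-out mean log-likelihood of a fitted Gaussian (parametric node)."""
    mu, sd = train.mean(), train.std()
    return np.mean(-0.5 * ((test - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi)))

def kde_loglik(train, test, h=0.3):
    """Held-out mean log-likelihood of a Gaussian KDE (nonparametric node)."""
    z = (test[:, None] - train[None, :]) / h
    dens = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))
    return np.mean(np.log(dens))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])
rng.shuffle(x)
train, test = x[:800], x[800:]

# On bimodal data the KDE node should win the held-out (CV-style) score.
print(gauss_loglik(train, test), kde_loglik(train, test))
```

A structure learner comparing these scores per node would pick the KDE here, and the Gaussian when the data are close to normal — the complexity/flexibility trade-off the abstract describes.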

【23】 Mutation frequency time series reveal complex mixtures of clones in the world-wide SARS-CoV-2 viral population Link: https://arxiv.org/abs/2109.02962

Authors: Hong-Li Zeng, Yue Liu, Vito Dichio, Kaisa Thorell, Rickard Nordén, Erik Aurell
Affiliations: School of Science, Nanjing University of Posts and Telecommunications, Nanjing, China; Inria Paris, Aramis Project Team, Paris, France; Institut du Cerveau, ICM, Inserm, CNRS, Sorbonne Université
Abstract: We compute the allele frequencies of the alpha (B.1.1.7), beta (B.1.351) and delta (B.1.617.2) variants of SARS-CoV-2 from almost two million genome sequences on the GISAID repository. We find that the frequencies of a majority of the defining mutations in alpha rose towards the end of 2020 but drifted apart during spring 2021, a similar pattern being followed by delta during summer of 2021. For beta we find a more complex scenario with frequencies of some mutations rising and some remaining close to zero. Our results indicate that what is generally reported as a single variant is in fact a collection of variants with different genetic characteristics. For all three variants we further find some alleles with clearly deviating time series.
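Computing allele (mutation) frequencies per time bin is, at its core, a counting exercise. Below is a toy sketch with hypothetical records; a real pipeline parses millions of aligned GISAID genomes and bins them by collection date.

```python
from collections import defaultdict

# Hypothetical records: (collection month, mutations observed in that genome).
records = [
    ("2020-11", {"N501Y"}), ("2020-11", set()),
    ("2020-12", {"N501Y", "P681H"}), ("2020-12", {"N501Y"}),
    ("2021-01", {"N501Y", "P681H"}), ("2021-01", {"N501Y", "P681H"}),
]

counts = defaultdict(lambda: defaultdict(int))  # mutation -> month -> count
totals = defaultdict(int)                       # month -> number of genomes
for month, muts in records:
    totals[month] += 1
    for m in muts:
        counts[m][month] += 1

# Allele-frequency time series: share of genomes per month carrying each mutation.
freq = {m: {month: counts[m][month] / totals[month] for month in sorted(totals)}
        for m in counts}
print(freq["N501Y"])  # -> {'2020-11': 0.5, '2020-12': 1.0, '2021-01': 1.0}
```

Frequencies of different defining mutations rising together and then drifting apart, as in the abstract, show up as diverging rows of this `freq` table.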

【24】 Combining data assimilation and machine learning to estimate parameters of a convective-scale model Link: https://arxiv.org/abs/2109.02953

Authors: Stefanie Legler, Tijana Janjić
Affiliations: Meteorological Institute, Ludwig-Maximilians-Universität München
Abstract: Errors in the representation of clouds in convection-permitting numerical weather prediction models can be introduced by different sources. These can be the forcing and boundary conditions, the representation of orography, the accuracy of the numerical schemes determining the evolution of humidity and temperature, but large contributions are due to the parametrization of microphysics and the parametrization of processes in the surface and boundary layers. These schemes typically contain several tunable parameters that are either not physical or only crudely known, leading to model errors. Traditionally, the numerical values of these model parameters are chosen by manual model tuning. More objectively, they can be estimated from observations by the augmented state approach during the data assimilation. Alternatively, in this work, we look at the problem of parameter estimation through an artificial intelligence lens by training two types of artificial neural networks (ANNs) to estimate several parameters of the one-dimensional modified shallow-water model as a function of the observations or analysis of the atmospheric state. Through perfect model experiments, we show that Bayesian neural networks (BNNs) and Bayesian approximations of point estimate neural networks (NNs) are able to estimate model parameters and their relevant statistics. The estimation of parameters combined with data assimilation for the state decreases the initial state errors even when assimilating sparse and noisy observations. The sensitivity to the number of ensemble members, observation coverage, and neural network size is shown. Additionally, we use the method of layer-wise relevance propagation to gain insight into how the ANNs are learning and discover that they naturally select only a few gridpoints that are subject to strong winds and rain to make their predictions of chosen parameters.

【25】 Using Satellite Imagery and Machine Learning to Estimate the Livelihood Impact of Electricity Access Link: https://arxiv.org/abs/2109.02890

Authors: Nathan Ratledge, Gabe Cadamuro, Brandon de la Cuesta, Matthieu Stigler, Marshall Burke
Affiliations: Emmett Interdisciplinary Program in Environment and Resources, Stanford University, Palo Alto, CA, USA; Atlas AI, Palo Alto, CA, USA; King Center for International Development, Stanford University, Palo Alto, CA, USA
Abstract: In many regions of the world, sparse data on key economic outcomes inhibits the development, targeting, and evaluation of public policy. We demonstrate how advancements in satellite imagery and machine learning can help ameliorate these data and inference challenges. In the context of an expansion of the electrical grid across Uganda, we show how a combination of satellite imagery and computer vision can be used to develop local-level livelihood measurements appropriate for inferring the causal impact of electricity access on livelihoods. We then show how ML-based inference techniques deliver more reliable estimates of the causal impact of electrification than traditional alternatives when applied to these data. We estimate that grid access improves village-level asset wealth in rural Uganda by 0.17 standard deviations, more than doubling the growth rate over our study period relative to untreated areas. Our results provide country-scale evidence on the impact of a key infrastructure investment, and provide a low-cost, generalizable approach to future policy evaluation in data sparse environments.

【26】 The Second Generalization of the Hausdorff Dimension Theorem for Random Fractals Link: https://arxiv.org/abs/2109.02739

Authors: Mohsen Soltanifar
Affiliations: Dalla Lana School of Public Health, University of Toronto
Abstract: In this paper, we present a second partial solution for the problem of cardinality calculation of the set of fractals for its subcategory of the random virtual ones. Consistent with the deterministic case, we show that for the given quantities of Hausdorff dimension and Lebesgue measure, there are aleph-two virtual random fractals with almost surely Hausdorff dimension of a bivariate function of them and the expected Lebesgue measure equal to the latter. The associated results for three other fractal dimensions are similar to the case given for the Hausdorff dimension. The problem remains unsolved for the case of non-Euclidean abstract fractal spaces.

【27】 Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns Link: https://arxiv.org/abs/2109.02644

Authors: Cosme Louart, Romain Couillet
Affiliations: GIPSA-lab, rue des Mathématiques, St Martin d'Hères; LIG-lab, avenue Centrale, St Martin d'Hères
Note: Main text 37 pages, appendix 3 pages, references 1 page, 2 figures
Abstract: Given a random matrix $X= (x_1,\ldots, x_n)\in \mathcal M_{p,n}$ with independent columns and satisfying concentration of measure hypotheses and a parameter $z$ whose distance to the spectrum of $\frac{1}{n} XX^T$ should not depend on $p,n$, it was previously shown that the functionals $\text{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T- zI_p)^{-1}$ and $A\in \mathcal M_{p}$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt n)$. Here, we show that $\|\mathbb E[R(z)] - \tilde R(z)\|_F \leq O(1/\sqrt n)$, where $\tilde R(z)$ is a deterministic matrix depending only on $z$ and on the means and covariances of the column vectors $x_1,\ldots, x_n$ (that do not have to be identically distributed). This estimation is key to providing accurate fluctuation rates of functionals of $X$ of interest (mostly related to its spectral properties) and is proved thanks to the introduction of a semi-metric $d_s$ defined on the set $\mathcal D_n(\mathbb H)$ of diagonal matrices with complex entries and positive imaginary part and satisfying, for all $D,D' \in \mathcal D_n(\mathbb H)$: $d_s(D,D') = \max_{i\in[n]} |D_i - D_i'|/ (\Im(D_i) \Im(D_i'))^{1/2}$. Possibly most importantly, the underlying concentration of measure assumption on the columns of $X$ finds an extremely natural ground for application in modern statistical machine learning algorithms where non-linear Lipschitz mappings and high number of classes form the base ingredients.
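The $O(\|A\|_*/\sqrt n)$ fluctuation of linear statistics of the resolvent can be checked numerically. The sketch below uses i.i.d. Gaussian columns, $A = I_p/p$ (so $\|A\|_* = 1$), and $z = -1$ on the real axis away from the spectrum — all simplifying assumptions relative to the paper's concentration-of-measure setting.

```python
import numpy as np

def norm_trace_resolvent(p, n, z, rng):
    """tr(A R(z)) with A = I_p / p, for R(z) = (XX^T/n - z I_p)^{-1}."""
    X = rng.standard_normal((p, n))
    R = np.linalg.inv(X @ X.T / n - z * np.eye(p))
    return np.trace(R) / p

rng = np.random.default_rng(0)
z = -1.0  # real z, safely away from the spectrum of XX^T/n
stds = {}
for n in (100, 400):
    p = n // 2  # keep the aspect ratio p/n fixed while n grows
    vals = [norm_trace_resolvent(p, n, z, rng) for _ in range(200)]
    stds[n] = float(np.std(vals))

print(stds)  # the standard deviation should shrink roughly like 1/sqrt(n)
```

Quadrupling $n$ at fixed $p/n$ should roughly halve the standard deviation, consistent with the $1/\sqrt n$ rate quoted in the abstract.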

【28】 On the Out-of-distribution Generalization of Probabilistic Image Modelling Link: https://arxiv.org/abs/2109.02639

Authors: Mingtian Zhang, Andi Zhang, Steven McDonagh
Affiliations: AI Center, University College London; Computer Science, University of Cambridge; Huawei Noah's Ark Lab
Abstract: Out-of-distribution (OOD) detection and lossless compression constitute two problems that can be solved by the training of probabilistic models on a first dataset with subsequent likelihood evaluation on a second dataset, where data distributions differ. By defining the generalization of probabilistic models in terms of likelihood we show that, in the case of image models, the OOD generalization ability is dominated by local features. This motivates our proposal of a Local Autoregressive model that exclusively models local image features towards improving OOD performance. We apply the proposed model to OOD detection tasks and achieve state-of-the-art unsupervised OOD detection performance without the introduction of additional data. Additionally, we employ our model to build a new lossless image compressor: NeLLoC (Neural Local Lossless Compressor) and report state-of-the-art compression rates and model size.
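The underlying recipe — fit a density model on in-distribution data, then flag low-likelihood inputs as OOD — can be sketched in one dimension. The single-Gaussian model, the 5th-percentile threshold rule, and the synthetic data below are illustrative stand-ins for the paper's (local) autoregressive image models.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 2000)    # in-distribution training data
test_in = rng.normal(0.0, 1.0, 500)   # in-distribution test data
test_out = rng.normal(4.0, 1.0, 500)  # out-of-distribution test data

# "Probabilistic model": a single Gaussian fitted to the training data.
mu, sd = train.mean(), train.std()

def loglik(x):
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

# Flag as OOD anything whose log-likelihood falls below a threshold
# calibrated on the training data (here its 5th percentile).
thresh = np.quantile(loglik(train), 0.05)
ood_rate_in = np.mean(loglik(test_in) < thresh)
ood_rate_out = np.mean(loglik(test_out) < thresh)
print(ood_rate_in, ood_rate_out)
```

By construction about 5% of in-distribution inputs are flagged, while nearly all shifted inputs fall below the threshold; the paper's contribution is showing which image features (local ones) drive the analogous likelihoods for deep image models.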

This article is shared from the WeChat official account arXiv每日学术速递 and was originally published on 2021-09-08, as part of the Tencent Cloud self-media sharing program. For infringement concerns, contact cloudcommunity@tencent.com.

