
Statistics arXiv Daily Digest [10.18]

Author: 公众号-arXiv每日学术速递 (WeChat official account "arXiv Daily Academic Digest")
Published 2021-10-21 16:02:22

Update! The H5 page now supports collapsible abstracts for a better reading experience. Click "Read the original" to visit arxivdaily.com, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, bookmarking, and more!

stat (Statistics): 36 papers in total

【1】 Choice functions based multi-objective Bayesian optimisation
Link: https://arxiv.org/abs/2110.08217

Authors: Alessio Benavoli, Dario Azzimonti, Dario Piga
Affiliations: School of Computer Science and Statistics, Trinity College Dublin, Ireland; Dalle Molle Institute for Artificial Intelligence Research (IDSIA), USI-SUPSI, Manno, Switzerland
Abstract: In this work we introduce a new framework for multi-objective Bayesian optimisation where the multi-objective functions can only be accessed via choice judgements, such as "I pick options A,B,C among this set of five options A,B,C,D,E". The fact that the option D is rejected means that there is at least one option among the selected ones A,B,C that I strictly prefer over D (but I do not have to specify which one). We assume that there is a latent vector function f for some dimension $n_e$ which embeds the options into the real vector space of dimension n, so that the choice set can be represented through a Pareto set of non-dominated options. By placing a Gaussian process prior on f and deriving a novel likelihood model for choice data, we propose a Bayesian framework for choice function learning. We then apply this surrogate model to solve a novel multi-objective Bayesian optimisation from choice data problem.

【2】 Fast Online Changepoint Detection via Functional Pruning CUSUM statistics
Link: https://arxiv.org/abs/2110.08205

Authors: Gaetano Romano, Idris Eckley, Paul Fearnhead, Guillem Rigaill
Affiliations: Department of Mathematics and Statistics, Lancaster University, Lancaster, United Kingdom; Université Paris-Saclay, CNRS, INRAE, Univ Evry, Institute of Plant Sciences Paris-Saclay (IPS), Orsay, France
Abstract: Many modern applications of online changepoint detection require the ability to process high-frequency observations, sometimes with limited available computational resources. Online algorithms for detecting a change in mean often involve using a moving window, or specifying the expected size of change. Such choices affect which changes the algorithms have most power to detect. We introduce an algorithm, Functional Online CuSUM (FOCuS), which is equivalent to running these earlier methods simultaneously for all sizes of window, or all possible values for the size of change. Our theoretical results give tight bounds on the expected computational cost per iteration of FOCuS, with this being logarithmic in the number of observations. We show how FOCuS can be applied to a number of different change in mean scenarios, and demonstrate its practical utility through its state-of-the-art performance at detecting anomalous behaviour in computer server data.
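
FOCuS is stated to be equivalent to running mean-change CUSUM detectors for every possible change size at once. As a point of reference, here is a minimal sketch of the naive baseline it collapses: a bank of one-sided CUSUM recursions over a fixed grid of assumed change sizes. The grid, threshold, and simulated data are illustrative assumptions, not from the paper.

```python
import numpy as np

def cusum_bank(xs, deltas, threshold):
    """Run one-sided CUSUM detectors, one per assumed change size delta.

    For a mean change of size delta > 0, the CUSUM recursion is
    S_t = max(0, S_{t-1} + delta * (x_t - delta / 2)).
    Returns the first time any detector exceeds the threshold, else None.
    """
    S = np.zeros(len(deltas))
    for t, x in enumerate(xs):
        S = np.maximum(0.0, S + deltas * (x - deltas / 2.0))
        if S.max() > threshold:
            return t
    return None

rng = np.random.default_rng(0)
# Pre-change mean 0 for 200 steps, post-change mean 2 for 100 steps.
xs = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 100)])
detection_time = cusum_bank(xs, deltas=np.array([0.25, 0.5, 1.0, 2.0]), threshold=20.0)
```

FOCuS's contribution is doing this for the continuum of all change sizes at a per-iteration cost logarithmic in the number of observations, rather than paying for each grid point.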

【3】 SAFFRON and LORD Ensure Online Control of the False Discovery Rate Under Positive Dependence
Link: https://arxiv.org/abs/2110.08161

Authors: Aaron Fisher
Abstract: Online testing procedures assume that hypotheses are observed in sequence, and allow the significance thresholds for upcoming tests to depend on the test statistics observed so far. Some of the most popular online methods include alpha investing, LORD++ (hereafter, LORD), and SAFFRON. These three methods have been shown to provide online control of the "modified" false discovery rate (mFDR). However, to our knowledge, they have only been shown to control the traditional false discovery rate (FDR) under an independence condition on the test statistics. Our work bolsters these results by showing that SAFFRON and LORD additionally ensure online control of the FDR under nonnegative dependence. Because alpha investing can be recovered as a special case of the SAFFRON framework, the same result applies to this method as well. Our result also allows for certain forms of adaptive stopping times, for example, stopping after a certain number of rejections have been observed.
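
To make the "online thresholds depending on past rejections" idea concrete, here is a toy sketch in the spirit of LORD: test levels are drawn from a summable spending sequence, and extra level is earned back after each rejection. This simplified rule is for intuition only; it is not the exact LORD++ procedure, and the constants and p-values are illustrative assumptions.

```python
import numpy as np

def lord_toy(pvals, alpha=0.05, w0=0.025):
    """Toy online multiple-testing rule in the spirit of LORD.

    Level at time t = gamma[t] * w0, plus wealth earned at past
    rejections redistributed through the same decaying sequence.
    Simplified sketch, not the exact LORD++ rule or its guarantees.
    """
    n = len(pvals)
    gamma = 1.0 / np.arange(1, n + 1) ** 1.6   # summable spending sequence
    gamma /= gamma.sum()
    rejections = []
    for t in range(n):
        level = gamma[t] * w0
        for k, tau in enumerate(rejections):
            payout = (alpha - w0) if k == 0 else alpha
            level += gamma[t - tau - 1] * payout
        if pvals[t] <= level:
            rejections.append(t)
    return rejections

pvals = np.array([1e-4, 1e-4, 0.5, 1e-4, 0.9])
rejected = lord_toy(pvals)
```

The paper's contribution concerns when such rules control the FDR itself (not just the mFDR) beyond the independent case.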

【4】 Quickest Inference of Network Cascades with Noisy Information
Link: https://arxiv.org/abs/2110.08115

Authors: Anirudh Sridhar, H. Vincent Poor
Comments: 47 pages, 3 figures
Abstract: We study the problem of estimating the source of a network cascade given a time series of noisy information about the spread. Initially, there is a single vertex affected by the cascade (the source) and the cascade spreads in discrete time steps across the network. The cascade evolution is hidden, but one can observe a time series of noisy signals from each vertex. The time series of a vertex is assumed to be a sequence of i.i.d. samples from a pre-change distribution $Q_0$ before the cascade affects the vertex, and the time series is a sequence of i.i.d. samples from a post-change distribution $Q_1$ once the cascade has affected the vertex. Given the time series of noisy signals, which can be viewed as a noisy measurement of the cascade evolution, we aim to devise a procedure to reliably estimate the cascade source as fast as possible. We investigate Bayesian and minimax formulations of the source estimation problem, and derive near-optimal estimators for simple cascade dynamics and network topologies. In the Bayesian setting, an estimator which observes samples until the error of the Bayes-optimal estimator falls below a threshold achieves optimal performance. In the minimax setting, optimal performance is achieved by designing a novel multi-hypothesis sequential probability ratio test (MSPRT). We find that these optimal estimators require $\log \log n / \log (k - 1)$ observations of the noisy time series when the network topology is a $k$-regular tree, and $(\log n)^{\frac{1}{\ell + 1}}$ observations are required for $\ell$-dimensional lattices. Finally, we discuss how our methods may be extended to cascades on arbitrary graphs.

【5】 An active learning approach for improving the performance of equilibrium based chemical simulations
Link: https://arxiv.org/abs/2110.08111

Authors: Mary Savino, Céline Lévy-Leduc, Marc Leconte, Benoit Cochepin
Comments: 22 pages, 17 figures
Abstract: In this paper, we propose a novel sequential data-driven method for dealing with equilibrium based chemical simulations, which can be seen as a specific machine learning approach called active learning. The underlying idea of our approach is to consider the function to estimate as a sample of a Gaussian process which allows us to compute the global uncertainty on the function estimation. Thanks to this estimation and with almost no parameter to tune, the proposed method sequentially chooses the most relevant input data at which the function to estimate has to be evaluated to build a surrogate model. Hence, the number of evaluations of the function to estimate is dramatically limited. Our active learning method is validated through numerical experiments and applied to a complex chemical system commonly used in geoscience.
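
The sequential design idea described here (evaluate the expensive simulator next wherever the Gaussian-process surrogate is most uncertain) can be sketched with plain NumPy. The kernel, grid, and greedy max-variance rule below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf(a, b, ell=0.3):
    """Squared-exponential kernel with unit signal variance."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def posterior_var(x_train, x_grid, ell=0.3, jitter=1e-8):
    """Posterior variance of a zero-mean GP at the grid points."""
    K = rbf(x_train, x_train, ell) + jitter * np.eye(len(x_train))
    Ks = rbf(x_grid, x_train, ell)
    return 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))

x_grid = np.linspace(0.0, 1.0, 101)
design = [0.0, 1.0]                  # initial evaluations of the simulator
for _ in range(8):                   # greedily reduce global uncertainty
    var = posterior_var(np.array(design), x_grid)
    design.append(x_grid[np.argmax(var)])
final_var = posterior_var(np.array(design), x_grid).max()
```

After a handful of greedy picks the maximum posterior variance over the grid is small, so further expensive evaluations add little: the motivation for active learning when each function call is a costly chemical-equilibrium solve.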

【6】 Testing for long-range dependence in non-stationary time series time-varying regression
Link: https://arxiv.org/abs/2110.08089

Authors: Lujia Bai, Weichi Wu
Affiliations: Center for Statistical Science, Department of Industrial Engineering, Tsinghua University
Abstract: We consider the problem of testing for long-range dependence for time-varying coefficient regression models. The covariates and errors are assumed to be locally stationary, which allows complex temporal dynamics and heteroscedasticity. We develop KPSS, R/S, V/S, and K/S-type statistics based on the nonparametric residuals, and propose bootstrap approaches equipped with a difference-based long-run covariance matrix estimator for practical implementation. Under the null hypothesis, local alternatives, and fixed alternatives, we derive the limiting distributions of the test statistics, establish the uniform consistency of the difference-based long-run covariance estimator, and justify the bootstrap algorithms theoretically. In particular, the exact local asymptotic power of our testing procedure enjoys the order $O( \log^{-1} n)$, the same as that of the classical KPSS test for long memory in strictly stationary series without covariates. We demonstrate the effectiveness of our tests by extensive simulation studies. The proposed tests are applied to a COVID-19 dataset in favor of long-range dependence in the cumulative confirmed series of COVID-19 in several countries, and to the Hong Kong circulatory and respiratory dataset, identifying a new type of 'spurious long memory'.

【7】 Causal Identification with Additive Noise Models: Quantifying the Effect of Noise
Link: https://arxiv.org/abs/2110.08087

Authors: Benjamin Kap, Marharyta Aleksandrova, Thomas Engel
Affiliations: University of Luxembourg, avenue de l'Université, Esch-sur-Alzette, Luxembourg
Comments: Presented at 10èmes Journées Francophones sur les Réseaux Bayésiens et les Modèles Graphiques Probabilistes (JFRB-2021), this https URL
Abstract: In recent years, a lot of research has been conducted within the area of causal inference and causal learning. Many methods have been developed to identify the cause-effect pairs in models and have been successfully applied to observational real-world data to determine the direction of causal relationships. Yet in bivariate situations, causal discovery problems remain challenging. One class of such methods, that also allows tackling the bivariate case, is based on Additive Noise Models (ANMs). Unfortunately, one aspect of these methods has not received much attention until now: what is the impact of different noise levels on the ability of these methods to identify the direction of the causal relationship. This work aims to bridge this gap with the help of an empirical study. We test Regression with Subsequent Independence Test (RESIT) using an exhaustive range of models where the level of additive noise gradually changes from 1% to 10000% of the causes' noise level (the latter remains fixed). Additionally, the experiments in this work consider several different types of distributions as well as linear and non-linear models. The results of the experiments show that ANM methods can fail to capture the true causal direction for some levels of noise.
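
RESIT fits a regression in each direction and asks in which direction the residuals are independent of the putative cause. A hedged sketch of that logic follows, with a crude binned dependence score standing in for the proper independence test (e.g. HSIC) that RESIT actually uses; the polynomial regressions, data-generating model, and score are illustrative assumptions.

```python
import numpy as np

def dep_score(r, z, bins=8):
    """Crude independence proxy: how much the mean and variance of the
    residuals r vary across bins of z. A stand-in for a real
    independence test; illustrative only."""
    chunks = np.array_split(r[np.argsort(z)], bins)
    return np.var([c.mean() for c in chunks]) + np.var([c.var() for c in chunks])

def resit_direction(x, y, deg=4):
    r_fwd = y - np.polyval(np.polyfit(x, y, deg), x)   # residuals of y ~ f(x)
    r_bwd = x - np.polyval(np.polyfit(y, x, deg), y)   # residuals of x ~ g(y)
    return 'x->y' if dep_score(r_fwd, x) < dep_score(r_bwd, y) else 'y->x'

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 4000)
y = x ** 3 + rng.normal(0.0, 0.3, 4000)   # additive-noise model: x causes y
direction = resit_direction(x, y)
```

In the causal direction the residuals look like pure noise in every bin of x; in the anti-causal direction their mean and spread vary with y, which is exactly the asymmetry ANM methods exploit, and which the paper shows can break down at extreme noise levels.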

【8】 GaussED: A Probabilistic Programming Language for Sequential Experimental Design
Link: https://arxiv.org/abs/2110.08072

Authors: Matthew A. Fisher, Onur Teymur, Chris J. Oates
Affiliations: Newcastle University, UK; Alan Turing Institute, UK
Abstract: Sequential algorithms are popular for experimental design, enabling emulation, optimisation and inference to be efficiently performed. For most of these applications bespoke software has been developed, but the approach is general and many of the actual computations performed in such software are identical. Motivated by the diverse problems that can in principle be solved with common code, this paper presents GaussED, a simple probabilistic programming language coupled to a powerful experimental design engine, which together automate sequential experimental design for approximating a (possibly nonlinear) quantity of interest in Gaussian process models. Using a handful of commands, GaussED can be used to: solve linear partial differential equations, perform tomographic reconstruction from integral data and implement Bayesian optimisation with gradient data.

【9】 Compressive Independent Component Analysis: Theory and Algorithms
Link: https://arxiv.org/abs/2110.08045

Authors: Michael P. Sheehan, Mike E. Davies
Affiliations: Institute of Digital Communications, University of Edinburgh, Edinburgh, UK
Comments: 27 pages, 8 figures, under review
Abstract: Compressive learning forms the exciting intersection between compressed sensing and statistical learning where one exploits forms of sparsity and structure to reduce the memory and/or computational complexity of the learning task. In this paper, we look at the independent component analysis (ICA) model through the compressive learning lens. In particular, we show that solutions to the cumulant based ICA model have particular structure that induces a low dimensional model set that resides in the cumulant tensor space. By showing a restricted isometry property holds for random cumulants e.g. Gaussian ensembles, we prove the existence of a compressive ICA scheme. Thereafter, we propose two algorithms of the form of an iterative projection gradient (IPG) and an alternating steepest descent (ASD) algorithm for compressive ICA, where the order of compression asserted from the restricted isometry property is realised through empirical results. We provide analysis of the CICA algorithms including the effects of finite samples. The effects of compression are characterised by a trade-off between the sketch size and the statistical efficiency of the ICA estimates. By considering synthetic and real datasets, we show the substantial memory gains achieved over well-known ICA algorithms by using one of the proposed CICA algorithms. Finally, we conclude the paper with open problems including interesting challenges from the emerging field of compressive learning.

【10】 Second-level randomness test based on the Kolmogorov-Smirnov test
Link: https://arxiv.org/abs/2110.08023

Authors: Akihiro Yamaguchi, Asaki Saito
Affiliations: Fukuoka Institute of Technology, Wajiro-higashi, Higashi-ku, Fukuoka, Japan; Future University Hakodate, Kamedanakano-cho, Hakodate, Hokkaido, Japan
Comments: 5 pages, 2 figures
Abstract: We analyzed the effect of the deviation of the exact distribution of the p-values from the uniform distribution on the Kolmogorov-Smirnov (K-S) test that was implemented as the second-level randomness test. We derived an inequality that provides an upper bound on the expected value of the K-S test statistic when the distribution of the null hypothesis differs from the exact distribution. Furthermore, we proposed a second-level test based on the two-sample K-S test with an ideal empirical distribution as a candidate for improvement.
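
The second-level setup is simple to sketch: collect many first-level p-values and compare their empirical distribution to Uniform(0,1) with the one-sample K-S statistic. A minimal NumPy version follows; the skewed alternative is an illustrative assumption, not data from the paper.

```python
import numpy as np

def ks_statistic(pvals):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0,1):
    the largest gap between the empirical CDF and the identity."""
    p = np.sort(pvals)
    n = len(p)
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return max(np.max(ecdf_hi - p), np.max(p - ecdf_lo))

rng = np.random.default_rng(2)
uniform_p = rng.uniform(size=1000)         # ideal first-level p-values
skewed_p = rng.uniform(size=1000) ** 1.5   # p-values biased toward 0
d_uniform, d_skewed = ks_statistic(uniform_p), ks_statistic(skewed_p)
```

The paper's point is that the exact first-level p-value distribution is often not exactly uniform, which biases this statistic; their proposed fix compares against an ideal empirical distribution via the two-sample K-S test instead.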

【11】 Spatially Adaptive Calibrations of AirBox PM$_{2.5}$ Data
Link: https://arxiv.org/abs/2110.08005

Authors: ShengLi Tzeng, Chi-Wei Lai, Hsin-Cheng Huang
Affiliations: Department of Applied Mathematics, National Sun Yat-sen University, Taiwan, R.O.C.; Institute of Statistics, National Tsing Hua University, Taiwan, R.O.C.; Institute of Statistical Science, Academia Sinica, Taiwan, R.O.C.
Comments: 21 pages, 9 figures
Abstract: Two networks are available to monitor PM$_{2.5}$ in Taiwan, including the Taiwan Air Quality Monitoring Network (TAQMN) and the AirBox network. The TAQMN, managed by Taiwan's Environmental Protection Administration (EPA), provides high-quality PM$_{2.5}$ measurements at $77$ monitoring stations. More recently, the AirBox network was launched, consisting of low-cost, small internet-of-things (IoT) microsensors (i.e., AirBoxes) at thousands of locations. While the AirBox network provides broad spatial coverage, its measurements are not reliable and require calibrations. However, applying a universal calibration procedure to all AirBoxes does not work well because the calibration curves vary with several factors, including the chemical compositions of PM$_{2.5}$, which are not homogeneous in space. Therefore, different calibrations are needed at different locations with different local environments. Unfortunately, most AirBoxes are not close to EPA stations, making the calibration task challenging. In this article, we propose a spatial model with spatially varying coefficients to account for heteroscedasticity in the data. Our method gives adaptive calibrations of AirBoxes according to their local conditions and provides accurate PM$_{2.5}$ concentrations at any location in Taiwan, incorporating two types of measurements. In addition, the proposed method automatically calibrates measurements from a new AirBox once it is added to the network. We illustrate our approach using hourly PM$_{2.5}$ data in the year 2020. After the calibration, the results show that the PM$_{2.5}$ prediction improves about 37% to 67% in root mean-squared prediction error for matching EPA data. In particular, once the calibration curves are established, we can obtain reliable PM$_{2.5}$ values at any location in Taiwan, even if we ignore EPA data.

【12】 Fast Partial Quantile Regression
Link: https://arxiv.org/abs/2110.07998

Authors: Alvaro Mendez Civieta, M. Carmen Aguilera-Morillo, Rosa E. Lillo
Affiliations: Department of Statistics, Universidad Carlos III de Madrid; uc3m-Santander Big Data Institute; Department of Applied Statistics and Operational Research, Universitat Politècnica de València
Comments: 22 pages, 5 figures and 5 tables
Abstract: Partial least squares (PLS) is a dimensionality reduction technique used as an alternative to ordinary least squares (OLS) in situations where the data is collinear or high dimensional. Both PLS and OLS provide mean based estimates, which are extremely sensitive to the presence of outliers or heavy tailed distributions. In contrast, quantile regression is an alternative to OLS that computes robust quantile based estimates. In this work, the multivariate PLS is extended to the quantile regression framework, obtaining a theoretical formulation of the problem and a robust dimensionality reduction technique that we call fast partial quantile regression (fPQR), that provides quantile based estimates. An efficient implementation of fPQR is also derived, and its performance is studied through simulation experiments and the chemometrics well known biscuit dough dataset, a real high dimensional example.
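
Quantile regression replaces the squared error of OLS with the pinball (check) loss, whose minimiser is the requested quantile; this is what makes the resulting estimates robust. A small self-contained check of that fact (not the fPQR algorithm itself; the grid and sample are illustrative):

```python
import numpy as np

def pinball_loss(u, tau):
    """Check (pinball) loss: tau*u for u >= 0, (tau-1)*u for u < 0.
    Its minimiser over a location parameter is the tau-quantile."""
    return np.mean(np.maximum(tau * u, (tau - 1) * u))

rng = np.random.default_rng(3)
y = rng.normal(size=50_000)
grid = np.linspace(-3, 3, 601)
tau = 0.9
best = grid[np.argmin([pinball_loss(y - g, tau) for g in grid])]
# best should sit near the 0.9-quantile of a standard normal, about 1.28
```

fPQR carries this loss into the PLS-style latent-component construction, so the extracted directions target conditional quantiles instead of conditional means.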

【13】 Multivariate Mean Comparison under Differential Privacy
Link: https://arxiv.org/abs/2110.07996

Authors: Martin Dunsche, Tim Kutta, Holger Dette
Affiliations: Ruhr-University Bochum
Abstract: The comparison of multivariate population means is a central task of statistical inference. While statistical theory provides a variety of analysis tools, they usually do not protect individuals' privacy. This knowledge can create incentives for participants in a study to conceal their true data (especially for outliers), which might result in a distorted analysis. In this paper we address this problem by developing a hypothesis test for multivariate mean comparisons that guarantees differential privacy to users. The test statistic is based on the popular Hotelling's $t^2$-statistic, which has a natural interpretation in terms of the Mahalanobis distance. In order to control the type-1 error, we present a bootstrap algorithm under differential privacy that provably yields a reliable test decision. In an empirical study we demonstrate the applicability of this approach.
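
As background for why the privatised test statistic needs care: releasing even a single mean under differential privacy requires noise calibrated to the data's sensitivity. A sketch of the standard Gaussian mechanism for a clipped mean (parameters and data are illustrative; this is not the authors' $t^2$-based test statistic):

```python
import numpy as np

def dp_mean(x, lo, hi, eps, delta, rng):
    """Release an (eps, delta)-DP mean via the Gaussian mechanism.

    Entries are clipped to [lo, hi], so one individual can change the
    mean by at most (hi - lo) / n (the L2 sensitivity of the query).
    """
    x = np.clip(x, lo, hi)
    sens = (hi - lo) / len(x)
    sigma = sens * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return x.mean() + rng.normal(0, sigma)

rng = np.random.default_rng(4)
x = rng.normal(0.5, 0.1, size=10_000)
private_mean = dp_mean(x, lo=0.0, hi=1.0, eps=1.0, delta=1e-5, rng=rng)
```

The paper's harder problem is propagating such noise through the Hotelling statistic while still controlling the type-1 error, which is what their DP bootstrap addresses.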

【14】 A new class of α-transformations for the spatial analysis of Compositional Data
Link: https://arxiv.org/abs/2110.07967

Authors: Lucia Clarotto, Denis Allard, Alessandra Menafoglio
Affiliations: MINES ParisTech, PSL University, Centre de Géosciences, Fontainebleau, France; Biostatistics and Spatial Processes (BioSP), INRAE, Avignon, France; MOX, Department of Mathematics, Politecnico di Milano, Milano, Italy
Comments: 31 pages and 13 figures. Supplementary material with 3 figures
Abstract: Georeferenced compositional data are prominent in many scientific fields and in spatial statistics. This work addresses the problem of proposing models and methods to analyze and predict, through kriging, this type of data. To this purpose, a novel class of transformations, named the Isometric $\alpha$-transformation ($\alpha$-IT), is proposed, which encompasses the traditional Isometric Log-Ratio (ILR) transformation. It is shown that the ILR is the limit case of the $\alpha$-IT as $\alpha$ tends to 0 and that $\alpha=1$ corresponds to a linear transformation of the data. Unlike the ILR, the proposed transformation accepts 0s in the compositions when $\alpha>0$. Maximum likelihood estimation of the parameter $\alpha$ is established. Prediction using kriging on $\alpha$-IT transformed data is validated on synthetic spatial compositional data, using prediction scores computed either in the geometry induced by the $\alpha$-IT, or in the simplex. Application to land cover data shows that the relative superiority of the various approaches w.r.t. a prediction objective depends on whether the compositions contained any zero component. When all components are positive, the limit cases (ILR or linear transformations) are optimal for none of the considered metrics. An intermediate geometry, corresponding to the $\alpha$-IT with maximum likelihood estimate, better describes the dataset in a geostatistical setting. When the amount of compositions with 0s is not negligible, some side-effects of the transformation get amplified as $\alpha$ decreases, entailing poor kriging performance both within the $\alpha$-IT geometry and for metrics in the simplex.
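
The ILR transformation that the $\alpha$-IT generalises is short to write down, and its failure on zeros (log of 0) is exactly the limitation the $\alpha$-IT with $\alpha>0$ removes. A sketch using the standard sequential-binary-partition basis (an assumption; any orthonormal log-ratio basis serves equally):

```python
import numpy as np

def ilr(x):
    """Isometric log-ratio transform of compositions (rows sum to 1).

    Uses the standard sequential-binary-partition basis:
    z_j = sqrt(j/(j+1)) * (log geometric mean of first j parts - log part j+1).
    Note: undefined when any component is 0, since log(0) = -inf.
    """
    logx = np.log(x)
    D = x.shape[1]
    cols = []
    for j in range(1, D):
        gm = logx[:, :j].mean(axis=1)   # log geometric mean of first j parts
        cols.append(np.sqrt(j / (j + 1)) * (gm - logx[:, j]))
    return np.column_stack(cols)

comp = np.array([[1/3, 1/3, 1/3],
                 [0.2, 0.3, 0.5]])
z = ilr(comp)   # the neutral composition maps to the origin
```

A D-part composition maps to D-1 unconstrained coordinates, on which ordinary (co)kriging can be run; the paper's $\alpha$-IT interpolates between this geometry ($\alpha \to 0$) and a linear one ($\alpha = 1$).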

【15】 Minimax-robust estimation problems for sequences with periodically stationary increments observed with noise
Link: https://arxiv.org/abs/2110.07952

Authors: Maksym Luz, Mikhail Moklyachuk
Comments: arXiv admin note: substantial text overlap with arXiv:2007.11581; text overlap with arXiv:2110.07189
Abstract: The problem of optimal estimation of linear functionals constructed from the unobserved values of a stochastic sequence with periodically stationary increments based on observations of the sequence with stationary noise is considered. For sequences with known spectral densities, we obtain formulas for calculating values of the mean square errors and the spectral characteristics of the optimal estimates of the functionals. Formulas that determine the least favorable spectral densities and the minimax-robust spectral characteristics of the optimal linear estimates of functionals are proposed in the case where spectral densities of the sequence are not exactly known while some sets of admissible spectral densities are specified.

【16】 Different coefficients for studying dependence
Link: https://arxiv.org/abs/2110.07928

Authors: Oona Rainio
Comments: 16 pages, 6 figures, 4 tables
Abstract: Through computer simulations, we research several different measures of dependence, including Pearson's and Spearman's correlation coefficients, the maximal correlation, the distance correlation, a function of the mutual information called the information coefficient of correlation, and the maximal information coefficient (MIC). We compare how well these coefficients fulfill the criteria of generality, power, and equitability. Furthermore, we consider how the exact type of dependence, the amount of noise and the number of observations affect their performance.
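
Two of the compared coefficients are easy to contrast directly: Pearson's correlation is blind to symmetric nonlinear dependence, while the distance correlation is nonzero for essentially any dependence. A sketch on an illustrative quadratic relationship (the data-generating model is an assumption, not one of the paper's simulation settings):

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation (V-statistic form): double-centre the
    pairwise-distance matrices, then correlate them. Zero in the limit
    iff x and y are independent."""
    def centred(a):
        d = np.abs(a[:, None] - a[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centred(x), centred(y)
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 1000)
y = x ** 2 + rng.normal(0, 0.05, 1000)   # strong but non-monotone dependence
pearson = np.corrcoef(x, y)[0, 1]
dcor = distance_correlation(x, y)
```

Here Pearson's coefficient is near zero despite y being an almost deterministic function of x, while the distance correlation is clearly positive: the "generality" criterion the paper evaluates.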

【17】 pimeta: an R package of prediction intervals for random-effects meta-analysis
Link: https://arxiv.org/abs/2110.07856

Authors: Kengo Nagashima, Hisashi Noma, Toshi A. Furukawa
Affiliations: The Institute of Statistical Mathematics, Department of Data Science; Kyoto University Graduate School of Medicine / School of Public Health
Comments: 15 pages, 2 figures
Abstract: The prediction interval is gaining prominence in meta-analysis as it enables the assessment of uncertainties in treatment effects and heterogeneity between studies. However, coverage probabilities of the current standard method for constructing prediction intervals cannot retain their nominal levels in general, particularly when the number of synthesized studies is moderate or small, because their validities depend on large sample approximations. Recently, several methods have been developed to address this issue. This paper briefly summarizes the recent developments in methods of prediction intervals and provides readable examples using R for multiple types of data with simple code. The pimeta package is an R package that provides these improved methods to calculate accurate prediction intervals and graphical tools to illustrate these results. The pimeta package is listed in "CRAN Task View: Meta-Analysis." The analysis is easily performed in R using a series of R packages.
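
The "current standard method" whose small-sample coverage pimeta improves on is the Higgins-Thompson-Spiegelhalter prediction interval, $\hat\mu \pm t_{k-2}\sqrt{\hat\tau^2 + \widehat{SE}(\hat\mu)^2}$, with a DerSimonian-Laird estimate of the between-study variance $\hat\tau^2$. A plain-NumPy sketch of that textbook formula (pimeta itself is an R package; the seven study effects and variances below are illustrative, and $t_{0.975,5} \approx 2.5706$ is hardcoded):

```python
import numpy as np

def dl_prediction_interval(yi, vi, t_crit):
    """Higgins-Thompson-Spiegelhalter 95% prediction interval with the
    DerSimonian-Laird between-study variance estimate tau2."""
    w = 1.0 / vi
    mu_fixed = np.sum(w * yi) / np.sum(w)
    Q = np.sum(w * (yi - mu_fixed) ** 2)          # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(yi) - 1)) / c)      # DL estimate, truncated at 0
    w_star = 1.0 / (vi + tau2)                    # random-effects weights
    mu = np.sum(w_star * yi) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    half = t_crit * np.sqrt(tau2 + se ** 2)
    return mu - half, mu + half

# Seven illustrative study effects (e.g. log odds ratios) and variances.
yi = np.array([-0.5, -0.2, 0.1, -0.4, -0.6, 0.0, -0.3])
vi = np.array([0.04, 0.05, 0.06, 0.04, 0.07, 0.05, 0.06])
lo, hi = dl_prediction_interval(yi, vi, t_crit=2.5706)  # t_{0.975, df=5}
```

Note how the interval is much wider than a confidence interval for $\hat\mu$ alone, since it adds $\hat\tau^2$ for where a new study's effect may fall; the paper's point is that with few studies even this is often too narrow.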

【18】 Multiple Observers Ranked Set Samples for Shrinkage Estimators
Link: https://arxiv.org/abs/2110.07851

Authors: Andrew David Pearce, Armin Hatefi
Affiliations: Department of Mathematics and Statistics, Memorial University of Newfoundland, St. John's, NL, Canada
Comments: 32 pages, 5 figures, 9 tables
Abstract: Ranked set sampling (RSS) is used as a powerful data collection technique for situations where measuring the study variable requires a costly and/or tedious process while the sampling units can be ranked easily (e.g., osteoporosis research). In this paper, we develop ridge and Liu-type shrinkage estimators under RSS data from multiple observers to handle the collinearity problem in estimating coefficients of linear regression, stochastic restricted regression and logistic regression. Through extensive numerical studies, we show that shrinkage methods with the multi-observer RSS result in more efficient coefficient estimates. The developed methods are finally applied to bone mineral data for analysis of bone disorder status of women aged 50 and older.

【19】 Gaussian Process Bandit Optimization with Few Batches
Link: https://arxiv.org/abs/2110.07788

Authors: Zihan Li, Jonathan Scarlett
Affiliations: Department of Computer Science, School of Computing, National University of Singapore (NUS); Jonathan Scarlett is also with the Department of Mathematics and the Institute of Data Science
Abstract: In this paper, we consider the problem of black-box optimization using Gaussian Process (GP) bandit optimization with a small number of batches. Assuming the unknown function has a low norm in the Reproducing Kernel Hilbert Space (RKHS), we introduce a batch algorithm inspired by batched finite-arm bandit algorithms, and show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$, where the $O^\ast(\cdot)$ notation hides dimension-independent logarithmic factors and $\gamma_T$ is the maximum information gain associated with the kernel. This bound is near-optimal for several kernels of interest and improves on the typical $O^\ast(\sqrt{T}\gamma_T)$ bound, and our approach is arguably the simplest among algorithms attaining this improvement. In addition, in the case of a constant number of batches (not depending on $T$), we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches, focusing on the squared exponential and Matérn kernels. The algorithmic upper bounds are shown to be nearly minimax optimal via analogous algorithm-independent lower bounds.

【20】 Learning Mean-Field Equations from Particle Data Using WSINDy 标题:利用WSINDy从粒子数据中学习平均场方程 链接:https://arxiv.org/abs/2110.07756

作者:Daniel A. Messenger,David M. Bortz 摘要:我们开发了一种用于交互粒子系统(IPS)的弱形式稀疏辨识方法,其主要目标是降低大粒子数$N$下的计算复杂度,并提供对内在或外在噪声的鲁棒性。特别是,我们将IPS平均场理论的概念与非线性动力学弱形式稀疏辨识算法(WSINDy)相结合,提供一种快速可靠的系统辨识方案,用于在每次实验的粒子数$N$为数千且实验次数$M$小于100时恢复IPS的控制随机微分方程。这与现有工作形成对比:现有工作表明,使用强形式方法,只有在$N$小于100且$M$为数千量级时系统辨识才可行。我们证明了在一些标准正则性假设下,该格式在普通最小二乘设定下以$\mathcal{O}(N^{-1/2})$的速度收敛,并在一维和二维空间的若干系统上数值验证了该收敛速度。我们的例子包括均匀化理论中的一个典型问题(作为学习粗粒化模型的第一步)、吸引-排斥群体的动力学,以及趋化性的抛物-椭圆Keller-Segel模型的IPS描述。 摘要:We develop a weak-form sparse identification method for interacting particle systems (IPS) with the primary goals of reducing computational complexity for large particle number $N$ and offering robustness to either intrinsic or extrinsic noise. In particular, we use concepts from mean-field theory of IPS in combination with the weak-form sparse identification of nonlinear dynamics algorithm (WSINDy) to provide a fast and reliable system identification scheme for recovering the governing stochastic differential equations for an IPS when the number of particles per experiment $N$ is on the order of several thousand and the number of experiments $M$ is less than 100. This is in contrast to existing work showing that system identification for $N$ less than 100 and $M$ on the order of several thousand is feasible using strong-form methods. We prove that under some standard regularity assumptions the scheme converges with rate $\mathcal{O}(N^{-1/2})$ in the ordinary least squares setting and we demonstrate the convergence rate numerically on several systems in one and two spatial dimensions. Our examples include a canonical problem from homogenization theory (as a first step towards learning coarse-grained models), the dynamics of an attractive-repulsive swarm, and the IPS description of the parabolic-elliptic Keller-Segel model for chemotaxis.
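WSINDy first projects the dynamics onto test functions (the weak form) and then solves a sparse regression over a library of candidate terms. The sparse-regression core can be sketched with sequentially thresholded least squares; the cubic library, synthetic data, and threshold below are illustrative assumptions, not the paper's mean-field setup.

```python
import numpy as np

def stlsq(Theta, y, threshold=0.1, iters=10):
    # Sequentially thresholded least squares: the sparse-regression
    # step shared by SINDy-type methods, including WSINDy.
    xi = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        keep = ~small
        if keep.any():
            xi[keep] = np.linalg.lstsq(Theta[:, keep], y, rcond=None)[0]
    return xi

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
Theta = np.column_stack([x, x**2, x**3])                 # candidate library
y = 2.0 * x - 0.5 * x**3 + 0.01 * rng.normal(size=200)   # true sparse dynamics
xi = stlsq(Theta, y)
```

The thresholding prunes the spurious quadratic term and refits the surviving columns, recovering the sparse coefficients (2, 0, -0.5) up to noise.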

【21】 More Efficient, Doubly Robust, Nonparametric Estimators of Treatment Effects in Multilevel Studies 标题:多水平研究中治疗效果的更有效、双稳健、非参数估计 链接:https://arxiv.org/abs/2110.07740

作者:Chan Park,Hyunseung Kang 机构:Department of Statistics, University of Wisconsin–Madison 摘要:在多水平研究中研究治疗效果时,研究者通常使用(半)参数估计,对结果、治疗和/或个体之间的相关性做出强有力的参数假设。我们提出了两个非参数、双稳健、渐近正态的治疗效果估计量,不做这样的假设。第一个估计器是交叉拟合估计器在聚类环境下的扩展。第二个估计器是一个新的估计器,它使用条件倾向分数和结果协方差模型来提高效率。我们将我们的估计应用于模拟和实证研究,发现它们始终获得最小的标准误差。 摘要:When studying treatment effects in multilevel studies, investigators commonly use (semi-)parametric estimators, which make strong parametric assumptions about the outcome, the treatment, and/or the correlation between individuals. We propose two nonparametric, doubly robust, asymptotically Normal estimators of treatment effects that do not make such assumptions. The first estimator is an extension of the cross-fitting estimator applied to clustered settings. The second estimator is a new estimator that uses conditional propensity scores and an outcome covariance model to improve efficiency. We apply our estimators in simulation and empirical studies and find that they consistently obtain the smallest standard errors.
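A minimal numpy sketch of the cross-fitting idea behind the first estimator: nuisances (the propensity score and outcome means) are fit on one fold, the doubly robust AIPW scores are evaluated on the other fold, and the roles are then swapped. The i.i.d. data, binary covariate, and stratum-mean nuisance estimates are simplifications of this sketch; the paper's estimators additionally handle clustered data.

```python
import numpy as np

def aipw_scores(e, m1, m0, A, Y):
    # Doubly robust (AIPW) influence-function values for the ATE.
    return m1 - m0 + A * (Y - m1) / e - (1 - A) * (Y - m0) / (1 - e)

def crossfit_ate(X, A, Y):
    n = len(Y)
    half = n // 2
    scores = []
    for fit, ev in [(slice(0, half), slice(half, n)),
                    (slice(half, n), slice(0, half))]:
        Xf, Af, Yf = X[fit], A[fit], Y[fit]
        e_hat, m1_hat, m0_hat = {}, {}, {}
        for x in (0, 1):  # nonparametric nuisances: stratum means on the fit fold
            sel = Xf == x
            e_hat[x] = Af[sel].mean()
            m1_hat[x] = Yf[sel & (Af == 1)].mean()
            m0_hat[x] = Yf[sel & (Af == 0)].mean()
        e = np.array([e_hat[int(x)] for x in X[ev]])
        m1 = np.array([m1_hat[int(x)] for x in X[ev]])
        m0 = np.array([m0_hat[int(x)] for x in X[ev]])
        scores.append(aipw_scores(e, m1, m0, A[ev], Y[ev]))
    return np.concatenate(scores).mean()

rng = np.random.default_rng(0)
n = 4000
X = rng.integers(0, 2, n)
A = (rng.random(n) < 0.3 + 0.4 * X).astype(float)
Y = 1.0 + 2.0 * A + X + rng.normal(size=n)   # true ATE = 2
ate = crossfit_ate(X, A, Y)
```

Because each score is evaluated on data not used to fit its nuisances, the estimator avoids overfitting bias while remaining fully nonparametric in this toy setting.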

【22】 Model-Change Active Learning in Graph-Based Semi-Supervised Learning 标题:基于图的半监督学习中的模型变换主动学习 链接:https://arxiv.org/abs/2110.07739

作者:Kevin Miller,Andrea L. Bertozzi 机构:†Department of Mathematics, University of California 备注:Submitted to SIAM Journal on Mathematics of Data Science (SIMODS) 摘要:半监督分类中的主动学习是指为未标记的数据引入额外的标签,以提高底层分类器的准确性。一个挑战是在限制新标签数量的同时,确定标记哪些点最能提高性能。"模型变化"主动学习量化引入额外标签后分类器产生的变化。我们将这一思想与基于图的半监督学习方法相结合,该方法使用图拉普拉斯矩阵的谱,该谱可以被截断以避免过高的计算和存储成本。我们考虑一族凸损失函数,对于它们,利用后验分布的拉普拉斯近似可以有效地逼近采集函数(acquisition function)。我们展示了多个多类示例,说明其性能优于现有最优方法。 摘要:Active learning in semi-supervised classification involves introducing additional labels for unlabelled data to improve the accuracy of the underlying classifier. A challenge is to identify which points to label to best improve performance while limiting the number of new labels. "Model-change" active learning quantifies the resulting change incurred in the classifier by introducing the additional label(s). We pair this idea with graph-based semi-supervised learning methods, that use the spectrum of the graph Laplacian matrix, which can be truncated to avoid prohibitively large computational and storage costs. We consider a family of convex loss functions for which the acquisition function can be efficiently approximated using the Laplace approximation of the posterior distribution. We show a variety of multiclass examples that illustrate improved performance over prior state-of-art.
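A small illustration of the truncated graph-Laplacian spectrum these methods rest on: for two cliques joined by one weak edge, keeping only the few smallest eigenpairs already encodes the cluster structure (the sign of the Fiedler vector). The graph and truncation level are illustrative assumptions of this sketch.

```python
import numpy as np

# Two 4-node cliques joined by one weak edge.
W = np.zeros((8, 8))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 0.01

L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
vals, vecs = np.linalg.eigh(L)

# Truncate the spectrum: keep only the M smallest eigenpairs; this is what
# keeps computation and storage manageable in the paper's setting.
M = 3
vals_t, vecs_t = vals[:M], vecs[:, :M]

labels = (vecs_t[:, 1] > 0).astype(int)  # sign of the Fiedler vector
```

The smallest eigenvalue is 0 (constant eigenvector), and the second eigenvector splits the two cliques; higher eigenpairs, discarded by the truncation, carry little cluster information here.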

【23】 Algorithms for Sparse Support Vector Machines 标题:稀疏支持向量机的算法 链接:https://arxiv.org/abs/2110.07691

作者:Alfonso Landeros,Kenneth Lange 机构:Departments of Computational Medicine, Human Genetics, and Statistics, University of California, Los Angeles, CA , Research supported in part by, USPHS grants R, GM, and HG,., arXiv:,.,v, [stat.ME] , Oct 备注:Main text: 21 pages, 3 figures, 4 tables; Appendix: 6 pages, 2 figures 摘要:分类中的许多问题都涉及大量不相关的特征。模型选择揭示关键特征,降低特征空间的维数,并提高模型的可解释性。在支持向量机文献中,模型选择是通过$\ell_1$惩罚来实现的。这些凸松弛会使参数估计严重偏向0,并且往往纳入过多不相关的特征。本文提出了一种用稀疏集约束代替惩罚的替代方法。惩罚仍然存在,但用途不同。近端距离原理取损失函数$L(\boldsymbol{\beta})$,并加上惩罚$\frac{\rho}{2}\mathrm{dist}(\boldsymbol{\beta},S_k)^2$,它刻画参数向量$\boldsymbol{\beta}$到稀疏集$S_k$的欧氏距离平方,其中$\boldsymbol{\beta}$至多有$k$个非零分量。如果$\boldsymbol{\beta}_\rho$表示目标$f_\rho(\boldsymbol{\beta})=L(\boldsymbol{\beta})+\frac{\rho}{2}\mathrm{dist}(\boldsymbol{\beta},S_k)^2$的最小值点,那么当$\rho$趋于$\infty$时,$\boldsymbol{\beta}_\rho$趋于$L(\boldsymbol{\beta})$在$S_k$上的约束最小值点。我们推导了两个密切相关的算法来实现这一策略。模拟和真实数据的例子生动地展示了这些算法如何在不损失分类能力的情况下实现更好的稀疏性。 摘要:Many problems in classification involve huge numbers of irrelevant features. Model selection reveals the crucial features, reduces the dimensionality of feature space, and improves model interpretation. In the support vector machine literature, model selection is achieved by $\ell_1$ penalties. These convex relaxations seriously bias parameter estimates toward 0 and tend to admit too many irrelevant features. The current paper presents an alternative that replaces penalties by sparse-set constraints. Penalties still appear, but serve a different purpose. The proximal distance principle takes a loss function $L(\boldsymbol{\beta})$ and adds the penalty $\frac{\rho}{2}\mathrm{dist}(\boldsymbol{\beta}, S_k)^2$ capturing the squared Euclidean distance of the parameter vector $\boldsymbol{\beta}$ to the sparsity set $S_k$ where at most $k$ components of $\boldsymbol{\beta}$ are nonzero.
If $\boldsymbol{\beta}_\rho$ represents the minimum of the objective $f_\rho(\boldsymbol{\beta})=L(\boldsymbol{\beta})+\frac{\rho}{2}\mathrm{dist}(\boldsymbol{\beta}, S_k)^2$, then $\boldsymbol{\beta}_\rho$ tends to the constrained minimum of $L(\boldsymbol{\beta})$ over $S_k$ as $\rho$ tends to $\infty$. We derive two closely related algorithms to carry out this strategy. Our simulated and real examples vividly demonstrate how the algorithms achieve much better sparsity without loss of classification power.
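A hedged numpy sketch of the proximal distance idea, with a least-squares loss standing in for the paper's SVM losses: gradient steps on $L(\beta)+\frac{\rho}{2}\mathrm{dist}(\beta,S_k)^2$, where projecting onto $S_k$ simply keeps the $k$ largest-magnitude entries. The fixed $\rho$, step size, and data are illustrative; proximal distance algorithms typically increase $\rho$ along the run.

```python
import numpy as np

def project_sparse(beta, k):
    # Euclidean projection onto S_k: keep the k largest-magnitude entries.
    out = np.zeros_like(beta)
    idx = np.argsort(np.abs(beta))[-k:]
    out[idx] = beta[idx]
    return out

def proximal_distance(X, y, k, rho=5.0, steps=2000):
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + rho)  # 1 / Lipschitz constant
    for _ in range(steps):
        proj = project_sparse(beta, k)
        # gradient of L(beta) + (rho/2) * dist(beta, S_k)^2
        grad = X.T @ (X @ beta - y) / n + rho * (beta - proj)
        beta -= step * grad
    return project_sparse(beta, k)

rng = np.random.default_rng(0)
n, p, k = 100, 10, 2
beta_true = np.zeros(p)
beta_true[0], beta_true[9] = 3.0, -2.0
X = rng.normal(size=(n, p))
y = X @ beta_true                 # noiseless sparse regression
beta_hat = proximal_distance(X, y, k)
```

Unlike an $\ell_1$ penalty, the distance penalty vanishes on $S_k$, so the surviving coefficients are not shrunk toward 0.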

【24】 A Distribution-Free Independence Test for High Dimension Data 标题:一种高维数据的无分布独立性检验 链接:https://arxiv.org/abs/2110.07652

作者:Zhanrui Cai,Jing Lei,Kathryn Roeder 机构:Department of Statistics and Data Science, Carnegie Mellon University 摘要:独立性检验在现代数据分析中至关重要,在变量选择、图模型和因果推断中有着广泛的应用。当数据为高维且潜在依赖信号稀疏时,在没有分布或结构假设的情况下,独立性检验变得非常具有挑战性。在本文中,我们提出了一个独立性检验的一般框架:首先拟合一个区分联合分布和乘积分布的分类器,然后检验拟合分类器的显著性。该框架允许我们借助现代机器学习社区开发的最先进分类算法的优势,使其适用于高维复杂数据。通过结合样本分割和固定置换,我们的检验统计量具有一个通用的、固定的高斯零分布,该分布与底层数据分布无关。大量模拟表明,与现有方法相比,新提出的检验具有优势。我们进一步将新检验应用于一个单细胞数据集,检验两类单细胞测序测量之间的独立性;其高维性和稀疏性使现有方法难以应用。 摘要:Test of independence is of fundamental importance in modern data analysis, with broad applications in variable selection, graphical models, and causal inference. When the data is high dimensional and the potential dependence signal is sparse, independence testing becomes very challenging without distributional or structural assumptions. In this paper we propose a general framework for independence testing by first fitting a classifier that distinguishes the joint and product distributions, and then testing the significance of the fitted classifier. This framework allows us to borrow the strength of the most advanced classification algorithms developed from the modern machine learning community, making it applicable to high dimensional, complex data. By combining a sample split and a fixed permutation, our test statistic has a universal, fixed Gaussian null distribution that is independent of the underlying data distribution. Extensive simulations demonstrate the advantages of the newly proposed test compared with existing methods. We further apply the new test to a single cell data set to test the independence between two types of single cell sequencing measurements, whose high dimensionality and sparsity make existing methods hard to apply.
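The test's recipe, fit a classifier to distinguish joint pairs from permuted (product) pairs and then test its held-out accuracy against chance, can be sketched as follows. The threshold rule on $|x-y|$ is a deliberately simple stand-in for the paper's flexible ML classifier, and the normal approximation to the binomial is an assumption of this sketch.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
y = x + 0.3 * rng.normal(size=n)          # dependent pairs

# Build "joint" pairs and "product" pairs (permuting y breaks the dependence).
perm = rng.permutation(n)
feats = np.abs(np.concatenate([x - y, x - y[perm]]))
labels = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = joint

# Sample split: learn the classifier on one half, evaluate on the other.
order = rng.permutation(2 * n)
train, test = order[:n], order[n:]
threshold = np.median(feats[train])       # toy stand-in for a fitted classifier
pred = (feats[test] < threshold).astype(float)
acc = (pred == labels[test]).mean()

# One-sided z-test: is held-out accuracy better than chance (0.5)?
z = (acc - 0.5) * math.sqrt(len(test)) / 0.5
p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
```

Because significance is assessed only on the held-out half, any classifier, however complex, can be plugged in without invalidating the null distribution.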

【25】 Sparse Implicit Processes for Approximate Inference 标题:近似推理的稀疏隐式过程 链接:https://arxiv.org/abs/2110.07618

作者:Simón Rodríguez Santana,Bryan Zaldivar,Daniel Hernández-Lobato 机构: 1IntroductionApproximate inference has recently attracted an enor-mous interest from the neural networks (NNs) com-munity due to the high performance and popularity 1Institute of Mathematical Sciences (ICMAT-CSIC) 备注:10 pages for the main text (with 3 figures and 1 table), and 9 pages of supplementary material (with 6 figures and 3 tables) 摘要:隐式过程(IPs)是一种灵活的先验知识,可以描述贝叶斯神经网络、神经采样器和数据生成器等模型。IPs允许在函数空间中进行近似推理。这避免了由于参数数量多、依赖性强而导致的参数空间近似推理的退化问题。为此,通常使用额外IP来近似先前IP的后部。然而,同时调整先验知识产权和近似后验知识产权的参数是一项具有挑战性的任务。现有的方法可以调整先验IP,导致高斯预测分布,无法捕获重要的数据模式。相比之下,通过使用另一个IP来近似后验过程来产生灵活的预测分布的方法不能使前一个IP与观测数据相匹配。我们在此提出了一种可以同时执行这两项任务的方法。为此,我们依赖于先前IP的诱导点表示,就像在稀疏高斯过程中经常做的那样。其结果是一种可扩展的近似推理方法,该方法可以根据数据调整先验IP参数,并提供准确的非高斯预测分布。 摘要:Implicit Processes (IPs) are flexible priors that can describe models such as Bayesian neural networks, neural samplers and data generators. IPs allow for approximate inference in function-space. This avoids some degenerate problems of parameter-space approximate inference due to the high number of parameters and strong dependencies. For this, an extra IP is often used to approximate the posterior of the prior IP. However, simultaneously adjusting the parameters of the prior IP and the approximate posterior IP is a challenging task. Existing methods that can tune the prior IP result in a Gaussian predictive distribution, which fails to capture important data patterns. By contrast, methods producing flexible predictive distributions by using another IP to approximate the posterior process cannot fit the prior IP to the observed data. We propose here a method that can carry out both tasks. For this, we rely on an inducing-point representation of the prior IP, as often done in the context of sparse Gaussian processes. The result is a scalable method for approximate inference with IPs that can tune the prior IP parameters to the data, and that provides accurate non-Gaussian predictive distributions.

【26】 Astronomical source finding services for the CIRASA visual analytic platform 标题:CIRASA可视化分析平台的天文找源服务 链接:https://arxiv.org/abs/2110.08211

作者:S. Riggia,C. Bordiu,F. Vitello,G. Tudisco,E. Sciacca,D. Magro,R. Sortino,C. Pino,M. Molinaro,M. Benedettini,S. Leurini,F. Bufano,M. Raciti,U. Becciani 机构:INAF - Osservatorio Astrofisico di Catania, Via Santa Sofia , Catania, Italy, Centro de Astrobiología (INTA-CSIC), Ctra. M-, km. , Torrejón de Ardoz, Madrid, Spain, INAF - Istituto. di Radioastronomia, Via Gobetti , Bologna, Italy 备注:16 pages, 6 figures 摘要:如今,数据处理、归档、分析和可视化方面的创新发展对于处理射电天文学下一代设施(如平方公里阵列(SKA)及其前身)中预期的数据洪流是不可避免的。在这种情况下,将源提取和分析算法集成到数据可视化工具中可以显著改进和加快大面积调查的编目过程,提高天文学家的生产率,缩短出版时间。为此,我们正在开发一个用于高级源查找和分类的可视化分析平台(CIRASA),集成最先进的工具,如CAESAR源查找器、ViaLactea可视化分析(VLVA)和知识库(VLKB)。在这项工作中,我们提出了项目目标和平台架构,重点是实现的源查找服务。 摘要:Innovative developments in data processing, archiving, analysis, and visualization are nowadays unavoidable to deal with the data deluge expected in next-generation facilities for radio astronomy, such as the Square Kilometre Array (SKA) and its precursors. In this context, the integration of source extraction and analysis algorithms into data visualization tools could significantly improve and speed up the cataloguing process of large area surveys, boosting astronomer productivity and shortening publication time. To this aim, we are developing a visual analytic platform (CIRASA) for advanced source finding and classification, integrating state-of-the-art tools, such as the CAESAR source finder, the ViaLactea Visual Analytic (VLVA) and Knowledge Base (VLKB). In this work, we present the project objectives and the platform architecture, focusing on the implemented source finding services.

【27】 Halpern-Type Accelerated and Splitting Algorithms For Monotone Inclusions 标题:单调包含的Halpern型加速和分裂算法 链接:https://arxiv.org/abs/2110.08150

作者:Quoc Tran-Dinh,Yang Luo 机构:Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill, Hanes Hall, UNC-Chapel Hill, NC ,-,. 备注:33 pages 摘要:在本文中,我们发展了一种新的加速算法来求解几类极大单调方程和单调包含。我们的方法没有使用Nesterov的加速方法,而是依赖于[32]中所谓的Halpern型不动点迭代,该迭代最近被许多研究人员利用,包括[24,70]。首先,我们基于Popov的过去额外梯度法推导了[70]中锚定额外梯度格式的一个新变体,以求解最大单调方程$G(x)=0$。我们证明了我们的方法在算子范数$\Vert G(x_k)\Vert$上实现了与锚定额外梯度算法相同的$\mathcal{O}(1/k)$收敛速度(至多相差常数因子),但每次迭代只需要计算一次$G$,其中$k$是迭代计数器。接下来,我们发展了两个分裂算法来逼近两个最大单调算子之和的零点。第一种算法源于锚定额外梯度法并结合分裂技术,而第二种算法是它的Popov变体,可以降低每次迭代的复杂度。这两种算法似乎都是新的,可以看作是Douglas-Rachford(DR)分裂法的加速变体。它们都在与问题相关联的前向-后向残差算子$G_{\gamma}(\cdot)$的范数$\Vert G_{\gamma}(x_k)\Vert$上实现$\mathcal{O}(1/k)$速率。我们还提出了一个新的加速Douglas-Rachford分裂格式来求解该问题,它仅在最大单调假设下在$\Vert G_{\gamma}(x_k)\Vert$上达到$\mathcal{O}(1/k)$收敛速度。最后,我们将第一个算法具体化以求解凸凹极小极大问题,并应用我们的加速DR格式推导出交替方向乘子法(ADMM)的一个新变体。 摘要:In this paper, we develop a new type of accelerated algorithms to solve some classes of maximally monotone equations as well as monotone inclusions. Instead of using Nesterov's accelerating approach, our methods rely on a so-called Halpern-type fixed-point iteration in [32], and recently exploited by a number of researchers, including [24, 70]. Firstly, we derive a new variant of the anchored extra-gradient scheme in [70] based on Popov's past extra-gradient method to solve a maximally monotone equation $G(x) = 0$. We show that our method achieves the same $\mathcal{O}(1/k)$ convergence rate (up to a constant factor) as in the anchored extra-gradient algorithm on the operator norm $\Vert G(x_k)\Vert$, but requires only one evaluation of $G$ at each iteration, where $k$ is the iteration counter. Next, we develop two splitting algorithms to approximate a zero point of the sum of two maximally monotone operators. The first algorithm originates from the anchored extra-gradient method combining with a splitting technique, while the second one is its Popov's variant which can reduce the per-iteration complexity.
Both algorithms appear to be new and can be viewed as accelerated variants of the Douglas-Rachford (DR) splitting method. They both achieve $\mathcal{O}(1/k)$ rates on the norm $\Vert G_{\gamma}(x_k)\Vert$ of the forward-backward residual operator $G_{\gamma}(\cdot)$ associated with the problem. We also propose a new accelerated Douglas-Rachford splitting scheme for solving this problem which achieves $\mathcal{O}(1/k)$ convergence rate on $\Vert G_{\gamma}(x_k)\Vert$ under only maximally monotone assumptions. Finally, we specify our first algorithm to solve convex-concave minimax problems and apply our accelerated DR scheme to derive a new variant of the alternating direction method of multipliers (ADMM).
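A minimal sketch of a Halpern-anchored extragradient iteration on the toy monotone (skew) operator $G(x)=Ax$, whose zero is the origin and on which plain forward gradient steps diverge. The anchor weight $\beta_k=1/(k+2)$ is the standard Halpern choice; the step size and operator are illustrative, not the paper's general setting.

```python
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # monotone (skew) operator G(x) = A x
G = lambda x: A @ x

def anchored_extragradient(x0, gamma=0.5, iters=2000):
    x = x0.copy()
    for k in range(iters):
        beta = 1.0 / (k + 2)              # Halpern anchoring weight
        y = x - gamma * G(x)              # extrapolation (extra-gradient) step
        x = beta * x0 + (1.0 - beta) * (x - gamma * G(y))
    return x

x0 = np.array([1.0, 0.0])
x_final = anchored_extragradient(x0)
residual = np.linalg.norm(G(x_final))     # should shrink like O(1/k)
```

The anchor term pulls each iterate slightly back toward $x_0$ with vanishing weight, which is what converts last-iterate behavior into the $\mathcal{O}(1/k)$ rate on $\Vert G(x_k)\Vert$.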

【28】 Convergence of Laplacian Eigenmaps and its Rate for Submanifolds with Singularities 标题:具有奇点的子流形上拉普拉斯特征映射的收敛性及其速率 链接:https://arxiv.org/abs/2110.08138

作者:Masayuki Aino 备注:63 pages 摘要:本文利用子流形上的随机点构造的$\epsilon$-邻域图,给出了具有奇点的欧氏空间子流形上拉普拉斯算子的谱逼近结果。拉普拉斯算子特征值的收敛速度为$O\left(\left(\log n/n\right)^{1/(m+2)}\right)$,其中$m$和$n$分别表示流形的维数和样本大小。 摘要:In this paper, we give a spectral approximation result for the Laplacian on submanifolds of Euclidean spaces with singularities by the $\epsilon$-neighborhood graph constructed from random points on the submanifold. Our convergence rate for the eigenvalue of the Laplacian is $O\left(\left(\log n/n\right)^{1/(m+2)}\right)$, where $m$ and $n$ denote the dimension of the manifold and the sample size, respectively.
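The spectral approximation can be illustrated on the simplest closed manifold, the unit circle, whose Laplace-Beltrami eigenvalues are $k^2$, each with multiplicity 2. Taking ratios of graph-Laplacian eigenvalues sidesteps the normalization constant, and a deterministic grid replaces the paper's random sample for reproducibility; both are simplifications of this sketch.

```python
import numpy as np

n, eps = 300, 0.3
theta = 2.0 * np.pi * np.arange(n) / n
pts = np.column_stack([np.cos(theta), np.sin(theta)])   # unit circle in R^2

# epsilon-neighborhood graph: connect points at Euclidean distance <= eps.
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
W = ((d <= eps) & (d > 0)).astype(float)
L = np.diag(W.sum(axis=1)) - W

vals = np.sort(np.linalg.eigvalsh(L))
# Circle spectrum is 0, 1, 1, 4, 4, ... (up to a common scale), so the ratio
# of the second nontrivial eigenvalue pair to the first should be close to 4.
ratio = vals[3] / vals[1]
```

The doubled eigenvalues reproduce the circle's multiplicity-2 spectrum, and the ratio approaches 4 as $\epsilon \to 0$ and $n \to \infty$.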

【29】 Gradient Descent on Infinitely Wide Neural Networks: Global Convergence and Generalization 标题:无限宽神经网络的梯度下降:全局收敛性和泛化 链接:https://arxiv.org/abs/2110.08084

作者:Francis Bach,Lenaïc Chizat 机构:Inria & Ecole Normale Sup´erieure, PSL Research University, L´ena¨ıc Chizat, Ecole Polytechnique F´ed´erale de Lausanne 摘要:许多有监督的机器学习方法自然地被表述为优化问题。对于参数为线性的预测模型,这通常导致存在许多数学保证的凸问题。参数非线性的模型(如神经网络)则导致难以获得保证的非凸优化问题。在这篇综述文章中,我们考虑具有齐次激活函数的两层神经网络,其中隐藏神经元的数量趋于无穷大,并说明如何获得定性的收敛保证。 摘要:Many supervised machine learning methods are naturally cast as optimization problems. For prediction models which are linear in their parameters, this often leads to convex problems for which many mathematical guarantees exist. Models which are non-linear in their parameters such as neural networks lead to non-convex optimization problems for which guarantees are harder to obtain. In this review paper, we consider two-layer neural networks with homogeneous activation functions where the number of hidden neurons tends to infinity, and show how qualitative convergence guarantees may be derived.

【30】 Low-rank Matrix Recovery With Unknown Correspondence 标题:未知对应的低秩矩阵恢复算法 链接:https://arxiv.org/abs/2110.07959

作者:Zhiwei Tang,Tsung-Hui Chang,Xiaojing Ye,Hongyuan Zha 机构:The Chinese University of Hong Kong, Shenzhen, Georgia State University 摘要:我们研究了一个具有未知对应关系的矩阵恢复问题:给定观测矩阵$M_o=[A,\tilde P B]$,其中$\tilde P$是一个未知置换矩阵,我们的目标是恢复底层矩阵$M=[A,B]$。此类问题常出现在使用异构数据且它们之间的对应关系未知(例如出于隐私考虑)的许多应用中。我们证明了在$M$满足适当的低秩条件时,可以通过求解核范数极小化问题来恢复$M$,并对$M$的恢复给出可证明的非渐近误差界。我们提出了一种算法$\text{M}^3\text{O}$(通过最小-最大优化进行矩阵恢复),该算法将这一组合问题转化为连续的极小极大优化问题,并通过带Max-Oracle的近端梯度法求解。$\text{M}^3\text{O}$也可以应用于更一般的场景,即$M_o$中存在缺失条目,且存在多组具有不同未知对应关系的数据。在模拟数据、MovieLens 100K数据集和耶鲁B数据库上的实验表明,$\text{M}^3\text{O}$取得了优于多条基线的最先进性能,并能够以高精度恢复真实对应关系。 摘要:We study a matrix recovery problem with unknown correspondence: given the observation matrix $M_o=[A,\tilde P B]$, where $\tilde P$ is an unknown permutation matrix, we aim to recover the underlying matrix $M=[A,B]$. Such problem commonly arises in many applications where heterogeneous data are utilized and the correspondence among them are unknown, e.g., due to privacy concerns. We show that it is possible to recover $M$ via solving a nuclear norm minimization problem under a proper low-rank condition on $M$, with provable non-asymptotic error bound for the recovery of $M$. We propose an algorithm, $\text{M}^3\text{O}$ (Matrix recovery via Min-Max Optimization) which recasts this combinatorial problem as a continuous minimax optimization problem and solves it by proximal gradient with a Max-Oracle. $\text{M}^3\text{O}$ can also be applied to a more general scenario where we have missing entries in $M_o$ and multiple groups of data with distinct unknown correspondence. Experiments on simulated data, the MovieLens 100K dataset and Yale B database show that $\text{M}^3\text{O}$ achieves state-of-the-art performance over several baselines and can recover the ground-truth correspondence with high accuracy.
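The nuclear-norm component of the recovery can be sketched via its basic computational primitive, singular-value soft-thresholding (the proximal operator of the nuclear norm). Matrix sizes, ranks, and the threshold below are illustrative; the permutation and min-max machinery of $\text{M}^3\text{O}$ is not shown.

```python
import numpy as np

def svt(M, tau):
    # Singular-value soft-thresholding: prox of tau * (nuclear norm).
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
m = 20
U, _ = np.linalg.qr(rng.normal(size=(m, 2)))
V, _ = np.linalg.qr(rng.normal(size=(m, 2)))
M_true = U @ np.diag([5.0, 3.0]) @ V.T           # exactly rank 2
M_noisy = M_true + 0.1 * rng.normal(size=(m, m))

M_hat = svt(M_noisy, tau=1.0)                    # low-rank denoising step
err_noisy = np.linalg.norm(M_noisy - M_true)
err_hat = np.linalg.norm(M_hat - M_true)
```

Thresholding wipes out the small noise singular values while keeping the two large ones, so the output is (approximately) low rank and closer to the ground truth than the raw observation.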

【31】 A novel framework to quantify uncertainty in peptide-tandem mass spectrum matches with application to nanobody peptide identification 标题:一种量化肽-串联质谱匹配不确定性的新框架及其在纳米体肽鉴定中的应用 链接:https://arxiv.org/abs/2110.07818

作者:Chris McKennan,Zhe Sang,Yi Shi 机构:Department of Statistics, University of Pittsburgh, Department of Cell Biology, University of Pittsburgh 备注:19 pages, 7 figures in the main text; 59 pages, 15 figures including supplement 摘要:纳米体是从骆驼类动物中提取的小抗体片段,可选择性地与抗原结合。这些蛋白质具有显著的理化性质,支持先进的治疗方法,包括SARS-CoV-2治疗。为了实现其潜力,已经提出了通过液相色谱-串联质谱(LC-MS/MS)的自下而上蛋白质组学来在蛋白质组规模上鉴定抗原特异性纳米体,其中该管道的一个关键组成部分是将纳米体肽与由其产生的串联质谱相匹配。虽然肽谱匹配是一个研究得很好的问题,但我们发现纳米体肽之间的序列相似性违反了用标准目标诱饵范式推断纳米体肽谱匹配(PSM)所需的关键假设,并证明这些违反导致了过高的错误率。为了解决这些问题,我们开发了一种新的框架和方法,将肽谱匹配视为具有不完整模型空间的贝叶斯模型选择问题,据我们所知,这是第一个不依赖上述假设而解释PSM误差所有来源的方法。 摘要:Nanobodies are small antibody fragments derived from camelids that selectively bind to antigens. These proteins have marked physicochemical properties that support advanced therapeutics, including treatments for SARS-CoV-2. To realize their potential, bottom-up proteomics via liquid chromatography-tandem mass spectrometry (LC-MS/MS) has been proposed to identify antigen-specific nanobodies at the proteome scale, where a critical component of this pipeline is matching nanobody peptides to their begotten tandem mass spectra. While peptide-spectrum matching is a well-studied problem, we show the sequence similarity between nanobody peptides violates key assumptions necessary to infer nanobody peptide-spectrum matches (PSMs) with the standard target-decoy paradigm, and prove these violations beget inflated error rates. To address these issues, we then develop a novel framework and method that treats peptide-spectrum matching as a Bayesian model selection problem with an incomplete model space, which are, to our knowledge, the first to account for all sources of PSM error without relying on the aforementioned assumptions.
In addition to illustrating our method's improved performance on simulated and real nanobody data, our work demonstrates how to leverage novel retention time and spectrum prediction tools to develop accurate and discriminating data-generating models, and, to our knowledge, provides the first rigorous description of MS/MS spectrum noise.

【32】 Towards Statistical and Computational Complexities of Polyak Step Size Gradient Descent 标题:Polyak步长梯度下降的统计和计算复杂度研究 链接:https://arxiv.org/abs/2110.07810

作者:Tongzheng Ren,Fuheng Cui,Alexia Atsidakou,Sujay Sanghavi,Nhat Ho 机构:Department of Computer Science, University of Texas at Austin⋄, Department of Statistics and Data Sciences, University of Texas at Austin♭, Department of Electrical and Computer Engineering, University of Texas at Austin† 备注:First three authors contributed equally. 40 pages, 4 figures 摘要:我们研究了Polyak步长梯度下降算法在如下条件下的统计和计算复杂性:总体损失函数(即样本量趋于无穷大时经验损失函数的极限)满足广义光滑性和Łojasiewicz条件,且经验损失函数梯度与总体损失函数梯度之间满足稳定性,即样本与总体损失函数梯度之间集中界的多项式增长。我们证明了Polyak步长梯度下降迭代在关于样本量的对数次迭代后,达到真参数周围的最终统计收敛半径。当总体损失函数不是局部强凸时,固定步长梯度下降算法需要关于样本量的多项式次迭代才能达到相同的最终统计半径,相比之下Polyak步长的计算代价更低。最后,我们在三个统计例子下展示了我们的一般理论:广义线性模型、混合模型和混合线性回归模型。 摘要:We study the statistical and computational complexities of the Polyak step size gradient descent algorithm under generalized smoothness and Lojasiewicz conditions of the population loss function, namely, the limit of the empirical loss function when the sample size goes to infinity, and the stability between the gradients of the empirical and population loss functions, namely, the polynomial growth on the concentration bound between the gradients of sample and population loss functions. We demonstrate that the Polyak step size gradient descent iterates reach a final statistical radius of convergence around the true parameter after logarithmic number of iterations in terms of the sample size. It is computationally cheaper than the polynomial number of iterations on the sample size of the fixed-step size gradient descent algorithm to reach the same final statistical radius when the population loss function is not locally strongly convex. Finally, we illustrate our general theory under three statistical examples: generalized linear model, mixture model, and mixed linear regression model.
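A one-dimensional illustration of the paper's message: on $f(x)=x^4$, which is convex but not locally strongly convex at its minimum, the Polyak step $\eta_k=(f(x_k)-f^\ast)/\Vert\nabla f(x_k)\Vert^2$ contracts linearly while a fixed step crawls. Knowing $f^\ast=0$ is assumed, as the Polyak rule requires; the objective and step constants are illustrative.

```python
def grad_descent(step_rule, x0=2.0, iters=50):
    x = x0
    for _ in range(iters):
        g = 4.0 * x ** 3                  # gradient of f(x) = x^4
        if g == 0.0:
            break
        x -= step_rule(x, g) * g
    return x

# Polyak step: (f(x) - f_star) / ||grad||^2 with f_star = 0 known.
# For f(x) = x^4 this simplifies to x_{k+1} = 0.75 * x_k: linear contraction.
polyak_x = grad_descent(lambda x, g: (x ** 4) / (g * g))

# Fixed small step for comparison: only polynomial decay toward 0.
fixed_x = grad_descent(lambda x, g: 0.01)
```

After 50 iterations the Polyak iterate is within about $10^{-6}$ of the optimum, while the fixed-step iterate is still of order $0.5$, matching the logarithmic-versus-polynomial iteration counts in the theorem.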

【33】 Scalable Causal Structure Learning: New Opportunities in Biomedicine 标题:可扩展因果结构学习:生物医学的新机遇 链接:https://arxiv.org/abs/2110.07785

作者:Pulakesh Upadhyaya,Kai Zhang,Can Li,Xiaoqian Jiang,Yejin Kim 机构:School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas 摘要:本文提供了一个关于流行的因果结构学习模型的实用教程,并以实际数据为例,帮助医疗保健受众理解和应用这些模型。我们回顾了传统的、基于分数的和基于机器学习的因果结构发现方案,研究了它们在一些基准数据集上的性能,并讨论了它们在生物医学中的一些应用。在数据充足的情况下,基于机器学习的方法可以扩展,可以包含比传统方法更多的变量,并且可以潜在地应用于许多生物医学应用。 摘要:This paper gives a practical tutorial on popular causal structure learning models with examples of real-world data to help healthcare audiences understand and apply them. We review prominent traditional, score-based and machine-learning based schemes for causal structure discovery, study some of their performance over some benchmark datasets, and discuss some of the applications to biomedicine. In the case of sufficient data, machine learning-based approaches can be scalable, can include a greater number of variables than traditional approaches, and can potentially be applied in many biomedical applications.

【34】 Areas on the space of smooth probability density functions on S^2 标题:S^2上光滑概率密度函数空间上的面积 链接:https://arxiv.org/abs/2110.07773

作者:J. C. Ruíz-Pantaleón,P. Suárez-Serrato 备注:9 pages, 2 algorithms, 3 tables 摘要:我们提出了计算平面、2-环面和2-球面正密度测度空间上泊松括号的符号和数值方法。我们应用我们的方法来计算2-球情况下有限区域的辛面积,包括正密度高斯测度的一个显式例子。 摘要:We present symbolic and numerical methods for computing Poisson brackets on the spaces of measures with positive densities of the plane, the 2-torus, and the 2-sphere. We apply our methods to compute symplectic areas of finite regions for the case of the 2-sphere, including an explicit example for Gaussian measures with positive densities.

【35】 Leveraging Spatial and Temporal Correlations in Sparsified Mean Estimation 标题:利用时空相关性进行稀疏均值估计 链接:https://arxiv.org/abs/2110.07751

作者:Divyansh Jhunjhunwala,Ankur Mallick,Advait Gadhikar,Swanand Kadhe,Gauri Joshi 机构:Carnegie Mellon University, Pittsburgh, PA , University of California Berkeley, Berkeley, CA 备注:Accepted to NeurIPS 2021 摘要:我们研究在中央服务器上估计分布于多个节点的一组向量(每个节点一个向量)的平均值的问题。当向量是高维时,发送整个向量的通信成本可能过高,节点可能必须使用稀疏化技术。虽然大多数关于稀疏化均值估计的现有工作与数据向量的特性无关,但在联邦学习等许多实际应用中,数据向量中可能存在空间相关性(不同节点发送的向量之间的相似性)或时间相关性(单个节点在算法不同迭代中发送的数据之间的相似性)。我们仅通过修改服务器用于估计均值的解码方法来利用这些相关性。我们给出了所得估计误差的分析,以及PCA、K-均值和逻辑回归的实验;实验表明我们的估计量始终优于更复杂和昂贵的稀疏化方法。 摘要:We study the problem of estimating at a central server the mean of a set of vectors distributed across several nodes (one vector per node). When the vectors are high-dimensional, the communication cost of sending entire vectors may be prohibitive, and it may be imperative for them to use sparsification techniques. While most existing work on sparsified mean estimation is agnostic to the characteristics of the data vectors, in many practical applications such as federated learning, there may be spatial correlations (similarities in the vectors sent by different nodes) or temporal correlations (similarities in the data sent by a single node over different iterations of the algorithm) in the data vectors. We leverage these correlations by simply modifying the decoding method used by the server to estimate the mean. We provide an analysis of the resulting estimation error as well as experiments for PCA, K-Means and Logistic Regression, which show that our estimators consistently outperform more sophisticated and expensive sparsification methods.
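One simple way to see the value of temporal correlation (a sketch, not the paper's actual decoder): if the node sparsifies the change from the server's running estimate rather than the raw vector, unsent coordinates fall back to last round's values instead of zero. The dimensions, sparsity level, and noise scale below are illustrative.

```python
import numpy as np

def topk(v, k):
    # Keep only the k largest-magnitude coordinates.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(0)
d, k, T = 50, 10, 20
base = rng.normal(size=d)                 # slowly varying data at one node

est = np.zeros(d)                         # server's running estimate
err_naive = err_corr = 0.0
for t in range(T):
    v = base + 0.05 * rng.normal(size=d)  # temporally correlated stream
    err_naive += np.linalg.norm(topk(v, k) - v)   # memoryless top-k decoding
    est = est + topk(v - est, k)          # transmit top-k of the *change*
    err_corr += np.linalg.norm(est - v)
```

The memoryless decoder pays the full tail error every round, while the correlation-aware decoder's error shrinks as the running estimate covers more coordinates.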

【36】 Emerging Directions in Geophysical Inversion 标题:地球物理反演的新兴方向 链接:https://arxiv.org/abs/2110.06017

作者:Andrew Valentine,Malcolm Sambridge 机构:Department of Earth Science, Durham University, South Road, Durham, DH,LE, UK., Research School of Earth Sciences, The Australian National University, Mills Road, Acton ACT , Australia. 备注:30 pages. Original manuscript submitted for review as a chapter of "Data Assimilation and Inverse Problems in Geophysical Sciences", eds. Alik Ismail-Zadeh, Fabio Castelli, Dylan Jones and Sabrina Sanchez 摘要:在本章中,我们概述了地球物理反演领域的一些最新发展。我们的目的是对当前研究的广度提供一个易于理解的总体介绍,而不是深入讨论特定主题。特别是,我们希望读者能够体会不同方法之间的相似性和联系,以及它们的相对优势和劣势。 摘要:In this chapter, we survey some recent developments in the field of geophysical inversion. We aim to provide an accessible general introduction to the breadth of current research, rather than focussing in depth on particular topics. In particular, we hope to give the reader an appreciation for the similarities and connections between different approaches, and their relative strengths and weaknesses.

机器翻译,仅供参考

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2021-10-18,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 arXiv每日学术速递 微信公众号。
