
Statistics arXiv Daily Digest [2021-09-13]

Author: arXiv每日学术速递 (arXiv Daily, WeChat official account)
Published 2021-09-16 10:58:49


stat (Statistics): 20 papers in total

【1】 Best-Arm Identification in Correlated Multi-Armed Bandits
Link: https://arxiv.org/abs/2109.04941

Authors: Samarth Gupta, Gauri Joshi, Osman Yağan (Carnegie Mellon University, Pittsburgh, PA)
Abstract: In this paper we consider the problem of best-arm identification in multi-armed bandits in the fixed confidence setting, where the goal is to identify, with probability $1-\delta$ for some $\delta>0$, the arm with the highest mean reward in the minimum possible number of samples from the set of arms $\mathcal{K}$. Most existing best-arm identification algorithms and analyses operate under the assumption that the rewards corresponding to different arms are independent of each other. We propose a novel correlated bandit framework that captures domain knowledge about correlation between arms in the form of upper bounds on the expected conditional reward of an arm, given a reward realization from another arm. Our proposed algorithm C-LUCB, which generalizes the LUCB algorithm, utilizes this partial knowledge of correlations to sharply reduce the sample complexity of best-arm identification.
More interestingly, we show that the total number of samples obtained by C-LUCB is of the form $\mathcal{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$, as opposed to the typical $\mathcal{O}\left(\sum_{k \in \mathcal{K}} \log\left(\frac{1}{\delta}\right)\right)$ samples required in the independent reward setting. The improvement comes because the $\mathcal{O}(\log(1/\delta))$ term is summed only over the set of competitive arms $\mathcal{C}$, which is a subset of the original set of arms $\mathcal{K}$. Depending on the problem setting, the size of the set $\mathcal{C}$ can be as small as $2$, and hence using C-LUCB in the correlated bandit setting can lead to significant performance improvements. Our theoretical findings are supported by experiments on the Movielens and Goodreads recommendation datasets.
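C-LUCB itself is not publicly specified in this digest, but the LUCB scheme it generalizes is standard: repeatedly sample the empirical best arm and its strongest challenger until the best arm's lower confidence bound dominates every other arm's upper bound. A minimal sketch (plain LUCB, with no correlation information, and a generic Hoeffding-style confidence radius chosen here for illustration):

```python
import numpy as np

def lucb(pull, n_arms, delta=0.05, max_pulls=100_000):
    """Fixed-confidence best-arm identification with plain LUCB (a sketch,
    not the paper's C-LUCB: no correlation knowledge is used).

    `pull(k)` must return a stochastic reward in [0, 1] for arm k.
    """
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for k in range(n_arms):                      # initialise: pull every arm once
        sums[k] += pull(k)
        counts[k] += 1
    t = n_arms
    while t < max_pulls:
        means = sums / counts
        # anytime Hoeffding-style confidence radius (one of many valid choices)
        rad = np.sqrt(np.log(4.0 * n_arms * counts**2 / delta) / (2.0 * counts))
        best = int(np.argmax(means))
        ucb = means + rad
        ucb[best] = -np.inf                      # challenger = best UCB among the rest
        challenger = int(np.argmax(ucb))
        # stop when the best arm's LCB dominates the challenger's UCB
        if means[best] - rad[best] >= means[challenger] + rad[challenger]:
            return best
        for k in (best, challenger):             # sample both contenders
            sums[k] += pull(k)
            counts[k] += 1
            t += 1
    return int(np.argmax(sums / counts))
```

On three Bernoulli arms with means 0.2, 0.5 and 0.9 this stops after a few hundred pulls and returns the last arm; C-LUCB's point is that correlation bounds let the non-competitive arms drop out of the $\log(1/\delta)$ sum.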

【2】 On the eigenvalues associated with the limit null distribution of the Epps-Pulley test of normality
Link: https://arxiv.org/abs/2109.04897

Authors: Bruno Ebner, Norbert Henze
Note: 11 pages, 3 tables
Abstract: The Shapiro-Wilk test (SW) and the Anderson-Darling test (AD) have turned out to be strong procedures for testing for normality. They are joined by a class of tests for normality proposed by Epps and Pulley that, in contrast to SW and AD, have been extended by Baringhaus and Henze to yield easy-to-use affine invariant and universally consistent tests for normality in any dimension. The limit null distribution of the Epps-Pulley test involves a sequence of eigenvalues of a certain integral operator induced by the covariance kernel of the limiting Gaussian process. We solve the associated integral equation and present the corresponding eigenvalues.
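The paper solves the integral equation analytically; the generic numerical alternative is a Nyström discretization, where eigenvalues of the kernel matrix scaled by the quadrature weight approximate the operator's eigenvalues. A sketch on the Brownian-bridge kernel $K(s,t)=\min(s,t)-st$, whose exact eigenvalues $1/(k\pi)^2$ are known (the Epps-Pulley kernel itself is not reproduced here):

```python
import numpy as np

# Nystrom approximation of integral-operator eigenvalues, illustrated on the
# Brownian-bridge covariance kernel K(s, t) = min(s, t) - s*t on [0, 1].
# (Stand-in kernel: the paper's Epps-Pulley kernel is different.)
n = 1000
t = (np.arange(n) + 0.5) / n                       # midpoint quadrature grid
K = np.minimum.outer(t, t) - np.outer(t, t)        # kernel matrix K(t_i, t_j)
approx = np.sort(np.linalg.eigvalsh(K / n))[::-1]  # operator eigenvalue estimates
exact = 1.0 / (np.arange(1, 6) * np.pi) ** 2       # known eigenvalues 1/(k*pi)^2
```

With n = 1000 midpoints the leading five approximate eigenvalues agree with the closed-form values to well under 0.1% relative error.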

【3】 Neural Networks for Latent Budget Analysis of Compositional Data
Link: https://arxiv.org/abs/2109.04875

Authors: Zhenwei Yang, Ayoub Bagheri, P. G. M. van der Heijden (Utrecht University, Utrecht, The Netherlands)
Abstract: Compositional data are non-negative data collected in a rectangular matrix with a constant row sum. Due to the non-negativity, the focus is on conditional proportions that add up to 1 for each row. A row of conditional proportions is called an observed budget. Latent budget analysis (LBA) assumes a mixture of latent budgets that explains the observed budgets. LBA is usually fitted to a contingency table, where the rows are the levels of one or more explanatory variables and the columns the levels of a response variable. In prospective studies, there is only knowledge about the explanatory variables of individuals, and interest lies in predicting the response variable. Thus, a form of LBA is needed that has the functionality of prediction. Previous studies proposed a constrained neural network (NN) extension of LBA that was hampered by an unsatisfying prediction ability. Here we propose LBA-NN, a feed-forward NN model that yields a similar interpretation to LBA but equips LBA with a better ability of prediction. A stable and plausible interpretation of LBA-NN is obtained through the use of importance plots and tables, which show the relative importance of all explanatory variables on the response variable.
An LBA-NN-K-means approach that applies K-means clustering on the importance table is used to produce K clusters that are comparable to the K latent budgets in LBA. We provide different experiments where LBA-NN is implemented and compared with LBA. In our analysis, LBA-NN outperforms LBA in prediction in terms of accuracy, specificity, recall and mean squared error. We provide open-source software at GitHub.
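The "observed budget" construction above is simple to make concrete: divide each row of a contingency table by its row total. A minimal sketch with an invented 2x3 table:

```python
import numpy as np

# Observed budgets: rows of conditional proportions from a contingency table
# (rows = levels of the explanatory variable, columns = response categories).
# The counts below are illustrative, not from the paper.
counts = np.array([[30.0, 50.0, 20.0],
                   [10.0, 60.0, 30.0]])
budgets = counts / counts.sum(axis=1, keepdims=True)  # each row now sums to 1
```

LBA then models each such row as a mixture of a small number of latent budgets.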

【4】 Reducing bias and alleviating the influence of excess of zeros with multioutcome adaptive LAD-lasso
Link: https://arxiv.org/abs/2109.04798

Authors: Jyrki Möttönen, Tero Lähderanta, Janne Salonen, Mikko J. Sillanpää (Finnish Centre for Pensions and Keva, Helsinki; University of Oulu, Oulu, Finland)
Abstract: Zero-inflated explanatory variables are common in fields such as ecology and finance. In this paper we address the problem of having an excess of zero values in some explanatory variables which are subject to multioutcome lasso-regularized variable selection. Briefly, the problem results from the failure of lasso-type shrinkage methods to recognize any difference between a zero value occurring in the regression coefficient and one occurring in the corresponding value of the explanatory variable. This kind of confounding obviously increases the number of false positives, since not all non-zero regression coefficients necessarily represent real outcome effects. We present here the adaptive LAD-lasso for multiple outcomes, which extends earlier work on the multivariate LAD-lasso with adaptive penalization. In addition to the well-known property of yielding less biased regression coefficients, we show how adaptivity also improves the method's ability to recover from the influence of excess zero values measured in continuous covariates.
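Because both the absolute-residual loss and the L1 penalty are piecewise linear, a single-outcome LAD-lasso can be solved exactly as a linear program. A sketch under that simplification (the paper's multioutcome adaptive version is richer; here `lam` stands in for the adaptive per-coefficient penalties):

```python
import numpy as np
from scipy.optimize import linprog

def lad_lasso(X, y, lam):
    """Single-outcome LAD-lasso via linear programming (illustrative).

    Minimises  sum_i |y_i - x_i' beta| + sum_j lam_j |beta_j|.
    Pass a vector `lam` to mimic adaptive (per-coefficient) penalisation.
    """
    n, p = X.shape
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (p,))
    # LP variables: beta_plus (p), beta_minus (p), residual bounds u (n)
    c = np.concatenate([lam, lam, np.ones(n)])
    I = np.eye(n)
    # encode u >= y - X beta  and  u >= -(y - X beta), beta = b+ - b-
    A_ub = np.vstack([np.hstack([-X,  X, -I]),
                      np.hstack([ X, -X, -I])])
    b_ub = np.concatenate([-y, y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:2 * p]
```

On noiseless toy data with true coefficients (2, 0, -1) and a tiny penalty, the LP recovers the sparse truth essentially exactly, which is the behaviour the adaptive weighting is designed to preserve under zero inflation.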

【5】 Low-rank statistical finite elements for scalable model-data synthesis
Link: https://arxiv.org/abs/2109.04757

Authors: Connor Duffin, Edward Cripps, Thomas Stemler, Mark Girolami (The University of Western Australia, Crawley, WA, Australia; University of Cambridge, Cambridge, UK)
Note: 32 pages, 12 figures, submitted version
Abstract: Statistical learning additions to physically derived mathematical models are gaining traction in the literature. A recent approach has been to augment the underlying physics of the governing equations with data-driven Bayesian statistical methodology. Coined statFEM, the method acknowledges a priori model misspecification by embedding stochastic forcing within the governing equations. Upon receipt of additional data, the posterior distribution of the discretised finite element solution is updated using classical Bayesian filtering techniques. The resultant posterior jointly quantifies uncertainty associated with the ubiquitous problem of model misspecification and the data intended to represent the true process of interest. Despite this appeal, computational scalability is a challenge to statFEM's application to high-dimensional problems typically experienced in physical and industrial contexts. This article overcomes this hurdle by embedding a low-rank approximation of the underlying dense covariance matrix, obtained from the leading-order modes of the full-rank alternative.
Demonstrated on a series of reaction-diffusion problems of increasing dimension, using experimental and simulated data, the method reconstructs the sparsely observed data-generating processes with minimal loss of information, in both the posterior mean and the variance, paving the way for further integration of physical and probabilistic approaches to complex systems.
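The core computational idea, truncating a dense covariance to its leading eigenmodes, can be sketched in a few lines. A minimal example on an invented smooth kernel matrix standing in for the statFEM prior covariance:

```python
import numpy as np

def low_rank_approx(C, r):
    """Rank-r approximation of a dense covariance from its r leading eigenmodes."""
    w, V = np.linalg.eigh(C)                   # eigenvalues in ascending order
    w, V = w[::-1][:r], V[:, ::-1][:, :r]      # keep the leading r modes
    return (V * w) @ V.T                       # V diag(w) V'

# Toy stand-in covariance: a smooth squared-exponential kernel matrix.
x = np.linspace(0.0, 1.0, 200)
C = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)
C5 = low_rank_approx(C, 5)
rel_err = np.linalg.norm(C - C5) / np.linalg.norm(C)
```

Smooth kernels have rapidly decaying spectra, so even rank 5 of 200 leaves only a small Frobenius error, which is what makes the low-rank statFEM filter cheap.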

【6】 Implicit Copulas: An Overview
Link: https://arxiv.org/abs/2109.04718

Author: Michael Stanley Smith (Melbourne Business School, University of Melbourne, Carlton, Australia)
Abstract: Implicit copulas are the most common copula choice for modeling dependence in high dimensions. This broad class of copulas is introduced and surveyed, including elliptical copulas, skew $t$ copulas, factor copulas, time series copulas and regression copulas. The common auxiliary representation of implicit copulas is outlined, along with how this makes them both scalable and tractable for statistical modeling. Issues such as parameter identification, extended likelihoods for discrete or mixed data, parsimony in high dimensions, and simulation from the copula model are considered. Bayesian approaches to estimate the copula parameters, and to predict from an implicit copula model, are outlined. Particular attention is given to implicit copula processes constructed from time series and regression models, an area at the forefront of current research. Two econometric applications -- one from macroeconomic time series and the other from financial asset pricing -- illustrate the advantages of implicit copula models.
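The simplest implicit copula is the Gaussian one: push correlated normals through the standard normal CDF to obtain dependent uniform margins. A minimal sampling sketch (bivariate, correlation 0.7 chosen for illustration):

```python
import numpy as np
from scipy.stats import norm

def sample_gaussian_copula(R, n, rng):
    """Draw n samples from the Gaussian copula with correlation matrix R."""
    L = np.linalg.cholesky(R)
    z = rng.standard_normal((n, R.shape[0])) @ L.T  # correlated normals
    return norm.cdf(z)                              # each margin is Uniform(0, 1)

rng = np.random.default_rng(0)
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])
u = sample_gaussian_copula(R, 50_000, rng)
```

Arbitrary margins follow by applying inverse CDFs to the columns of `u`; the "implicit" copulas surveyed in the paper generalize this construction beyond the Gaussian.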

【7】 Latent space projection predictive inference
Link: https://arxiv.org/abs/2109.04702

Authors: Alejandro Catalina, Paul Bürkner, Aki Vehtari
Abstract: Given a reference model that includes all the available variables, projection predictive inference replaces its posterior with a constrained projection including only a subset of all variables. We extend projection predictive inference to enable computationally efficient variable and structure selection in models outside the exponential family. By adopting a latent space projection predictive perspective we are able to: 1) propose a unified and general framework to do variable selection in complex models while fully honouring the original model structure, 2) properly identify relevant structure and retain posterior uncertainties from the original model, and 3) provide an improved approach also for non-Gaussian models in the exponential family. We demonstrate the superior performance of our approach by thoroughly testing and comparing it against popular variable selection approaches in a wide range of settings, including realistic data sets. Our results show that our approach successfully recovers relevant terms and model structure in complex models, selecting fewer variables than competing approaches for realistic datasets.

【8】 Interaction Models and Generalized Score Matching for Compositional Data
Link: https://arxiv.org/abs/2109.04671

Authors: Shiqing Yu, Mathias Drton, Ali Shojaie (University of Washington, Seattle, USA; Technical University of Munich, Garching bei München, Germany)
Note: 41 pages, 19 figures
Abstract: Applications such as the analysis of microbiome data have led to renewed interest in statistical methods for compositional data, i.e., multivariate data in the form of probability vectors that contain relative proportions. In particular, there is considerable interest in modeling interactions among such relative proportions. To this end we propose a class of exponential family models that accommodate general patterns of pairwise interaction while being supported on the probability simplex. Special cases include the family of Dirichlet distributions as well as Aitchison's additive logistic normal distributions. Generally, the distributions we consider have a density that features a difficult-to-compute normalizing constant. To circumvent this issue, we design effective estimation methods based on generalized versions of score matching. A high-dimensional analysis of our estimation methods shows that the simplex domain is handled as efficiently as previously studied full-dimensional domains.

【9】 Principal component analysis for high-dimensional compositional data
Link: https://arxiv.org/abs/2109.04657

Authors: Jingru Zhang, Wei Lin (University of Pennsylvania; Peking University)
Abstract: Dimension reduction for high-dimensional compositional data plays an important role in many fields, where the principal component analysis of the basis covariance matrix is of scientific interest. In practice, however, the basis variables are latent and rarely observed, and standard techniques of principal component analysis are inadequate for compositional data because of the simplex constraint. To address this challenging problem, we relate the principal subspace of the centered log-ratio compositional covariance to that of the basis covariance, and prove that the latter is approximately identifiable with diverging dimensionality under some subspace sparsity assumption. This interesting blessing-of-dimensionality phenomenon enables us to propose principal subspace estimation methods using the sample centered log-ratio covariance. We also derive nonasymptotic error bounds for the subspace estimators, which exhibit a tradeoff between identification and estimation. Moreover, we develop efficient proximal alternating direction method of multipliers algorithms to solve the nonconvex and nonsmooth optimization problems. Simulation results demonstrate that the proposed methods perform as well as the oracle methods with known basis. Their usefulness is illustrated through an analysis of word usage patterns for statisticians.
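The starting object here, the sample centered log-ratio (clr) covariance, is easy to compute: take elementwise logs of each composition and subtract the row mean. A minimal sketch of clr followed by a plain eigendecomposition (the paper's refined sparse-subspace estimators are not reproduced):

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform of compositional data (rows on the simplex)."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)   # rows now sum to zero

def clr_pca(X, r):
    """Leading r eigenvectors of the sample clr covariance (basis-covariance proxy)."""
    Z = clr(X)
    Z = Z - Z.mean(axis=0, keepdims=True)
    cov = Z.T @ Z / (Z.shape[0] - 1)
    w, V = np.linalg.eigh(cov)
    return V[:, ::-1][:, :r]
```

Usage: for Dirichlet-simulated compositions, `clr_pca(X, 2)` returns an orthonormal basis of the estimated two-dimensional principal subspace.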

【10】 Differential Privacy in Personalized Pricing with Nonparametric Demand Models
Link: https://arxiv.org/abs/2109.04615

Authors: Xi Chen, Sentao Miao, Yining Wang (New York University; McGill University; University of Florida)
Abstract: In recent decades, the advance of information technology and abundant personal data have facilitated the application of algorithmic personalized pricing. However, this leads to a growing concern about potential privacy violations due to adversarial attack. To address the privacy issue, this paper studies a dynamic personalized pricing problem with \textit{unknown} nonparametric demand models under data privacy protection. Two concepts of data privacy, which have been widely applied in practice, are introduced: \textit{central differential privacy (CDP)} and \textit{local differential privacy (LDP)}, the latter of which is proved to be stronger than CDP in many cases. We develop two algorithms which make pricing decisions and learn the unknown demand on the fly, while satisfying the CDP and LDP guarantees respectively. In particular, for the algorithm with the CDP guarantee, the regret is proved to be at most $\tilde O(T^{(d+2)/(d+4)}+\varepsilon^{-1}T^{d/(d+4)})$.
Here, the parameter $T$ denotes the length of the time horizon, $d$ is the dimension of the personalized information vector, and the key parameter $\varepsilon>0$ measures the strength of privacy (smaller $\varepsilon$ indicates stronger privacy protection). On the other hand, for the algorithm with the LDP guarantee, the regret is proved to be at most $\tilde O(\varepsilon^{-2/(d+2)}T^{(d+1)/(d+2)})$, which is near-optimal as we prove a lower bound of $\Omega(\varepsilon^{-2/(d+2)}T^{(d+1)/(d+2)})$ for any algorithm with an LDP guarantee.
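The $\varepsilon$-dependence in these bounds reflects the standard privacy-utility trade-off: achieving $\varepsilon$-differential privacy for a statistic with sensitivity $\Delta$ requires noise of scale $\Delta/\varepsilon$. A sketch of the classical Laplace mechanism (generic, not the paper's pricing algorithm):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng):
    """Release `value` with eps-differential privacy by adding Laplace noise.

    Smaller eps (stronger privacy) means a larger noise scale sensitivity/eps,
    which is exactly what drives the eps terms in the regret bounds above.
    """
    scale = sensitivity / eps
    return value + rng.laplace(scale=scale, size=np.shape(value))
```

For example, with sensitivity 1 and $\varepsilon = 0.5$, the released values carry zero-mean Laplace noise with standard deviation $\sqrt{2}\cdot 2 \approx 2.83$.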

【11】 Supervising the Decoder of Variational Autoencoders to Improve Scientific Utility
Link: https://arxiv.org/abs/2109.04561

Authors: Liyun Tu, Austin Talbot, Neil Gallagher, David Carlson
Abstract: Probabilistic generative models are attractive for scientific modeling because their inferred parameters can be used to generate hypotheses and design experiments. This requires that the learned model provide an accurate representation of the input data and yield a latent space that effectively predicts outcomes relevant to the scientific question. Supervised Variational Autoencoders (SVAEs) have previously been used for this purpose, where a carefully designed decoder can be used as an interpretable generative model while the supervised objective ensures a predictive latent representation. Unfortunately, the supervised objective forces the encoder to learn a biased approximation to the generative posterior distribution, which renders the generative parameters unreliable when used in scientific models. This issue has remained undetected as reconstruction losses commonly used to evaluate model performance do not detect bias in the encoder. We address this previously unreported issue by developing a second-order supervision framework (SOS-VAE) that influences the decoder to induce a predictive latent representation. This ensures that the associated encoder maintains a reliable generative interpretation. We extend this technique to allow the user to trade off some bias in the generative parameters for improved predictive performance, acting as an intermediate option between SVAEs and our new SOS-VAE.
We also use this methodology to address missing data issues that often arise when combining recordings from multiple scientific experiments. We demonstrate the effectiveness of these developments using synthetic data and electrophysiological recordings with an emphasis on how our learned representations can be used to design scientific experiments.

【12】 Ergodic Limits, Relaxations, and Geometric Properties of Random Walk Node Embeddings
Link: https://arxiv.org/abs/2109.04526

Authors: Christy Lin, Daniel Sussman, Prakash Ishwar (Boston University)
Abstract: Random walk based node embedding algorithms learn vector representations of nodes by optimizing an objective function of node embedding vectors and skip-bigram statistics computed from random walks on the network. They have been applied to many supervised learning problems such as link prediction and node classification and have demonstrated state-of-the-art performance. Yet, their properties remain poorly understood. This paper studies properties of random walk based node embeddings in the unsupervised setting of discovering hidden block structure in the network, i.e., learning node representations whose cluster structure in Euclidean space reflects their adjacency structure within the network. We characterize the ergodic limits of the embedding objective, its generalization, and related convex relaxations to derive corresponding non-randomized versions of the node embedding objectives. We also characterize the optimal node embedding Grammians of the non-randomized objectives for the expected graph of a two-community Stochastic Block Model (SBM). We prove that the solution Grammian has rank $1$ for a suitable nuclear norm relaxation of the non-randomized objective. Comprehensive experimental results on SBM random networks reveal that our non-randomized ergodic objectives yield node embeddings whose distribution is Gaussian-like, centered at the node embeddings of the expected network within each community, and concentrate in the linear degree-scaling regime as the number of nodes increases.
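The skip-bigram statistics these objectives are built from are co-occurrence counts of node pairs within a window along random walks. A minimal sketch for a dense adjacency matrix (illustrative parameter choices, not the paper's setup):

```python
import numpy as np

def walk_cooccurrence(adj, n_walks, walk_len, window, rng):
    """Skip-bigram co-occurrence counts from simple random walks on a graph.

    `adj` is a dense 0/1 adjacency matrix; the returned matrix C counts how
    often each node pair appears within `window` steps of each other.
    """
    n = adj.shape[0]
    C = np.zeros((n, n))
    for start in range(n):
        for _ in range(n_walks):
            v, walk = start, [start]
            for _ in range(walk_len - 1):
                nbrs = np.flatnonzero(adj[v])     # uniform random-walk step
                v = rng.choice(nbrs)
                walk.append(v)
            for i, a in enumerate(walk):          # count pairs within the window
                for b in walk[i + 1 : i + 1 + window]:
                    C[a, b] += 1
                    C[b, a] += 1
    return C
```

On a graph with two disconnected communities the cross-community counts are exactly zero, which is the block structure the embeddings are meant to recover.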

【13】 A Proximal Distance Algorithm for Likelihood-Based Sparse Covariance Estimation
Link: https://arxiv.org/abs/2109.04497

Authors: Jason Xu, Kenneth Lange (Duke University; University of California, Los Angeles)
Note: 29 pages; 5 figures; 4 tables
Abstract: This paper addresses the task of estimating a covariance matrix under a patternless sparsity assumption. In contrast to existing approaches based on thresholding or shrinkage penalties, we propose a likelihood-based method that regularizes the distance from the covariance estimate to a symmetric sparsity set. This formulation avoids unwanted shrinkage induced by more common norm penalties and enables optimization of the resulting non-convex objective by solving a sequence of smooth, unconstrained subproblems. These subproblems are generated and solved via the proximal distance version of the majorization-minimization principle. The resulting algorithm executes rapidly, gracefully handles settings where the number of parameters exceeds the number of cases, yields a positive definite solution, and enjoys desirable convergence properties. Empirically, we demonstrate that our approach outperforms competing methods by several metrics across a suite of simulated experiments. Its merits are illustrated on an international migration dataset and a classic case study on flow cytometry. Our findings suggest that the marginal and conditional dependency networks for the cell signalling data are more similar than previously concluded.
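One ingredient of any proximal distance scheme for this problem is the Euclidean projection onto the sparsity set, which for "at most k nonzero off-diagonal pairs" amounts to keeping the k largest-magnitude off-diagonal entries. A sketch of just that projection (the full MM iteration from the paper is not reproduced):

```python
import numpy as np

def project_sparse_sym(S, k):
    """Project a symmetric matrix onto the set of symmetric matrices with at
    most k nonzero off-diagonal pairs (diagonal kept), by magnitude."""
    P = np.diag(np.diag(S)).astype(float)
    iu = np.triu_indices_from(S, k=1)          # strict upper triangle
    vals = S[iu]
    keep = np.argsort(np.abs(vals))[::-1][:k]  # k largest in magnitude
    rows, cols = iu[0][keep], iu[1][keep]
    P[rows, cols] = vals[keep]
    P[cols, rows] = vals[keep]                 # restore symmetry
    return P
```

Unlike soft thresholding, this hard projection introduces no shrinkage on the retained entries, which is the behaviour the distance-penalized formulation exploits.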

【14】 Fairness without the sensitive attribute via Causal Variational Autoencoder
Link: https://arxiv.org/abs/2109.04999

Authors: Vincent Grari, Sylvain Lamprier, Marcin Detyniecki (Sorbonne Université, LIP6, CNRS, Paris, France; AXA, Paris, France; Polish Academy of Sciences, IBS PAN, Warsaw, Poland)
Note: 8 pages, 9 figures
Abstract: In recent years, most fairness strategies in machine learning models focus on mitigating unwanted biases by assuming that the sensitive information is observed. However, this is not always possible in practice. Due to privacy purposes and various regulations such as the GDPR (RGPD) in the EU, many personal sensitive attributes are frequently not collected. We notice a lack of approaches for mitigating bias in such difficult settings, in particular for achieving classical fairness objectives such as Demographic Parity and Equalized Odds. By leveraging recent developments in approximate inference, we propose an approach to fill this gap. Based on a causal graph, we rely on a new variational-autoencoder-based framework named SRCVAE to infer a sensitive information proxy, which serves for bias mitigation in an adversarial fairness approach. We empirically demonstrate significant improvements over existing works in the field. We observe that the generated proxy's latent space recovers sensitive information, and that our approach achieves higher accuracy while obtaining the same level of fairness on two real datasets, as measured using common fairness definitions.

【15】 A Neural Tangent Kernel Perspective of Infinite Tree Ensembles
Link: https://arxiv.org/abs/2109.04983

Authors: Ryuichi Kanoh, Mahito Sugiyama (National Institute of Informatics; The Graduate University for Advanced Studies, SOKENDAI)
Abstract: In practical situations, the ensemble tree model is one of the most popular models along with neural networks. A soft tree is a variant of a decision tree. Instead of using a greedy method for searching splitting rules, the soft tree is trained using a gradient method in which the whole splitting operation is formulated in a differentiable form. Although ensembles of such soft trees have been increasingly used in recent years, little theoretical work has been done towards understanding their behavior. In this paper, by considering an ensemble of infinite soft trees, we introduce and study the Tree Neural Tangent Kernel (TNTK), which provides new insights into the behavior of the infinite ensemble of soft trees. Using the TNTK, we succeed in theoretically finding several non-trivial properties, such as the effect of the oblivious tree structure and the degeneracy of the TNTK induced by the deepening of the trees. Moreover, we empirically examine the performance of an ensemble of infinite soft trees using the TNTK.
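The differentiable splitting that distinguishes a soft tree from a hard one is easy to sketch: each internal node routes the input left with probability given by a sigmoid gate, and a prediction is the leaf-value average weighted by the probability of reaching each leaf. A minimal depth-2 example (an illustrative routing scheme, not the paper's exact parameterization):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_tree_predict(x, W, b, leaf_values):
    """Prediction of a depth-2 soft (perfect binary) tree.

    W[m], b[m] parameterize internal node m; every routing decision is a
    sigmoid, so the whole prediction is differentiable in (W, b, leaves).
    """
    p0 = sigmoid(x @ W[0] + b[0])              # root: P(go left)
    p1 = sigmoid(x @ W[1] + b[1])              # left child
    p2 = sigmoid(x @ W[2] + b[2])              # right child
    leaf_probs = np.array([p0 * p1, p0 * (1 - p1),
                           (1 - p0) * p2, (1 - p0) * (1 - p2)])
    return leaf_probs @ leaf_values
```

With neutral gates every leaf is reached with probability 1/4; as the gates saturate the model approaches a hard decision tree. The TNTK describes the training dynamics of infinitely many such trees.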

【16】 Inverse design of 3d molecular structures with conditional generative neural networks
Link: https://arxiv.org/abs/2109.04824

Authors: Niklas W. A. Gebauer, Michael Gastegger, Stefaan S. P. Hessmann, Klaus-Robert Müller, Kristof T. Schütt (Machine Learning Group, Technische Universität Berlin, Germany)
Abstract: The rational design of molecules with desired properties is a long-standing challenge in chemistry. Generative neural networks have emerged as a powerful approach to sample novel molecules from a learned distribution. Here, we propose a conditional generative neural network for 3d molecular structures with specified structural and chemical properties. This approach is agnostic to chemical bonding and enables targeted sampling of novel molecules from conditional distributions, even in domains where reference calculations are sparse. We demonstrate the utility of our method for inverse design by generating molecules with specified composition or motifs, discovering particularly stable molecules, and jointly targeting multiple electronic properties beyond the training regime.

【17】 Projected State-action Balancing Weights for Offline Reinforcement Learning
Link: https://arxiv.org/abs/2109.04640

Authors: Jiayi Wang, Zhengling Qi, Raymond K. W. Wong (Texas A&M University; George Washington University)
Abstract: Offline policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights, and show that the proposed value estimator is semi-parametric efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points at each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we make a first attempt towards characterizing the difficulty of OPE problems, which may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.

【18】 A Fast PC Algorithm with Reversed-order Pruning and A Parallelization Strategy
Link: https://arxiv.org/abs/2109.04626

Authors: Kai Zhang, Chao Tian, Kun Zhang, Todd Johnson, Xiaoqian Jiang
Affiliations: Texas A&M University
Note: 37 pages
Abstract: The PC algorithm is the state-of-the-art algorithm for causal structure discovery from observational data. It can be computationally expensive in the worst case because the conditional independence tests are performed in an exhaustive-search manner. This makes the algorithm computationally intractable when the task contains several hundred or thousand nodes, particularly when the true underlying causal graph is dense. We make the critical observation that the conditioning set rendering two nodes independent is non-unique, and that including certain redundant nodes does not sacrifice result accuracy. Based on this finding, the innovations of our work are twofold. First, we devise a reversed-order linkage pruning PC algorithm that significantly increases the algorithm's efficiency. Second, we propose a parallel computing strategy for the statistical independence tests by leveraging tensor computation, which brings further speedup. We also prove that the proposed algorithm does not induce a loss of statistical power under mild graph and data dimensionality assumptions. Experimental results show that the single-threaded version of the proposed algorithm achieves a 6-fold speedup over the PC algorithm on a dense 95-node graph, and the parallel version achieves an 825-fold speedup. We also prove that the proposed algorithm is consistent under the same set of conditions as the conventional PC algorithm.
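For context, the exhaustive skeleton-pruning loop that the paper accelerates can be sketched as follows. This is the textbook PC-style search, not the reversed-order variant proposed in the paper; `ci_test` stands in for a statistical conditional-independence test and is supplied here as a toy oracle for a chain a - b - c.

```python
import itertools

# Sketch of PC-style skeleton pruning: for each adjacent pair (x, y),
# search conditioning sets of growing size among x's other neighbours
# and delete the edge x-y as soon as some set renders x and y
# independent. `ci_test(x, y, cond)` is an assumed oracle returning
# True when x is independent of y given cond.
def prune_skeleton(nodes, adj, ci_test, max_cond=2):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    for size in range(max_cond + 1):
        for x in nodes:
            for y in list(adj[x]):
                others = adj[x] - {y}
                if len(others) < size:
                    continue
                for cond in itertools.combinations(sorted(others), size):
                    if ci_test(x, y, cond):
                        adj[x].discard(y)   # remove edge in both directions
                        adj[y].discard(x)
                        break
    return adj

# Toy chain a - b - c: a and c are independent given b, so the
# spurious edge a-c is pruned while a-b and b-c survive.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
ci = lambda x, y, cond: {x, y} == {"a", "c"} and "b" in cond
pruned = prune_skeleton(["a", "b", "c"], adj, ci)
```

The worst-case cost of this loop is exponential in the neighbourhood size, which is why the paper's observation that conditioning sets are non-unique (and may contain redundant nodes) opens room for a faster search order.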

【19】 ReLU Regression with Massart Noise
Link: https://arxiv.org/abs/2109.04623

Authors: Ilias Diakonikolas, Jongho Park, Christos Tzamos
Affiliations: University of Wisconsin-Madison
Abstract: We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is at most $\eta < 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.
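To make the problem concrete, here is a plain least-squares gradient-descent fit of the model $y \approx \max(0, \langle w, x \rangle)$ on clean, realizable data. This baseline is emphatically not the paper's Massart-robust algorithm (it is the kind of naive $\ell_2$ fit the paper outperforms under adversarial noise); it only illustrates the regression model being studied.

```python
import numpy as np

# Naive ReLU regression: minimize the squared loss
#   L(w) = mean_i (max(0, <w, x_i>) - y_i)^2 / 2
# by gradient descent. The gradient is zero wherever the ReLU is
# inactive, so only points with positive pre-activation contribute.
def fit_relu(X, y, lr=0.1, steps=500):
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])
    for _ in range(steps):
        pred = np.maximum(0.0, X @ w)
        grad = X.T @ ((pred - y) * (pred > 0)) / len(y)
        w -= lr * grad
    return w

# Noiseless, realizable toy data drawn from a ground-truth ReLU.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.maximum(0.0, X @ w_true)
w_hat = fit_relu(X, y)
```

In the realizable setting this simple fit typically drives the training error well below that of the trivial all-zeros predictor; the paper's contribution is recovering $w$ exactly even when up to an $\eta$ fraction of labels are adversarially flipped.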

【20】 Bootstrapped Meta-Learning
Link: https://arxiv.org/abs/2109.04504

Authors: Sebastian Flennerhag, Yannick Schroecker, Tom Zahavy, Hado van Hasselt, David Silver, Satinder Singh
Affiliations: DeepMind
Note: 31 pages, 19 figures, 7 tables
Abstract: Meta-learning empowers artificial intelligence to increase its efficiency by learning how to learn. Unlocking this potential involves overcoming a challenging meta-optimisation problem that often exhibits ill-conditioning and myopic meta-objectives. We propose an algorithm that tackles these issues by letting the meta-learner teach itself. The algorithm first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance to that target under a chosen (pseudo-)metric. Focusing on meta-learning with gradients, we establish conditions that guarantee performance improvements and show that the improvement is related to the target distance. Thus, by controlling curvature, the distance measure can be used to ease meta-optimisation, for instance by reducing ill-conditioning. Further, the bootstrapping mechanism can extend the effective meta-learning horizon without requiring backpropagation through all updates. The algorithm is versatile and easy to implement. We achieve a new state of the art for model-free agents on the Atari ALE benchmark, improve upon MAML in few-shot learning, and demonstrate how our approach opens up new possibilities by meta-learning efficient exploration in a Q-learning agent.
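The bootstrap-then-match recipe can be caricatured on a one-dimensional problem. In this hypothetical sketch the "meta-learner" is just a scalar step size eta for the inner objective f(theta) = theta^2: a target is bootstrapped by running L extra inner steps, the target is held fixed (gradients stopped, trivially so here since a finite difference is used), and eta is nudged to shrink the squared distance to the target. This is a cartoon of the mechanism, not the paper's algorithm.

```python
# Toy bootstrapped meta-objective. Inner learner: gradient descent on
# f(theta) = theta^2, whose gradient is 2 * theta.
def inner_steps(theta, eta, k):
    for _ in range(k):
        theta = theta - eta * 2.0 * theta
    return theta

def meta_step(eta, theta0=1.0, K=3, L=2, meta_lr=0.05, eps=1e-4):
    # Bootstrap a target by running K + L inner steps; the target is
    # treated as a constant below (only the K-step trajectory depends
    # on the perturbed step size).
    target = inner_steps(theta0, eta, K + L)
    dist = lambda e: (inner_steps(theta0, e, K) - target) ** 2
    # Central finite difference stands in for the meta-gradient.
    grad = (dist(eta + eps) - dist(eta - eps)) / (2 * eps)
    return eta - meta_lr * grad

eta = 0.1
for _ in range(50):
    eta = meta_step(eta)
# On this toy quadratic, matching the bootstrapped target keeps
# increasing eta, i.e. the meta-learner teaches itself to converge
# faster, without differentiating through all K + L updates.
```

The toy preserves one key property of the method: the meta-update only backpropagates (here, finite-differences) through K steps, while the target encodes the effect of a longer L-step horizon.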

This article is part of the Tencent Cloud self-media sync exposure program, shared from the WeChat official account arXiv每日学术速递.
Originally published: 2021-09-13. For copyright concerns, please contact cloudcommunity@tencent.com for removal.

