Statistics arXiv Digest [12.23]

WeChat Official Account: arXiv Daily Academic Digest

Published 2021-12-27 17:02:32

stat (Statistics): 26 papers in total

【1】 Predicting treatment effects from observational studies using machine learning methods: A simulation study
Link: https://arxiv.org/abs/2112.12083

Authors: Bevan I. Smith, Charles Chimedza
Affiliations: School of Statistics and Actuarial Science, University of the Witwatersrand, Johannesburg
Abstract: Measuring treatment effects in observational studies is challenging because of confounding bias. Confounding occurs when a variable affects both the treatment and the outcome. Traditional methods such as propensity score matching estimate treatment effects by conditioning on the confounders. Recent literature has presented new methods that use machine learning to predict the counterfactuals in observational studies, which then allow for estimating treatment effects. These studies, however, have been applied to real world data where the true treatment effects have not been known. This study aimed to study the effectiveness of this counterfactual prediction method by simulating two main scenarios: with and without confounding. Each type also included linear and non-linear relationships between input and output data. The key item in the simulations was that we generated known true causal effects. Linear regression, lasso regression and random forest models were used to predict the counterfactuals and treatment effects. These were compared with the true treatment effect as well as a naive treatment effect. The results show that the most important factor in whether this machine learning method performs well is the degree of non-linearity in the data. Surprisingly, for both non-confounding and confounding, the machine learning models all performed well on the linear dataset. However, when non-linearity was introduced, the models performed very poorly. Therefore, under the conditions of this simulation study, the machine learning method performs well under conditions of linearity, even if confounding is present, but at this stage should not be trusted when non-linearity is introduced.
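The counterfactual prediction idea in this abstract can be illustrated with a small simulation. The sketch below is not the authors' code: the data-generating process, the coefficient values, and the function names (`simulate`, `naive_effect`, `counterfactual_effect`) are all assumptions chosen for illustration. It covers only the linear, confounded case with an ordinary least squares model.

```python
import numpy as np

def simulate(n=5000, true_effect=2.0, seed=0):
    """Hypothetical linear data with one confounder x (illustrative values only)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                      # confounder
    p = 1.0 / (1.0 + np.exp(-2.0 * x))          # x raises the chance of treatment
    t = (rng.random(n) < p).astype(float)       # binary treatment
    y = true_effect * t + 3.0 * x + rng.normal(scale=0.5, size=n)  # x also drives y
    return x, t, y

def naive_effect(t, y):
    """Difference in mean outcomes, ignoring the confounder."""
    return float(y[t == 1].mean() - y[t == 0].mean())

def counterfactual_effect(x, t, y):
    """Fit y ~ 1 + t + x by OLS, then predict both potential outcomes for everyone."""
    design = np.column_stack([np.ones_like(x), t, x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    treated = np.column_stack([np.ones_like(x), np.ones_like(x), x]) @ beta
    control = np.column_stack([np.ones_like(x), np.zeros_like(x), x]) @ beta
    return float((treated - control).mean())

x, t, y = simulate()
naive = naive_effect(t, y)
estimate = counterfactual_effect(x, t, y)
print(f"true=2.0 naive={naive:.2f} counterfactual={estimate:.2f}")
```

In this assumed setup the naive difference is inflated by the confounder, while the model-based estimate stays close to the simulated true effect, which matches the abstract's finding for linear data. Swapping in a non-linear outcome function would be the natural way to probe the failure case the abstract reports.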

【2】 Tutorial on principal component analysis, with applications in R
Link: https://arxiv.org/abs/2112.12067

Authors: Henk van Elst
Affiliations: parcIT GmbH, Erftstraße, Köln, Germany
Comments: 37 pages, 6 *.png figures, 10 *.pdf figures, LaTeX2e, hyperlinked references
Abstract: This tutorial reviews the main steps of the principal component analysis of a multivariate data set and its subsequent dimensional reduction on the grounds of identified dominant principal components. The underlying computations are demonstrated and performed by means of a script written in the statistical software package R.
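The main steps of a principal component analysis (centering, decomposition, ranking components by explained variance, projection) can be sketched in a few lines. The tutorial itself uses R; the version below is a Python/NumPy sketch written for this digest, and the function name `pca` and the synthetic data are assumptions, not material from the tutorial.

```python
import numpy as np

def pca(data, n_components):
    """PCA via singular value decomposition of the centered data matrix."""
    centered = data - data.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values**2 / (data.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    components = vt[:n_components]           # rows are principal directions
    scores = centered @ components.T         # data projected onto those directions
    return scores, components, ratio

# Synthetic example: three variables driven by one latent factor plus small noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
data = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

scores, components, ratio = pca(data, n_components=1)
print("explained variance ratio:", np.round(ratio, 3))
```

Because the synthetic data have a single dominant direction, keeping one component retains almost all of the variance, which is the kind of dimensional reduction the abstract describes.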

【3】 Driver Behavior Post Cannabis Consumption: A Driving Simulator Study in Collaboration with Montgomery County Maryland
Link: https://arxiv.org/abs/2112.12026

Authors: Snehanshu Banerjee, Nashid K Khadem, Md Muhib Kabir, Mansoureh Jeihani
Affiliations: Omni Strategy LLC, Baltimore, Maryland, United States; Department of Transportation and Urban Infrastructure Studies, Morgan State University, Baltimore, Maryland, United States
Comments: 3 Tables & 8 Figures
Abstract: Montgomery County Police Department in Maryland hosted a cannabis intoxication impaired driving lab to evaluate the driving behavior of medical marijuana users pre and post cannabis consumption. A portable driving simulator with an eye-tracking device was used, where ten participants drove a virtual network in two different events: a traffic light changing from green to yellow and the sudden appearance of a jaywalking pedestrian. An accelerated failure time (AFT) model was used to calculate the speed reduction times in both scenarios. The results showed that the participants' speed reduction times are lower, i.e., they brake harder, post cannabis consumption when they encounter a change in traffic light, whereas there is no statistical difference in speed reduction times pre and post cannabis consumption when they encounter a jaywalking pedestrian. The gaze analysis finds no significant difference in eye gaze pre and post cannabis consumption in both events.

【4】 Refining Independent Component Selection via Local Projection Pursuit
Link: https://arxiv.org/abs/2112.11998

Authors: Lutz Duembgen, Katrin Gysel, Fabrice Perler
Affiliations: University of Bern
Abstract: Independent component selection (ICS), introduced by Tyler et al. (2009, JRSS B), is a powerful tool to find potentially interesting projections of multivariate data. In some cases, some of the projections proposed by ICS come close to really interesting ones, but small deviations can result in a blurred view which does not reveal the feature (e.g. a clustering) which would otherwise be clearly visible. To remedy this problem, we propose an automated and localized version of projection pursuit (PP), cf. Huber (1985, Ann. Statist.). Precisely, our local search is based on gradient descent applied to estimated differential entropy as a function of the projection matrix.

【5】 Efficient Multifidelity Likelihood-Free Bayesian Inference with Adaptive Computational Resource Allocation
Link: https://arxiv.org/abs/2112.11971

Authors: Thomas P Prescott, David J Warne, Ruth E Baker
Affiliations: Alan Turing Institute, London, United Kingdom; QUT Centre for Data Science, Queensland University of Technology, Brisbane, QLD, Australia; Wolfson Centre for Mathematical Biology, Mathematical Institute, University of Oxford, Oxford, United Kingdom
Abstract: Likelihood-free Bayesian inference algorithms are popular methods for calibrating the parameters of complex, stochastic models, required when the likelihood of the observed data is intractable. These algorithms characteristically rely heavily on repeated model simulations. However, whenever the computational cost of simulation is even moderately expensive, the significant burden incurred by likelihood-free algorithms leaves them unviable in many practical applications. The multifidelity approach has been introduced (originally in the context of approximate Bayesian computation) to reduce the simulation burden of likelihood-free inference without loss of accuracy, by using the information provided by simulating computationally cheap, approximate models in place of the model of interest. The first contribution of this work is to demonstrate that multifidelity techniques can be applied in the general likelihood-free Bayesian inference setting. Analytical results on the optimal allocation of computational resources to simulations at different levels of fidelity are derived, and subsequently implemented practically. We provide an adaptive multifidelity likelihood-free inference algorithm that learns the relationships between models at different fidelities and adapts resource allocation accordingly, and demonstrate that this algorithm produces posterior estimates with near-optimal efficiency.

【6】 Conditional Drug Approval with the Harmonic Mean Chi-Squared Test
Link: https://arxiv.org/abs/2112.11898

Authors: Manja Deforth, Charlotte Micheloud, Kit C Roes, Leonhard Held
Affiliations: Department of Biostatistics at the Epidemiology, Biostatistics and Prevention Institute (EBPI), and Center for Reproducible Science (CRS), University of Zurich, Zurich, Switzerland
Abstract: Approval of treatments in areas of high medical need may not follow the two-trials paradigm, but might be granted under conditional approval. Under conditional approval, the evidence for a treatment effect from a pre-market clinical trial has to be substantiated in an independent post-market clinical trial or a longer follow-up duration. Several ways exist to quantify the overall evidence provided by the two trials. We study the applicability of the recently developed harmonic mean Chi-squared test to the conditional drug approval framework. The proposed approach can be used both for the design of the post-market trial and the analysis of the combined evidence provided by both trials. Other methods considered are the two-trials rule, Fisher's criterion and Stouffer's method. For illustration, we apply the harmonic mean Chi-squared test to a drug which received conditional (and eventually final) market licensing by the EMA. A simulation study is conducted to study the operating characteristics in more detail. We finally investigate the applicability of the harmonic mean Chi-squared test to compute the power at interim of an ongoing post-market trial. In contrast to some of the traditional methods, the harmonic mean Chi-squared test always requires a post-market clinical trial. In addition, if the p-value from the pre-market clinical trial is << 0.025, a smaller sample size for the post-market clinical trial is needed than with the two-trials rule.
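The three comparison methods named in the abstract (the two-trials rule, Fisher's criterion, and Stouffer's method) are standard ways of combining evidence from two trials, and a short sketch may make the contrast concrete. The code below uses textbook definitions with one-sided p-values; the example p-values and function names are assumptions for illustration, and the harmonic mean Chi-squared test itself is deliberately not implemented here.

```python
import math
from statistics import NormalDist

def two_trials_rule(p_values, alpha=0.025):
    """Success only if every trial is individually significant (one-sided alpha)."""
    return all(p <= alpha for p in p_values)

def fisher_combined(p_values):
    """Fisher's method: -2 * sum(log p) follows a chi-squared law with 2n d.o.f.

    For even degrees of freedom the survival function has a closed form,
    so no external statistics library is needed.
    """
    n = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)   # statistic divided by 2
    return math.exp(-half_stat) * sum(half_stat**k / math.factorial(k) for k in range(n))

def stouffer_combined(p_values):
    """Stouffer's method: average the z-scores, rescaled by sqrt(n)."""
    normal = NormalDist()
    z = sum(normal.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1.0 - normal.cdf(z)

# Hypothetical one-sided p-values for a pre-market and a post-market trial.
p_pre, p_post = 0.01, 0.04
passes = two_trials_rule([p_pre, p_post])
p_fisher = fisher_combined([p_pre, p_post])
p_stouffer = stouffer_combined([p_pre, p_post])
print(passes, round(p_fisher, 5), round(p_stouffer, 5))
```

With these made-up inputs the two-trials rule fails because the second trial misses the 0.025 threshold, while both pooling methods return a small combined p-value. That difference in behaviour is the sort of operating characteristic the abstract's simulation study compares.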

【7】 Bag of DAGs: Flexible & Scalable Modeling of Spatiotemporal Dependence
Link: https://arxiv.org/abs/2112.11870

Authors: Bora Jin, Michele Peruzzi, David B. Dunson
Affiliations: Duke University, Durham, USA
Abstract: We propose a computationally efficient approach to construct a class of nonstationary spatiotemporal processes in point-referenced geostatistical models. Current methods that impose nonstationarity directly on covariance functions of Gaussian processes (GPs) often suffer from computational bottlenecks, causing researchers to choose less appropriate alternatives in many applications. A main contribution of this paper is the development of a well-defined nonstationary process using multiple yet simple directed acyclic graphs (DAGs), which leads to computational efficiency, flexibility, and interpretability. Rather than acting on the covariance functions, we induce nonstationarity via sparse DAGs across domain partitions, whose edges are interpreted as directional correlation patterns in space and time. We account for uncertainty about these patterns by considering local mixtures of DAGs, leading to a "bag of DAGs" approach. We are motivated by spatiotemporal modeling of air pollutants in which a directed edge in DAGs represents a prevailing wind direction causing some associated covariance in the pollutants; for example, an edge for northwest to southeast winds. We establish Bayesian hierarchical models embedding the resulting nonstationary process from the bag of DAGs approach and illustrate inferential and performance gains of the methods compared to existing alternatives. We consider a novel application focusing on the analysis of fine particulate matter (PM2.5) in South Korea and the United States. The code for all analyses is publicly available on GitHub.

【8】 Omnibus goodness-of-fit tests for count distributions
Link: https://arxiv.org/abs/2112.11861

Authors: Antonio Di Noia, Lucio Barabesi, Marzia Marcheselli, Caterina Pisani, Luca Pratelli
Affiliations: Department of Economics and Statistics, University of Siena, Piazza S. Francesco, Siena, Italy; Naval Academy, Viale Italia, Livorno, Italy
Abstract: A consistent omnibus goodness-of-fit test for count distributions is proposed. The test is of wide applicability since any count distribution indexed by a $k$-variate parameter with finite moment of order $2k$ can be considered under the null hypothesis. The test statistic is based on the probability generating function and, in addition to having a rather simple form, it is asymptotically normally distributed, allowing a straightforward implementation of the test. The finite-sample properties of the test are investigated by means of an extensive simulation study, where the empirical power is evaluated against some common alternative distributions and against contiguous alternatives. The test shows an empirical significance level always very close to the nominal one already for moderate sample sizes, and the empirical power is rather satisfactory, also compared to that of the chi-squared test.

【9】 Investigating the 'old boy network' using latent space models
Link: https://arxiv.org/abs/2112.11857

Authors: Ian Hamilton
Affiliations: University of Warwick
Abstract: This paper investigates the nature of institutional ties between a group of English schools, including a large proportion of private schools that might be thought of as contributing to the 'old boy network'. The analysis is based on a network of bilaterally-determined school rugby union fixtures. The primary importance of geographical proximity in the determination of these fixtures supplies a spatial 'ground truth' against which the performance of models is assessed. A Bayesian fitting of the latent position cluster model is found to provide the best fit of the models examined. This is used to demonstrate a variety of methods that together provide a consistent and nuanced interpretation of the factors influencing community and edge formation in the network. The influence of homophily in fees and the proportion of boarders is identified as notable, with evidence that this is driven by a community of schools having the highest proportion of boarders and charging the highest fees, suggestive of the existence and nature of an 'old boy network' at an institutional level.

【10】 COVID-19-Associated Orphanhood and Caregiver Death in the United States
Link: https://arxiv.org/abs/2112.11777

Authors: Susan D. Hillis, Alexandra Blenkinsop, Andrés Villaveces, Francis B. Annor, Leandris Liburd, Greta M. Massetti, Zewditu Demissie, James A. Mercy, Charles A. Nelson III, Lucie Cluver, Seth Flaxman, Lorraine Sherr, Christl A. Donnelly, Oliver Ratmann, H. Juliette T. Unwin
Affiliations: Centers for Disease Control and Prevention, Atlanta, Georgia; Department of ...
Abstract: Background: Most COVID-19 deaths occur among adults, not children, and attention has focused on mitigating COVID-19 burden among adults. However, a tragic consequence of adult deaths is that high numbers of children might lose their parents and caregivers to COVID-19-associated deaths. Methods: We quantified COVID-19-associated caregiver loss and orphanhood in the US and for each state using fertility and excess and COVID-19 mortality data. We assessed burden and rates of COVID-19-associated orphanhood and deaths of custodial and co-residing grandparents, overall and by race/ethnicity. We further examined variations in COVID-19-associated orphanhood by race/ethnicity for each state. Results: We found that from April 1, 2020 through June 30, 2021, over 140,000 children in the US experienced the death of a parent or grandparent caregiver. The risk of such loss was 1.1 to 4.5 times higher among children of racial and ethnic minorities, compared to Non-Hispanic White children. The highest burden of COVID-19-associated death of parents and caregivers occurred in Southern border states for Hispanic children, Southeastern states for Black children, and in states with tribal areas for American Indian/Alaska Native populations. Conclusions: We found substantial disparities in distributions of COVID-19-associated death of parents and caregivers across racial and ethnic groups. Children losing caregivers to COVID-19 need care and safe, stable, and nurturing families with economic support, quality childcare and evidence-based parenting support programs. There is an urgent need to mount an evidence-based comprehensive response focused on those children at greatest risk, in the states most affected.

【11】 Robust learning of data anomalies with analytically-solvable entropic outlier sparsification
Link: https://arxiv.org/abs/2112.11768

Authors: Illia Horenko
Affiliations: Università della Svizzera Italiana (USI), Institute of Computing, Via G. Buffi, Lugano, Switzerland
Comments: 9 pages, 1 figure
Abstract: Entropic Outlier Sparsification (EOS) is proposed as a robust computational strategy for the detection of data anomalies in a broad class of learning methods, including unsupervised problems (like detection of non-Gaussian outliers in mostly-Gaussian data) and supervised learning with mislabeled data. EOS builds on the derived analytic closed-form solution of the (weighted) expected error minimization problem subject to the Shannon entropy regularization. In contrast to common regularization strategies requiring computational costs that scale polynomially with the data dimension, the identified closed-form solution is proven to impose additional iteration costs that depend linearly on statistics size and are independent of data dimension. The obtained analytic results also explain why the mixtures of spherically-symmetric Gaussians - used heuristically in many popular data analysis algorithms - represent an optimal choice for the non-parametric probability distributions when working with squared Euclidean distances, combining expected error minimality, maximal entropy/unbiasedness, and a linear cost scaling. The performance of EOS is compared to a range of commonly-used tools on synthetic problems and on partially-mislabeled supervised classification problems from biomedicine.
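To give a feel for what a closed-form solution of an entropy-regularized weighted error problem can look like, here is a generic sketch. It assumes the problem is to choose weights $w$ on the probability simplex that minimize $\sum_i w_i e_i + \varepsilon \sum_i w_i \log w_i$ for fixed per-sample errors $e_i$; the minimizer of that problem is the softmax $w_i \propto \exp(-e_i/\varepsilon)$. Whether this matches the exact formulation in the paper is an assumption, and the function name `entropic_weights` and the sample errors are made up for illustration.

```python
import numpy as np

def entropic_weights(errors, eps):
    """Closed-form minimizer of sum(w * e) + eps * sum(w * log w) on the simplex."""
    scaled = -np.asarray(errors, dtype=float) / eps
    scaled -= scaled.max()            # shift for numerical stability; weights unchanged
    weights = np.exp(scaled)
    return weights / weights.sum()

# Hypothetical per-sample errors: the last point fits the model far worse.
errors = np.array([0.1, 0.2, 0.15, 5.0])
weights = entropic_weights(errors, eps=0.5)
print(np.round(weights, 5))
```

The cost of this step is linear in the number of samples and does not involve the data dimension at all, since only the scalar errors enter. Points with large error receive weights close to zero, which is one way to read the "sparsification" of outliers.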

【12】 A Comparison of Bayesian Inference Techniques for Sparse Factor Analysis
Link: https://arxiv.org/abs/2112.11719

Authors: Yong See Foo, Heejung Shim
Affiliations: University of Melbourne, Parkville, VIC, Australia; School of Mathematics and Statistics and Melbourne Integrative Genomics
Comments: 16 pages, 4 figures
Abstract: Dimension reduction algorithms aim to discover latent variables which describe underlying structures in high-dimensional data. Methods such as factor analysis and principal component analysis have the downside of not offering much interpretability of their inferred latent variables. Sparse factor analysis addresses this issue by imposing sparsity on its factor loadings, allowing each latent variable to be related to only a subset of features, thus increasing interpretability. Sparse factor analysis has been used in a wide range of areas including genomics, signal processing, and economics. We compare two Bayesian inference techniques for sparse factor analysis, namely Markov chain Monte Carlo (MCMC) and variational inference (VI). VI is computationally faster than MCMC, at the cost of a loss in accuracy. We derive MCMC and VI algorithms and perform a comparison using both simulated and biological data, demonstrating that the higher computational efficiency of VI is desirable over the small gain in accuracy when using MCMC. Our implementation of MCMC and VI algorithms for sparse factor analysis is available at https://github.com/ysfoo/sparsefactor.

【13】 Partial recovery and weak consistency in the non-uniform hypergraph Stochastic Block Model
Link: https://arxiv.org/abs/2112.11671

Authors: Ioana Dumitriu, Haixiao Wang, Yizhe Zhu
Comments: 40 pages, 4 figures
Abstract: We consider the community detection problem in sparse random hypergraphs under the non-uniform hypergraph stochastic block model (HSBM), a general model of random networks with community structure and higher-order interactions. When the random hypergraph has bounded expected degrees, we provide a spectral algorithm that outputs a partition with at least a $\gamma$ fraction of the vertices classified correctly, where $\gamma\in (0.5,1)$ depends on the signal-to-noise ratio (SNR) of the model. When the SNR grows slowly as the number of vertices goes to infinity, our algorithm achieves weak consistency, which improves the previous results in Ghoshdastidar and Dukkipati (2017) for non-uniform HSBMs. Our spectral algorithm consists of three major steps: (1) Hyperedge selection: select hyperedges of certain sizes to provide the maximal signal-to-noise ratio for the induced sub-hypergraph; (2) Spectral partition: construct a regularized adjacency matrix and obtain an approximate partition based on singular vectors; (3) Correction and merging: incorporate the hyperedge information from adjacency tensors to upgrade the error rate guarantee. The theoretical analysis of our algorithm relies on the concentration and regularization of the adjacency matrix for sparse non-uniform random hypergraphs, which can be of independent interest.

【14】 Local permutation tests for conditional independence
Link: https://arxiv.org/abs/2112.11666

Authors: Ilmun Kim, Matey Neykov, Sivaraman Balakrishnan, Larry Wasserman
Affiliations: Department of Statistics and Data Science, Carnegie Mellon University; Department of Statistics and Data Science, Yonsei University
Abstract: In this paper, we investigate local permutation tests for testing conditional independence between two random vectors $X$ and $Y$ given $Z$. The local permutation test determines the significance of a test statistic by locally shuffling samples which share similar values of the conditioning variables $Z$, and it forms a natural extension of the usual permutation approach for unconditional independence testing. Despite its simplicity and empirical support, the theoretical underpinnings of the local permutation test remain unclear. Motivated by this gap, this paper aims to establish theoretical foundations of local permutation tests with a particular focus on binning-based statistics. We start by revisiting the hardness of conditional independence testing and provide an upper bound for the power of any valid conditional independence test, which holds when the probability of observing collisions in $Z$ is small. This negative result naturally motivates us to impose additional restrictions on the possible distributions under the null and alternative. To this end, we focus our attention on certain classes of smooth distributions and identify provably tight conditions under which the local permutation method is universally valid, i.e. it is valid when applied to any (binning-based) test statistic. To complement this result on type I error control, we also show that in some cases, a binning-based statistic calibrated via the local permutation method can achieve minimax optimal power. We also introduce a double-binning permutation strategy, which yields a valid test over less smooth null distributions than the typical single-binning method without compromising much power. Finally, we present simulation results to support our theoretical findings.
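The local shuffling procedure described in the abstract can be sketched for a scalar $Z$ with a single-binning scheme. This is an illustrative sketch rather than the authors' implementation: the quantile-based bins, the covariance-style statistic, the parameter defaults, and the function name `local_permutation_test` are all assumptions made here.

```python
import numpy as np

def local_permutation_test(x, y, z, n_bins=10, n_perm=500, seed=0):
    """Permutation p-value for X independent of Y given Z, shuffling Y within bins of Z."""
    rng = np.random.default_rng(seed)
    # Quantile bins so that each bin holds a similar number of samples.
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1)[1:-1])
    labels = np.digitize(z, edges)
    groups = [np.flatnonzero(labels == b) for b in np.unique(labels)]

    def statistic(y_values):
        # Sum over bins of the absolute within-bin cross-product of centered x and y.
        total = 0.0
        for idx in groups:
            xc = x[idx] - x[idx].mean()
            yc = y_values[idx] - y_values[idx].mean()
            total += abs(np.sum(xc * yc))
        return total / len(x)

    observed = statistic(y)
    exceed = 0
    for _ in range(n_perm):
        y_perm = y.copy()
        for idx in groups:
            y_perm[idx] = y[rng.permutation(idx)]   # shuffle only inside the bin
        exceed += statistic(y_perm) >= observed
    return (1 + exceed) / (1 + n_perm)

# Synthetic check: x and y_dep share the noise term e, so they are dependent given z.
rng = np.random.default_rng(3)
n = 500
z = rng.uniform(size=n)
e = rng.normal(size=n)
x = z + e
y_dep = z + e + 0.3 * rng.normal(size=n)
y_ind = z + rng.normal(size=n)          # depends on x only through z
p_dep = local_permutation_test(x, y_dep, z)
p_ind = local_permutation_test(x, y_ind, z)
print(p_dep, p_ind)
```

Shuffling within bins keeps each permuted sample's $Z$ value close to the original, so the dependence of $Y$ on $Z$ is approximately preserved while any remaining link to $X$ is broken. How fine the bins must be for the test to remain valid is exactly the kind of question the paper's smoothness conditions address; the choice of ten bins here is arbitrary.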

【15】 Entropic Herding
Link: https://arxiv.org/abs/2112.11616

Authors: Hiroshi Yamashita, Hideyuki Suzuki, Kazuyuki Aihara
Affiliations: Intelligent Mobility Society Design, Social Cooperation Program, Graduate School of Information Science and Technology, the University of Tokyo, Hongo, Bunkyo-ku; Graduate School of Information Science and Technology, Osaka University
Comments: 28 pages, 12 figures
Abstract: Herding is a deterministic algorithm used to generate data points that can be regarded as random samples satisfying input moment conditions. The algorithm is based on the complex behavior of a high-dimensional dynamical system and is inspired by the maximum entropy principle of statistical inference. In this paper, we propose an extension of the herding algorithm, called entropic herding, which generates a sequence of distributions instead of points. Entropic herding is derived as the optimization of the target function obtained from the maximum entropy principle. Using the proposed entropic herding algorithm as a framework, we discuss a closer connection between herding and the maximum entropy principle. Specifically, we interpret the original herding algorithm as a tractable version of entropic herding, the ideal output distribution of which is mathematically represented. We further discuss how the complex behavior of the herding algorithm contributes to optimization. We argue that the proposed entropic herding algorithm extends the application of herding to probabilistic modeling. In contrast to original herding, entropic herding can generate a smooth distribution such that both efficient probability density calculation and sample generation become possible. To demonstrate the viability of these arguments in this study, numerical experiments were conducted, including a comparison with other conventional methods, on both synthetic and real data.
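For readers unfamiliar with the original herding algorithm that this paper extends, a minimal sketch on a finite state space follows. It uses the usual herding recursion (pick the state whose features best align with the weight vector, then update the weights by the target moments minus the chosen features). It does not implement entropic herding, and the one-hot features, the target moments, and the function name `herd` are assumptions for illustration.

```python
import numpy as np

def herd(features, target_moments, n_steps):
    """Original (point-valued) herding over a finite set of states.

    features[i] is the feature vector of state i; the generated sequence has
    empirical feature averages that approach target_moments.
    """
    w = target_moments.astype(float).copy()
    chosen = []
    for _ in range(n_steps):
        idx = int(np.argmax(features @ w))       # deterministic choice, no randomness
        chosen.append(idx)
        w += target_moments - features[idx]      # push toward moments not yet matched
    return np.array(chosen)

# One-hot features: the moment conditions are simply the state probabilities.
features = np.eye(3)
target = np.array([0.5, 0.3, 0.2])
samples = herd(features, target, n_steps=1000)
freq = np.bincount(samples, minlength=3) / len(samples)
print(freq)
```

Although the sequence is fully deterministic, its empirical frequencies track the target moments closely, which is the sense in which the abstract says herding output "can be regarded as random samples satisfying input moment conditions".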

【16】 Estimating the Marginal Effect of a Continuous Exposure on an Ordinal Outcome using Data Subject to Covariate-Driven Treatment and Visit Processes
Link: https://arxiv.org/abs/2112.11517

Authors: Janie Coulombe, Erica E M Moodie, Robert W Platt
Abstract: In the statistical literature, a number of methods have been proposed to ensure valid inference about marginal effects of variables on a longitudinal outcome in settings with irregular monitoring times. However, the potential biases due to covariate-driven monitoring times and confounding have rarely been considered simultaneously, and never in a setting with an ordinal outcome and a continuous exposure. In this work, we propose and demonstrate a methodology for causal inference in such a setting, relying on a proportional odds model to study the effect of the exposure on the outcome. Irregular observation times are considered via a proportional rate model, and a generalization of inverse probability of treatment weights is used to account for the continuous exposure. We motivate our methodology by the estimation of the marginal (causal) effect of the time spent on video or computer games on suicide attempts in the Add Health study, a longitudinal study in the United States. Although observation times are pre-specified in the Add Health data, our proposed approach is applicable even in more general settings, such as when analyzing data from electronic health records where observations are highly irregular. In simulation studies, we let observation times vary across individuals and demonstrate that not accounting for biasing imbalances due to the monitoring and the exposure schemes can bias the estimate for the marginal odds ratio of exposure.

【17】 Preserving data privacy when using multi-site data to estimate individualized treatment rules
Link: https://arxiv.org/abs/2112.11505

Authors: Coraline Danieli, Erica Moodie. Affiliation: McGill University, Department of Epidemiology, Biostatistics and Occupational Health. Funding: This work was supported by the Canadian Institutes of Health Research grant CIHR. Notes: Main manuscript: 26 pages, 2 tables, 4 figures; Supplementary Material: 24 pages, 5 tables and 3 figures. Abstract: Precision medicine is a rapidly expanding area of health research wherein patient level information is used to inform treatment decisions. A statistical framework helps to formalize the individualization of treatment decisions that characterize personalized management plans. Numerous methods have been proposed to estimate individualized treatment rules that optimize expected patient outcomes, many of which have desirable properties such as robustness to model misspecification. However, while individual data are essential in this context, there may be concerns about data confidentiality, particularly in multi-centre studies where data are shared externally. To address this issue, we compared two approaches to privacy preservation: (i) data pooling, which is a covariate microaggregation technique and (ii) distributed regression. These approaches were combined with the doubly robust yet user-friendly method of dynamic weighted ordinary least squares to estimate individualized treatment rules. In simulations, we extensively evaluated the performance of the methods in estimating the parameters of the decision rule under different assumptions. The results demonstrate that double robustness is not maintained in the data pooling setting and that this can result in bias, whereas the distributed regression provides good performance. We illustrate the methods via an analysis of optimal Warfarin dosing using data from the International Warfarin Consortium.
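
The idea behind distributed regression can be illustrated with a minimal sketch. It assumes plain (unweighted) least squares with a single covariate, which is simpler than the dynamic weighted ordinary least squares used in the paper; the function names `site_summary` and `combine` and the toy data are hypothetical. Each site shares only aggregate sums, and the coordinator recovers the same fit it would obtain from the pooled rows.

```python
# Hypothetical sketch: each site shares only aggregate sums, never patient rows.

def site_summary(xs, ys):
    """Sufficient statistics for a one-covariate least-squares fit."""
    return {
        "n": len(xs),
        "sx": sum(xs),
        "sy": sum(ys),
        "sxx": sum(x * x for x in xs),
        "sxy": sum(x * y for x, y in zip(xs, ys)),
    }


def combine(summaries):
    """Add the per-site summaries and solve the 2x2 normal equations."""
    keys = ("n", "sx", "sy", "sxx", "sxy")
    total = {k: sum(s[k] for s in summaries) for k in keys}
    n, sx, sy, sxx, sxy = (total[k] for k in keys)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Toy data held at two separate sites (made up for illustration).
site_a = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
site_b = ([4.0, 5.0], [8.1, 9.8])

intercept, slope = combine([site_summary(*site_a), site_summary(*site_b)])
```

Because the normal equations depend on the data only through these sums, the combined estimate matches the pooled-data estimate up to floating-point rounding, while no individual record leaves its site.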

【18】 Faster indicators of dengue fever case counts using Google and Twitter Link: https://arxiv.org/abs/2112.12101

Authors: Giovanni Mizzi, Tobias Preis, Leonardo Soares Bastos, Marcelo Ferreira da Costa Gomes, Claudia Torres Codeço, Helen Susannah Moat. Affiliations: Centre for Complexity Science, University of Warwick, Gibbet Hill Road, CV, Data Science Lab, Behavioural Science, Warwick Business School, University of Warwick, Scarman Road, CV,AL, Coventry, United Kingdom. Notes: 25 pages, 7 figures (3 in supplementary information). Abstract: Dengue is a major threat to public health in Brazil, the world's sixth biggest country by population, with over 1.5 million cases recorded in 2019 alone. Official data on dengue case counts is delivered incrementally and, for many reasons, often subject to delays of weeks. In contrast, data on dengue-related Google searches and Twitter messages is available in full with no delay. Here, we describe a model which uses online data to deliver improved weekly estimates of dengue incidence in Rio de Janeiro. We address a key shortcoming of previous online data disease surveillance models by explicitly accounting for the incremental delivery of case count data, to ensure that our approach can be used in practice. We also draw on data from Google Trends and Twitter in tandem, and demonstrate that this leads to slightly better estimates than a model using only one of these data streams. Our results provide evidence that online data can be used to improve both the accuracy and precision of rapid estimates of disease incidence, even where the underlying case count data is subject to long and varied delays.

【19】 Small deviation estimates for the largest eigenvalue of Wigner matrices Link: https://arxiv.org/abs/2112.12093

Authors: László Erdős, Yuanyuan Xu. Affiliation: IST Austria. Abstract: We establish precise right-tail small deviation estimates for the largest eigenvalue of real symmetric and complex Hermitian matrices whose entries are independent random variables with uniformly bounded moments. The proof relies on a Green function comparison along a continuous interpolating matrix flow for a long time. Less precise estimates are also obtained in the left tail.

【20】 Application of Opinion Dynamics Models in Energy Eco-Feedback Agent-Based Simulation Link: https://arxiv.org/abs/2112.12063

Authors: Mohammad Zarei, Mojtaba Maghrebi. Affiliations: Department of Civil Engineering, Ferdowsi University of Mashhad, Mashhad, Iran; School of Civil and Environmental Engineering, University of New South Wales, Sydney, Australia. Abstract: Research suggests that curbing consumer demand for energy through behavioural interventions is an essential component of efforts to reduce greenhouse gas emissions and climate change. On this ground, feedback interventions, which make energy consumption and conservation efforts visible to consumers, are considered a practical method for increasing energy-saving behaviours. Simulation techniques provide a convenient and economical tool to examine the factors that could affect the amount of energy saved as the outcome of such interventions. However, developing a reasonable model that correctly represents real-world processes is a big challenge. In this paper, five common Opinion Dynamics (OD) models, which represent how opinions change through interactions among individuals, are investigated, and a Revised OD (ROD) model is proposed to build more efficient eco-feedback simulation models. Findings indicate that the influence condition and the weight factor of connected opinions have a significant impact on the accuracy of simulation outputs, which were compared to field experiment reports. Accordingly, ROD, which gives the closest prediction to the field data, is suggested for eco-feedback simulations.

【21】 A Stochastic Bregman Primal-Dual Splitting Algorithm for Composite Optimization Link: https://arxiv.org/abs/2112.11928

Authors: Antonio Silveti-Falls, Cesare Molinari, Jalal Fadili. Abstract: We study a stochastic first order primal-dual method for solving convex-concave saddle point problems over real reflexive Banach spaces using Bregman divergences and relative smoothness assumptions, in which we allow for stochastic error in the computation of gradient terms within the algorithm. We show ergodic convergence in expectation of the Lagrangian optimality gap with a rate of O(1/k) and that every almost sure weak cluster point of the ergodic sequence is a saddle point in expectation under mild assumptions. Under slightly stricter assumptions, we show almost sure weak convergence of the pointwise iterates to a saddle point. Under a relative strong convexity assumption on the objective functions and a total convexity assumption on the entropies of the Bregman divergences, we establish almost sure strong convergence of the pointwise iterates to a saddle point. Our framework is general and does not need strong convexity of the entropies inducing the Bregman divergences in the algorithm. Numerical applications are considered including entropically regularized Wasserstein barycenter problems and regularized inverse problems on the simplex.
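
For readers unfamiliar with Bregman divergences, the following minimal sketch shows the standard definition D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> in finite dimensions. It is not the paper's algorithm, which works in Banach spaces with stochastic gradients; the function names and the test vectors are hypothetical. With the negative entropy as phi, the divergence between two points of the simplex reduces to the Kullback-Leibler divergence, which is one reason entropies are a natural choice for problems on the simplex.

```python
import math


def neg_entropy(x):
    """phi(x) = sum_i x_i log x_i, defined for strictly positive entries."""
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    """Gradient of the negative entropy: log x_i + 1."""
    return [math.log(v) + 1.0 for v in x]


def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    g = grad_phi(y)
    inner = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
    return phi(x) - phi(y) - inner


def kl(x, y):
    """Kullback-Leibler divergence between two probability vectors."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))


# Two illustrative points in the interior of the probability simplex.
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
d = bregman(neg_entropy, neg_entropy_grad, p, q)
```

Swapping in phi(x) = 0.5 * ||x||^2 with gradient x gives half the squared Euclidean distance, so the familiar Euclidean setting is a special case of the same definition.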

【22】 Constraining cosmological parameters from N-body simulations with Bayesian Neural Networks Link: https://arxiv.org/abs/2112.11865

Authors: Hector J. Hortua. Affiliation: Universidad Nacional Abierta y a Distancia (UNAD), CEAD José Acevedo y Gómez, Bogotá, Colombia. Notes: Published at NeurIPS 2021 workshop: Bayesian Deep Learning. Abstract: In this paper, we use the Quijote simulations to extract cosmological parameters through Bayesian Neural Networks (BNNs). This kind of model has a remarkable ability to estimate the associated uncertainty, which is one of the ultimate goals in the precision cosmology era. We demonstrate the advantages of BNNs for extracting more complex output distributions and non-Gaussian information from the simulations.

【23】 Dynamics of senses of new physics discourse: co-keywords analysis Link: https://arxiv.org/abs/2112.11829

Authors: Yurij L. Katchanov, Yulia V. Markova. Affiliations: Institute for Statistical Studies and Economics of Knowledge, National Research University Higher School of Economics, Myasnitskaya Ulitsa, Moscow, Russian Federation; American Association for the Advancement of Science, New York Ave NW, Washington, DC, USA. Notes: 36 pages, 10 figures. Abstract: The paper presents a longitudinal analysis of the evolution of new physics keyword co-occurrence patterns. To that end, we explore the documents indexed in the INSPIRE database from 1989 to 2018. Our purpose is to quantify the knowledge structure of the fast-growing subfield of new physics. The development of a novel approach to keyword co-occurrence analysis is the main point of the paper. In contrast to traditional co-keyword network analysis, we investigate structures that unite physics concepts in different documents and bind different documents with the same physics concepts. We consider the structures that reveal relationships among concepts as topological and call them "physics senses". Based on the notion of trajectory mutual information, the paper offers clustering of physics senses, determines their period of life, and constructs a classification of senses' "authority".

【24】 Bayesian Approaches to Shrinkage and Sparse Estimation Link: https://arxiv.org/abs/2112.11751

Authors: Dimitris Korobilis, Kenichi Shimizu. Affiliation: University of Glasgow. Abstract: In all areas of human knowledge, datasets are increasing in both size and complexity, creating the need for richer statistical models. This trend is also true for economic data, where high-dimensional and nonlinear/nonparametric inference is the norm in several fields of applied econometric work. The purpose of this paper is to introduce the reader to the world of Bayesian model determination, by surveying modern shrinkage and variable selection algorithms and methodologies. Bayesian inference is a natural probabilistic framework for quantifying uncertainty and learning about model parameters, and this feature is particularly important for inference in modern models of high dimensions and increased complexity. We begin with a linear regression setting in order to introduce various classes of priors that lead to shrinkage/sparse estimators of comparable value to popular penalized likelihood estimators (e.g. ridge, lasso). We explore various methods of exact and approximate inference, and discuss their pros and cons. Finally, we explore how priors developed for the simple regression setting can be extended in a straightforward way to various classes of interesting econometric models. In particular, the following case studies are considered, which demonstrate the application of Bayesian shrinkage and variable selection strategies to popular econometric contexts: i) vector autoregressive models; ii) factor models; iii) time-varying parameter regressions; iv) confounder selection in treatment effects models; and v) quantile regression models. A MATLAB package and an accompanying technical manual allow the reader to replicate many of the algorithms described in this review.
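
The link between shrinkage priors and penalized likelihood estimators can be seen in the simplest possible case. The sketch below is not taken from the paper or its MATLAB package; it assumes a single regression coefficient with no intercept, a known noise variance, and a zero-mean Gaussian prior, and all names and numbers are hypothetical. Under those assumptions the posterior mean coincides with the ridge estimate whose penalty equals the ratio of noise variance to prior variance.

```python
def ridge_slope(xs, ys, lam):
    """Minimiser of sum((y - b*x)^2) + lam * b^2 for a single coefficient b."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)


def posterior_mean(xs, ys, noise_var, prior_var):
    """Posterior mean of b when y ~ N(b*x, noise_var) and b ~ N(0, prior_var)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    precision = sxx / noise_var + 1.0 / prior_var  # posterior precision of b
    return (sxy / noise_var) / precision


# Illustrative data and variances (assumed, not from the paper).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
noise_var, prior_var = 0.25, 2.0

b_bayes = posterior_mean(xs, ys, noise_var, prior_var)
b_ridge = ridge_slope(xs, ys, lam=noise_var / prior_var)
```

A tighter prior (smaller `prior_var`) corresponds to a larger penalty and pulls the estimate further towards zero; sparsity-inducing priors such as those surveyed in the paper change the shape of that pull rather than the basic mechanism.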

【25】 Identifying Mixtures of Bayesian Network Distributions Link: https://arxiv.org/abs/2112.11602

Authors: Spencer L. Gordon, Bijan Mazaheri, Yuval Rabani, Leonard J. Schulman. Abstract: A Bayesian Network is a directed acyclic graph (DAG) on a set of $n$ random variables (identified with the vertices); a Bayesian Network Distribution (BND) is a probability distribution on the random variables that is Markovian on the graph. A finite mixture of such models is the projection on these variables of a BND on the larger graph which has an additional "hidden" (or "latent") random variable $U$, ranging in $\{1,\ldots,k\}$, and a directed edge from $U$ to every other vertex. Models of this type are fundamental to research in Causal Inference, where $U$ models a confounding effect. One extremely special case has been of longstanding interest in the theory literature: the empty graph. Such a distribution is simply a mixture of $k$ product distributions. A longstanding problem has been, given the joint distribution of a mixture of $k$ product distributions, to identify each of the product distributions, and their mixture weights. Our results are: (1) We improve the sample complexity (and runtime) for identifying mixtures of $k$ product distributions from $\exp(O(k^2))$ to $\exp(O(k \log k))$. This is almost best possible in view of a known $\exp(\Omega(k))$ lower bound. (2) We give the first algorithm for the case of non-empty graphs. The complexity for a graph of maximum degree $\Delta$ is $\exp(O(k(\Delta^2 + \log k)))$. (The above complexities are approximate and suppress dependence on secondary parameters.)

【26】 Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems Link: https://arxiv.org/abs/2112.11507

Authors: Zongyu Dai, Zhiqi Bu, Qi Long. Affiliations: Department of AMCS, University of Pennsylvania, Philadelphia, USA, Division of Biostatistics. Abstract: Missing data are present in most real-world problems and need careful handling to preserve the prediction accuracy and statistical consistency in the downstream analysis. As the gold standard of handling missing data, multiple imputation (MI) methods are proposed to account for the imputation uncertainty and provide proper statistical inference. In this work, we propose Multiple Imputation via Generative Adversarial Network (MI-GAN), a deep learning-based (specifically, a GAN-based) multiple imputation method that can work under the missing at random (MAR) mechanism with theoretical support. MI-GAN leverages recent progress in conditional generative adversarial networks and shows strong performance matching existing state-of-the-art imputation methods on high-dimensional datasets, in terms of imputation error. In particular, MI-GAN significantly outperforms other imputation methods in the sense of statistical inference and computational speed.
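
To make concrete how multiple imputation accounts for imputation uncertainty, the sketch below applies Rubin's rules, the standard way of combining analyses of several imputed datasets. The abstract does not say how MI-GAN pools its results, so this is an assumption for illustration only; the function name `pool_rubin` and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


# Estimates of one parameter from three imputed datasets (made-up numbers).
estimates = [1.0, 1.2, 0.8]
variances = [0.04, 0.05, 0.03]

pooled, total_var = pool_rubin(estimates, variances)
```

The between-imputation term is what a single imputation cannot supply: it grows when the imputed datasets disagree, widening the resulting confidence intervals accordingly.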

The Chinese translations in the original post were machine-generated and are for reference only.

Originally published: 2021-12-23. In case of infringement, contact cloudcommunity@tencent.com for removal.
