
Statistics arXiv Daily Digest [7.28]

By: WeChat official account "arXiv Daily Academic Digest"
Published 2021-07-29 14:08:59

Visit www.arxivdaily.com for digests with abstracts, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, favourites, posting, and more. Click "Read the original" to access.

stat (Statistics): 29 papers in total

【1】 A robust spline approach in partially linear additive models

Authors: Graciela Boente, Alejandra Mercedes Martinez
Affiliations: CONICET and Universidad de Buenos Aires, Argentina; CONICET and Universidad Nacional de Luján, Argentina
Link: https://arxiv.org/abs/2107.12987
Abstract: Partially linear additive models generalize linear models: they model the relation between a response variable and covariates by assuming that some covariates have a linear relation with the response while each of the others enters through an unknown univariate smooth function. The harmful effect of outliers, either in the residuals or in the covariates involved in the linear component, has been described for partially linear models, that is, when only one nonparametric component is involved in the model. When dealing with additive components, the problem of providing reliable estimators when atypical data arise is of practical importance, motivating the need for robust procedures. Hence, we propose a family of robust estimators for partially linear additive models by combining $B$-splines with robust linear regression estimators. We obtain consistency results, rates of convergence and asymptotic normality for the linear components under mild assumptions. A Monte Carlo study is carried out to compare the performance of the robust proposal with its classical counterpart under different models and contamination schemes. The numerical experiments show the advantage of the proposed methodology for finite samples. We also illustrate the usefulness of the proposed approach on a real data set.

【2】 School neighbourhood and compliance with WHO-recommended annual NO2 guideline: a case study of Greater London

Authors: Niloofar Shoari, Shahram Heydari, Marta Blangiardo
Affiliations: a) MRC Centre for Environment & Health, Department of Epidemiology and Biostatistics, Imperial College London, London, UK; b) Department of Civil, Maritime, and Environmental Engineering, University of Southampton, Southampton, UK
Link: https://arxiv.org/abs/2107.12952
Abstract: Despite several national and local policies towards cleaner air in England, many schools in London breach the WHO-recommended concentrations of air pollutants such as NO2 and PM2.5. This is despite previous studies highlighting significant adverse effects of air pollutants on children's health. In this paper we adopted a Bayesian spatial hierarchical model to investigate factors that affect the odds of schools exceeding the WHO-recommended concentration of NO2 (i.e., 40 µg/m3 annual mean) in Greater London (UK). We considered a host of variables, including schools' characteristics as well as their neighbourhoods' attributes from household, socioeconomic, transport-related, land use, and built and natural environment perspectives. The results indicated that transport-related factors, including the number of traffic lights and bus stops in the immediate vicinity of schools and borough-level bus fuel consumption, are determinant factors that increase the likelihood of non-compliance with the WHO guideline. In contrast, distance from roads, river transport, and underground stations, vehicle speed (an indicator of traffic congestion), the proportion of borough-level green space, and the area of green space at schools reduce the likelihood of exceeding the WHO-recommended concentration of NO2. As a sensitivity analysis, we repeated our analysis under a hypothetical scenario in which the recommended concentration of NO2 is 35 µg/m3 instead of 40 µg/m3. Our results underscore the importance of adopting clean fuel technologies on buses, installing green barriers, and reducing motorised traffic around schools in reducing exposure to NO2 concentrations in proximity to schools. This study would be useful for local-authority decision making with the aim of improving air quality for school-aged children in urban settings.

【3】 Subset selection for linear mixed models

Authors: Daniel R. Kowal
Link: https://arxiv.org/abs/2107.12890
Abstract: Linear mixed models (LMMs) are instrumental for regression analysis with structured dependence, such as grouped, clustered, or multilevel data. However, selection among the covariates, while accounting for this structured dependence, remains a challenge. We introduce a Bayesian decision analysis for subset selection with LMMs. Using a Mahalanobis loss function that incorporates the structured dependence, we derive optimal linear actions for any subset of covariates and under any Bayesian LMM. Crucially, these actions inherit shrinkage or regularization and uncertainty quantification from the underlying Bayesian LMM. Rather than selecting a single "best" subset, which is often unstable and limited in its information content, we collect the acceptable family of subsets that nearly match the predictive ability of the "best" subset. The acceptable family is summarized by its smallest member and key variable importance metrics. Customized subset search and out-of-sample approximation algorithms are provided for more scalable computing. These tools are applied to simulated data and a longitudinal physical activity dataset, and in both cases demonstrate excellent prediction, estimation, and selection ability.

【4】 Longitudinal Latent Overall Toxicity (LOTox) profiles in osteosarcoma: a new taxonomy based on latent Markov models

Authors: Marta Spreafico, Francesca Ieva, Marta Fiocco
Affiliations: MOX, Department of Mathematics, Politecnico di Milano, Milan, Italy; Mathematical Institute, Leiden University, Leiden, The Netherlands; CHRP, National Center for Healthcare Research and Pharmacoepidemiology, Milan, Italy
Link: https://arxiv.org/abs/2107.12863
Abstract: In cancer trials, the analysis of longitudinal toxicity data is a difficult task due to the presence of multiple adverse events with different extents of toxic burden. Models that summarise and quantify the overall toxic risk and its evolution during cancer therapy, handling both the longitudinal and the categorical aspects of toxicity-level progression, are necessary but still not well developed. In this work, a novel approach based on latent Markov (LM) models for longitudinal toxicity data is proposed. The latent status of interest is the Latent Overall Toxicity (LOTox) condition of each patient, which affects the distribution of the observed toxic levels over treatment. By assuming the existence of an LM chain for LOTox dynamics, the new approach aims at identifying different latent states of overall toxicity burden (LOTox states). This approach investigates how patients move between latent states during chemotherapy treatment, allowing for the reconstruction of personalized longitudinal LOTox profiles. This methodology has never been applied to osteosarcoma treatment and provides new insights for medical decisions in childhood cancer therapy.

【5】 Sequentially estimating the dynamic contact angle of sessile saliva droplets in view of SARS-CoV-2

Authors: Sudeep R. Bapat
Affiliations: Indian Institute of Management, Indore, India
Link: https://arxiv.org/abs/2107.12857
Abstract: Estimating the contact angle of a virus-infected saliva droplet is an important area of research, as it provides an idea of the drying time of the respective droplet and, in turn, of the growth of the underlying pandemic. In this paper we extend the data presented by Balusamy, Banerjee and Sahu [Lifetime of sessile saliva droplets in the context of SARS-CoV-2, Int. J. Heat Mass Transf. 123, 105178 (2021)]: the contact angles are fitted using a newly proposed half-circular wrapped-exponential model, and a sequential confidence interval estimation approach is established which largely reduces both time and cost with regard to data collection.

【6】 Wasserstein-Splitting Gaussian Process Regression for Heterogeneous Online Bayesian Inference

Authors: Michael E. Kepler, Alec Koppel, Amrit Singh Bedi, Daniel J. Stilwell
Affiliations: Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University
Link: https://arxiv.org/abs/2107.12797
Abstract: Gaussian processes (GPs) are a well-known nonparametric Bayesian inference technique, but they suffer from scalability problems for large sample sizes, and their performance can degrade for non-stationary or spatially heterogeneous data. In this work, we seek to overcome these issues through (i) employing variational free energy approximations of GPs operating in tandem with online expectation propagation steps; and (ii) introducing a local splitting step which instantiates a new GP whenever the posterior distribution changes significantly, as quantified by the Wasserstein metric over posterior distributions. Over time, this yields an ensemble of sparse GPs which may be updated incrementally and adapts to locality, heterogeneity, and non-stationarity in training data.
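For Gaussian posteriors, the Wasserstein-2 distance underlying the splitting step has a closed form, which makes the split test cheap. Below is a minimal sketch; the threshold `tau` and the `should_split` rule are hypothetical simplifications for illustration, not the paper's exact criterion.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, cov1, mu2, cov2):
    """Closed-form 2-Wasserstein distance between two Gaussian distributions."""
    s2 = sqrtm(cov2)
    cross = sqrtm(s2 @ cov1 @ s2)
    d2 = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * np.real(cross))
    return float(np.sqrt(max(d2, 0.0)))

def should_split(mu_old, cov_old, mu_new, cov_new, tau=1.0):
    # hypothetical rule: spawn a new local GP when the posterior has moved "far"
    return gaussian_w2(mu_old, cov_old, mu_new, cov_new) > tau

mu, cov = np.zeros(2), np.eye(2)
print(gaussian_w2(mu, cov, mu, cov))                              # 0.0
print(gaussian_w2(mu, cov, mu + np.array([3.0, 4.0]), cov))       # 5.0
print(should_split(mu, cov, mu + np.array([3.0, 4.0]), cov))      # True
```

With equal covariances the distance reduces to the Euclidean distance between the means, which is why the shifted example yields exactly 5.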

【7】 Statistical Guarantees for Fairness Aware Plug-In Algorithms

Authors: Drona Khurana, Srinivasan Ravichandran, Sparsh Jain, Narayanan Unny Edakunni
Note: Accepted at the Workshop on Socially Responsible Machine Learning, ICML 2021
Link: https://arxiv.org/abs/2107.12783
Abstract: A plug-in algorithm to estimate Bayes optimal classifiers for fairness-aware binary classification was proposed in (Menon & Williamson, 2018). However, the statistical efficacy of their approach has not been established. We prove that the plug-in algorithm is statistically consistent. We also derive finite sample guarantees associated with learning the Bayes optimal classifiers via the plug-in algorithm. Finally, we propose a protocol that modifies the plug-in approach, so as to simultaneously guarantee fairness and differential privacy with respect to a binary feature deemed sensitive.

【8】 Deep Neural Networks for Detecting Statistical Model Misspecifications. The Case of Measurement Invariance

Authors: Artur Pokropek, Ernest Pokropek
Affiliations: Institute of Philosophy and Sociology, Polish Academy of Sciences, Warsaw, Poland; Division of Robotics, Perception, and Learning, KTH Royal Institute of Technology, Stockholm, Sweden
Note: 30 pages, 7 figures, 4 tables
Link: https://arxiv.org/abs/2107.12757
Abstract: While in recent years a number of new statistical approaches have been proposed to model group differences under different assumptions on the nature of the measurement invariance of the instruments, the tools for detecting local misspecifications of these models have not been fully developed yet. The main type of local misspecification concerning comparability is the non-invariance of indicators (also called differential item functioning; DIF). Such non-invariance could arise from poor translations or significant cultural differences. In this study, we present a novel approach to detect such misspecifications using a deep neural network (DNN). We compared the proposed model with the most popular traditional methods: modification indices (MI) and expected parameter change (EPC) indicators from confirmatory factor analysis (CFA) modelling, logistic DIF detection, and the sequential procedure introduced with the CFA alignment approach. Simulation studies show that the proposed method outperformed traditional methods in almost all scenarios, or was at least as accurate as the best one. We also provide an empirical example utilizing European Social Survey (ESS) data, including items known to be mistranslated, which are correctly identified by our approach and by DIF detection based on logistic regression. This study provides a strong foundation for the future development of machine learning algorithms for the detection of statistical model misspecifications.

【9】 Technical properties of Ranked Nodes Method

Authors: Pekka Laitila, Kai Virtanen
Affiliations: Department of Mathematics and Systems Analysis, Aalto University, Helsinki, Finland; Department of Military Technology, Finnish National Defence University, Helsinki, Finland
Note: 39 pages, 3 figures
Link: https://arxiv.org/abs/2107.12747
Abstract: This paper presents analytical and experimental results on the ranked nodes method (RNM), which is used to construct conditional probability tables for Bayesian networks by expert elicitation. The majority of the results focus on a setting in which RNM is applied to a child node and parent nodes that all have the same number of discrete ordinal states. The results point to RNM properties that can be used to support its future elaboration and development.

【10】 Stability & Generalisation of Gradient Descent for Shallow Neural Networks without the Neural Tangent Kernel

Authors: Dominic Richards, Ilja Kuzborskij
Affiliations: University of Oxford; DeepMind
Link: https://arxiv.org/abs/2107.12723
Abstract: We revisit on-average algorithmic stability of Gradient Descent (GD) for training overparameterised shallow neural networks and prove new generalisation and excess risk bounds without the Neural Tangent Kernel (NTK) or Polyak-{\L}ojasiewicz (PL) assumptions. In particular, we show oracle-type bounds which reveal that the generalisation and excess risk of GD are controlled by an interpolating network with the shortest GD path from initialisation (in a sense, an interpolating network with the smallest relative norm). While this was known for kernelised interpolants, our proof applies directly to networks trained by GD without intermediate kernelisation. At the same time, by relaxing oracle inequalities developed here we recover existing NTK-based risk bounds in a straightforward way, which demonstrates that our analysis is tighter. Finally, unlike most of the NTK-based analyses, we focus on regression with label noise and show that GD with early stopping is consistent.

【11】 LinCDE: Conditional Density Estimation via Lindsey's Method

Authors: Zijun Gao, Trevor Hastie
Affiliations: Department of Statistics and Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Note: 50 pages, 20 figures
Link: https://arxiv.org/abs/2107.12713
Abstract: Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey's method (LinCDE). LinCDE admits flexible modeling of the density family and can capture distributional characteristics like modality and shape. In particular, when suitably parametrized, LinCDE will produce smooth and non-negative density estimates. Furthermore, like boosted regression trees, LinCDE does automatic feature selection. We demonstrate LinCDE's efficacy through extensive simulations and several real data examples.
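Lindsey's method, the classical ingredient here, reduces (unconditional) density estimation to Poisson regression on binned counts: the fitted expected counts, renormalised by the bin width, give a smooth non-negative density. A rough sketch with a quadratic log-density basis follows; LinCDE instead boosts trees over such bases, and the bin grid and basis choice below are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(1)
x = rng.normal(size=2000)                      # toy sample whose density we estimate

# Lindsey's method: bin the data, then regress counts on a basis of the bin centers
k, lo, hi = 40, -4.0, 4.0
counts, edges = np.histogram(x, bins=k, range=(lo, hi))
centers = 0.5 * (edges[:-1] + edges[1:])
X = np.column_stack([centers, centers ** 2])   # log-density modelled as a quadratic

fit = PoissonRegressor(alpha=1e-8, max_iter=1000).fit(X, counts)
mu_hat = fit.predict(X)                        # fitted expected count per bin
width = edges[1] - edges[0]
dens = mu_hat / (mu_hat.sum() * width)         # renormalise to a proper density

print(abs(centers[np.argmax(dens)]) < 0.5)     # mode recovered near 0 for N(0,1) data
```

A quadratic log-density corresponds to a Gaussian family, so this sketch effectively fits a normal density through a Poisson GLM, which is the smoothness/non-negativity point the abstract makes.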

【12】 $L_1$ density estimation from privatised data

Authors: László Györfi, Martin Kroll
Link: https://arxiv.org/abs/2107.12649
Abstract: We revisit the classical problem of nonparametric density estimation, but impose local differential privacy constraints. Under such constraints, the original data $X_1,\ldots,X_n$, taking values in $\mathbb{R}^d$, cannot be directly observed, and all estimators are functions of the randomised output of a suitable privacy mechanism. The statistician is free to choose the form of the privacy mechanism, and in this work we propose to add Laplace distributed noise to a discretisation of the location of a vector $X_i$. Based on these randomised data, we design a novel estimator of the density function, which can be viewed as a privatised version of the well-studied histogram density estimator. Our theoretical results include universal pointwise consistency and strong universal $L_1$-consistency. In addition, a convergence rate over classes of Lipschitz functions is derived, which is complemented by a matching minimax lower bound.
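The mechanism described in the abstract is easy to sketch: each individual releases only a one-hot discretisation of their value plus independent Laplace noise, and the aggregated noisy reports form a privatised histogram. The noise scale and the clip-and-renormalise step below are simplifying assumptions for illustration, not the paper's exact calibration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps, k = 100_000, 1.0, 10
x = rng.uniform(0, 1, size=n)                  # raw data, never shared directly

bins = np.minimum((x * k).astype(int), k - 1)  # discretise each location into k bins
one_hot = np.eye(k)[bins]

# local privacy: every user releases their one-hot vector plus Laplace noise
noisy_reports = one_hot + rng.laplace(scale=2.0 / eps, size=(n, k))

# aggregate the reports, clip negatives, renormalise -> density estimate on [0, 1]
est_counts = noisy_reports.sum(axis=0).clip(min=0.0)
density = est_counts / est_counts.sum() * k    # each bin has width 1/k

print(np.round(density, 2))                    # roughly 1.0 in every bin (uniform data)
```

The per-coordinate noise makes each report epsilon-locally differentially private, while the noise averages out across users, which is the trade-off the paper's rates quantify.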

【13】 Extrapolation Estimation for Nonparametric Regression with Measurement Error

Authors: Weixing Song, Kanwal Ayub, Jianhong Shi
Affiliations: Department of Statistics, Kansas State University, Manhattan, KS; School of Mathematics and Computer Sciences, Shanxi Normal University, Linfen, China
Link: https://arxiv.org/abs/2107.12586
Abstract: For nonparametric regression models with covariates contaminated with normal measurement errors, this paper proposes an extrapolation algorithm to estimate the nonparametric regression functions. By applying the conditional expectation directly to the kernel-weighted least squares of the deviations between the local linear approximation and the observed responses, the proposed algorithm successfully bypasses the simulation step needed in the classical simulation extrapolation method, thus significantly reducing the computational time. Notably, the proposed method also provides an exact form of the extrapolation function, but the extrapolation estimate generally cannot be obtained by simply setting the extrapolation variable to negative one in the fitted extrapolation function if the bandwidth is less than the standard deviation of the measurement error. Large sample properties of the proposed estimation procedure are discussed, and simulation studies and a real data example are conducted to illustrate its applications.

【14】 Spatial prediction of apartment rent using regression-based and machine learning-based approaches with a large dataset

Authors: Takahiro Yoshida, Hajime Seya
Affiliations: Graduate School of Engineering, The University of Tokyo, Japan; Graduate School of Engineering, Kobe University, Japan
Link: https://arxiv.org/abs/2107.12539
Abstract: Employing a large dataset (at most, of the order of n = 10^6), this study attempts to enhance the literature on the comparison between regression- and machine learning (ML)-based rent price prediction models by adding new empirical evidence and considering the spatial dependence of the observations. The regression-based approach incorporates the nearest-neighbor Gaussian processes (NNGP) model, enabling the application of kriging to large datasets. In contrast, the ML-based approach utilizes typical models: extreme gradient boosting (XGBoost), random forest (RF), and deep neural network (DNN). The out-of-sample prediction accuracy of these models was compared using Japanese apartment rent data, with a varying order of sample sizes (i.e., n = 10^4, 10^5, 10^6). The results showed that, as the sample size increased, XGBoost and RF outperformed NNGP with higher out-of-sample prediction accuracy. XGBoost achieved the highest prediction accuracy for all sample sizes and error measures in both logarithmic and real scales and for all price bands (when n = 10^5 and 10^6). A comparison of several methods to account for the spatial dependence in RF showed that simply adding spatial coordinates to the explanatory variables may be sufficient.
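The closing finding, that simply adding spatial coordinates to the features lets a tree ensemble absorb spatial dependence, can be mimicked on synthetic data. This is a hedged illustration: the data-generating process, sample size, and hyperparameters below are invented for the sketch, not taken from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
n = 3000
coords = rng.uniform(0, 10, size=(n, 2))       # longitude/latitude-like coordinates
area = rng.uniform(20, 80, size=n)             # structural attribute (floor area)
# rent = structural effect + smooth spatial surface + noise
rent = (0.5 * area + 10 * np.sin(coords[:, 0]) + 10 * np.cos(coords[:, 1])
        + rng.normal(scale=1.0, size=n))

X_base = area.reshape(-1, 1)                   # structural features only
X_spatial = np.column_stack([area, coords])    # ... plus raw coordinates
idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3, random_state=0)

def fit_score(X):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[idx_tr], rent[idx_tr])
    return r2_score(rent[idx_te], model.predict(X[idx_te]))

r2_base, r2_spatial = fit_score(X_base), fit_score(X_spatial)
print(r2_base < r2_spatial)                    # True: coordinates capture the surface
```

Without the coordinates, the spatial surface is irreducible noise to the forest; with them, the trees partition space and recover most of it.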

【15】 Proof: Accelerating Approximate Aggregation Queries with Expensive Predicates

Authors: Daniel Kang, John Guibas, Peter Bailis, Tatsunori Hashimoto, Yi Sun, Matei Zaharia
Affiliations: Stanford University; University of Chicago
Link: https://arxiv.org/abs/2107.12525
Abstract: Given a dataset $\mathcal{D}$, we are interested in computing the mean of a subset of $\mathcal{D}$ which matches a predicate. \algname leverages stratified sampling and proxy models to efficiently compute this statistic given a sampling budget $N$. In this document, we theoretically analyze \algname and show that the MSE of the estimate decays at rate $O(N_1^{-1} + N_2^{-1} + N_1^{1/2}N_2^{-3/2})$, where $N = K \cdot N_1 + N_2$ for some integer constant $K$, and $K \cdot N_1$ and $N_2$ represent the number of samples used in Stage 1 and Stage 2 of \algname respectively. Hence, if a constant fraction of the total sample budget $N$ is allocated to each stage, we achieve a mean squared error of $O(N^{-1})$, which matches the rate of mean squared error of the optimal stratified sampling algorithm given a priori knowledge of the predicate positive rate and standard deviation per stratum.
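The benefit of stratification that this analysis quantifies, removing between-strata variance from the estimate, shows up already in a toy sketch. Here the strata stand in for the proxy model's predictions; all sizes and distributions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N_pop = 100_000
# population with very different behaviour across 3 strata
strata = rng.integers(0, 3, size=N_pop)
values = np.where(strata == 0, rng.normal(1, 0.1, N_pop),
         np.where(strata == 1, rng.normal(5, 0.1, N_pop),
                               rng.normal(9, 0.1, N_pop)))
true_mean = values.mean()

# stratified estimate: sample within each stratum, reweight by stratum share
budget = 300
est = 0.0
for s in range(3):
    members = np.flatnonzero(strata == s)
    sample = rng.choice(members, size=budget // 3, replace=False)
    est += (len(members) / N_pop) * values[sample].mean()

print(abs(est - true_mean) < 0.05)  # only the small within-stratum variance remains
```

A simple random sample of the same budget would carry the full between-strata spread (values near 1, 5, and 9) into its variance; stratified allocation leaves only the 0.1 within-stratum noise, which is the mechanism behind the $O(N^{-1})$ rate above.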

【16】 Outcome-Adjusted Balance Measure for Generalized Propensity Score Model Selection

Authors: Honghe Zhao, Shu Yang
Affiliations: Department of Statistics, North Carolina State University
Link: https://arxiv.org/abs/2107.12487
Abstract: In this article, we propose the outcome-adjusted balance measure to perform model selection for the generalized propensity score (GPS), which serves as an essential component in estimation of the pairwise average treatment effects (ATEs) in observational studies with more than two treatment levels. The primary goal of the balance measure is to identify the GPS model specification such that the resulting ATE estimator is consistent and efficient. Following recent empirical and theoretical evidence, we establish that the optimal GPS model should only include covariates related to the outcomes. Given a collection of candidate GPS models, the outcome-adjusted balance measure imputes all baseline covariates by matching on each candidate model, and selects the model that minimizes a weighted sum of absolute mean differences between the imputed and original values of the covariates. The weights are defined to leverage the covariate-outcome relationship, so that GPS models without optimal variable selection are penalized. Under appropriate assumptions, we show that the outcome-adjusted balance measure consistently selects the optimal GPS model, so that the resulting GPS matching estimator is asymptotically normal and efficient. We compare its finite sample performance with existing measures in a simulation study. We illustrate an application of the proposed methodology in the analysis of the Tutoring data.

【17】 Efficient Treatment Effect Estimation in Observational Studies under Heterogeneous Partial Interference

Authors: Zhaonan Qu, Ruoxuan Xiong, Jizhou Liu, Guido Imbens
Link: https://arxiv.org/abs/2107.12420
Abstract: In many observational studies in social science and medical applications, subjects or individuals are connected, and one unit's treatment and attributes may affect another unit's treatment and outcome, violating the stable unit treatment value assumption (SUTVA) and resulting in interference. To enable feasible inference, many previous works assume the "exchangeability" of interfering units, under which the effect of interference is captured by the number or ratio of treated neighbors. However, in many applications with distinctive units, interference is heterogeneous. In this paper, we focus on the partial interference setting, and restrict units to be exchangeable conditional on observable characteristics. Under this framework, we propose generalized augmented inverse propensity weighted (AIPW) estimators for general causal estimands that include direct treatment effects and spillover effects. We show that they are consistent, asymptotically normal, semiparametric efficient, and robust to heterogeneous interference as well as model misspecifications. We also apply our method to the Add Health dataset and find that smoking behavior exhibits interference on academic outcomes.
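The classical AIPW (doubly robust) form that the paper generalizes to interference can be sketched on iid synthetic data without any network structure. Everything below, the data-generating process and the nuisance models, is a simplified assumption for illustration, not the paper's generalized estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=(n, 2))
propensity = 1 / (1 + np.exp(-0.5 * x[:, 0]))               # true treatment model
t = rng.binomial(1, propensity)
y = 2.0 * t + x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)  # true ATE = 2

# fit the nuisance models: propensity score and outcome regressions per arm
e_hat = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
mu1 = LinearRegression().fit(x[t == 1], y[t == 1]).predict(x)
mu0 = LinearRegression().fit(x[t == 0], y[t == 0]).predict(x)

# AIPW score: outcome-model prediction augmented by the weighted residuals
ate = np.mean(mu1 - mu0
              + t * (y - mu1) / e_hat
              - (1 - t) * (y - mu0) / (1 - e_hat))
print(ate)  # close to the true effect of 2
```

The augmentation term keeps the estimator consistent if either the propensity or the outcome model is correct, the "double robustness" the abstract extends to spillover effects.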

【18】 Channel-Wise Early Stopping without a Validation Set via NNK Polytope Interpolation

Authors: David Bonet, Antonio Ortega, Javier Ruiz-Hidalgo, Sarath Shekkizhar
Affiliations: Universitat Politecnica de Catalunya, Barcelona, Spain; University of Southern California, Los Angeles, USA
Note: Submitted to APSIPA 2021
Link: https://arxiv.org/abs/2107.12972
Abstract: State-of-the-art neural network architectures continue to scale in size and deliver impressive generalization results, although this comes at the expense of limited interpretability. In particular, a key challenge is to determine when to stop training the model, as this has a significant impact on generalization. Convolutional neural networks (ConvNets) comprise high-dimensional feature spaces formed by the aggregation of multiple channels, where analyzing intermediate data representations and the model's evolution can be challenging owing to the curse of dimensionality. We present channel-wise DeepNNK (CW-DeepNNK), a novel channel-wise generalization estimate based on non-negative kernel regression (NNK) graphs with which we perform local polytope interpolation on low-dimensional channels. This method leads to instance-based interpretability of both the learned data representations and the relationship between channels. Motivated by our observations, we use CW-DeepNNK to propose a novel early stopping criterion that (i) does not require a validation set, (ii) is based on a task performance metric, and (iii) allows stopping to be reached at different points for each channel. Our experiments demonstrate that our proposed method has advantages as compared to the standard criterion based on validation set performance.

【19】 Finding Failures in High-Fidelity Simulation using Adaptive Stress Testing and the Backward Algorithm

Authors: Mark Koren, Ahmed Nassar, Mykel J. Kochenderfer
Affiliations: Stanford University
Note: Accepted to IROS 2021
Link: https://arxiv.org/abs/2107.12940
Abstract: Validating the safety of autonomous systems generally requires the use of high-fidelity simulators that adequately capture the variability of real-world scenarios. However, it is generally not feasible to exhaustively search the space of simulation scenarios for failures. Adaptive stress testing (AST) is a method that uses reinforcement learning to find the most likely failure of a system. AST with a deep reinforcement learning solver has been shown to be effective in finding failures across a range of different systems. This approach generally involves running many simulations, which can be very expensive when using a high-fidelity simulator. To improve efficiency, we present a method that first finds failures in a low-fidelity simulator. It then uses the backward algorithm, which trains a deep neural network policy using a single expert demonstration, to adapt the low-fidelity failures to high-fidelity. We have created a series of autonomous vehicle validation case studies that represent some of the ways low-fidelity and high-fidelity simulators can differ, such as time discretization. We demonstrate in a variety of case studies that this new AST approach is able to find failures with significantly fewer high-fidelity simulation steps than are needed when just running AST directly in high-fidelity. As a proof of concept, we also demonstrate AST on NVIDIA's DriveSim simulator, an industry state-of-the-art high-fidelity simulator for finding failures in autonomous vehicles.

【20】 Individual Survival Curves with Conditional Normalizing Flows 标题:具有条件归一化流的个体生存曲线

作者:Guillaume Ausset,Tom Ciffreo,Francois Portier,Stephan Clémençon,Timothée Papin 机构:∗Télécom Paris, Saclay, France, † BNP Paribas, Paris, France 备注:IEEE DSAA '21 链接:https://arxiv.org/abs/2107.12825 摘要:生存分析,或称事件时间模型,是一个经典的统计学问题,由于其在流行病学、人口统计学或精算学中的实际应用而引起了广泛的兴趣。在个体化医学兴起的推动下,从机器学习的角度研究这一课题的最新进展关注的是精确的个体预测,而不是人口研究。我们在这里介绍了一种基于条件归一化流的事件时间密度估计方法,作为一种高度灵活和个性化的条件生存分布模型。我们使用一个新的规范化流的层次公式来实现灵活的条件分布的有效拟合而不需要过度拟合,并展示了规范化流公式如何有效地适应删失设置。我们在一个综合数据集、四个开放的医学数据集和一个常见的财务问题的例子上对所提出的方法进行了实验验证。 摘要:Survival analysis, or time-to-event modelling, is a classical statistical problem that has garnered a lot of interest for its practical use in epidemiology, demographics or actuarial sciences. Recent advances on the subject from the point of view of machine learning have been concerned with precise per-individual predictions instead of population studies, driven by the rise of individualized medicine. We introduce here a conditional normalizing flow based estimate of the time-to-event density as a way to model highly flexible and individualized conditional survival distributions. We use a novel hierarchical formulation of normalizing flows to enable efficient fitting of flexible conditional distributions without overfitting and show how the normalizing flow formulation can be efficiently adapted to the censored setting. We experimentally validate the proposed approach on a synthetic dataset as well as four open medical datasets and an example of a common financial problem.
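为说明删失数据如何进入基于流的似然,下面给出一个极简示意:用单层仿射流 T = exp(a*Z + b)(即对数正态模型)代替文中的条件归一化流,对事件观测使用变量替换后的密度项,对右删失观测使用生存函数项。示例中的参数、合成数据与网格搜索均为假设,并非论文的实现。

```python
import numpy as np
from math import erf, sqrt

def censored_loglik(t, event, a, b):
    """Log-likelihood under a one-layer affine flow T = exp(a*Z + b), Z ~ N(0,1)
    (a log-normal model): change-of-variables density for observed events,
    log-survival for right-censored observations."""
    z = (np.log(t) - b) / a                                   # inverse flow
    logpdf = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi) - np.log(a) - np.log(t)
    # survival S(t) = P(Z > z); floored to avoid log(0) far in the tail
    surv = np.array([max(0.5 * (1 - erf(zi / sqrt(2))), 1e-300) for zi in z])
    return float(np.sum(np.where(event == 1, logpdf, np.log(surv))))

rng = np.random.default_rng(7)
t_true = np.exp(0.5 * rng.standard_normal(200) + 1.0)          # true a=0.5, b=1.0
c = rng.exponential(5.0, size=200)                             # censoring times
t, event = np.minimum(t_true, c), (t_true <= c).astype(int)

# crude grid search over the two flow parameters
# (a richer conditional flow would instead be fitted by gradient ascent)
grid = [(a, b) for a in np.linspace(0.2, 1.0, 9) for b in np.linspace(0.5, 1.5, 11)]
a_hat, b_hat = max(grid, key=lambda ab: censored_loglik(t, event, *ab))
print(a_hat, b_hat)
```

删失观测只贡献 log S(t) 这一项,这正是摘要所说"归一化流公式可以有效适配删失设置"的最简单版本;对数正态只是为演示选取的假设性单步流。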

【21】 Theoretical Study and Comparison of SPSA and RDSA Algorithms 标题:SPSA和RDSA算法的理论研究与比较

作者:Yiwen Chen 备注:33 pages, 7 figures, 10 tables 链接:https://arxiv.org/abs/2107.12771 摘要:随机逼近(SA)算法广泛应用于只有噪声测量的系统优化问题。本文研究了多元Kiefer-Wolfowitz背景下的随机方向SA(RDSA)和同时扰动SA(SPSA)两类SA算法,给出了RDSA算法的偏差项、收敛性和渐近正态性。RDSA和SPSA中的梯度估计有不同的形式,因此使用不同类型的随机扰动。本文研究了RDSA和SPSA中扰动的各种有效分布,并利用由渐近分布计算的均方误差对这两种算法进行了比较。从理论和数值的角度来看,我们发现SPSA通常优于RDSA。 摘要:Stochastic approximation (SA) algorithms are widely used in system optimization problems when only noisy measurements of the system are available. This paper studies two types of SA algorithms in a multivariate Kiefer-Wolfowitz setting: random-direction SA (RDSA) and simultaneous-perturbation SA (SPSA), and then describes the bias term, convergence, and asymptotic normality of RDSA algorithms. The gradient estimations in RDSA and SPSA have different forms and, consequently, use different types of random perturbations. This paper looks at various valid distributions for perturbations in RDSA and SPSA and then compares the two algorithms using mean-square errors computed from asymptotic distribution. From both a theoretical and numerical point of view, we find that SPSA generally outperforms RDSA.
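下面用一个假设的带噪二次目标,示意两种梯度估计的形式差异:SPSA用Rademacher扰动、两次测量同除各坐标扰动,RDSA沿单一随机方向做差分后再乘以该方向。增益序列指数取文献惯用值(0.602 与 0.101),目标函数与各参数均为示意性假设。

```python
import numpy as np

def spsa_gradient(loss, theta, c, rng):
    """Simultaneous-perturbation estimate: two noisy measurements shared
    across all coordinates, divided elementwise by the perturbation."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher perturbation
    return (loss(theta + c * delta) - loss(theta - c * delta)) / (2 * c * delta)

def rdsa_gradient(loss, theta, c, rng):
    """Random-direction estimate: difference quotient along one random
    direction d, multiplied back by d."""
    d = rng.standard_normal(theta.shape)
    return (loss(theta + c * d) - loss(theta - c * d)) / (2 * c) * d

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 3.0])
noisy_loss = lambda th: np.sum((th - target) ** 2) + 0.01 * rng.standard_normal()

theta = np.zeros(3)
for k in range(1, 5001):
    a_k, c_k = 0.1 / k ** 0.602, 0.1 / k ** 0.101       # standard SA gain sequences
    theta -= a_k * spsa_gradient(noisy_loss, theta, c_k, rng)

print(np.round(theta, 2))   # should approach target
```

把循环中的 `spsa_gradient` 换成 `rdsa_gradient` 即可在同一问题上比较两种算法;论文中的均方误差比较基于渐近分布,此处仅为单次运行的直观示意。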

【22】 On the Role of Optimization in Double Descent: A Least Squares Study 标题:论优化在双重下降中的作用:一项最小二乘研究

作者:Ilja Kuzborskij,Csaba Szepesvári,Omar Rivasplata,Amal Rannen-Triki,Razvan Pascanu 机构:DeepMind, Canada, University of Alberta, Edmonton, University College London 链接:https://arxiv.org/abs/2107.12685 摘要:经验上已经观察到,随着模型尺寸的增加,深度神经网络的性能稳步提高,这与经典的过度拟合和泛化观点相矛盾。最近,有人提出双下降现象来调和这一观察结果与理论,这表明当模型变得足够过参数化时,测试误差有第二次下降,因为模型大小本身充当了一个隐式正则化器。在本文中,我们为这一领域不断增长的研究工作添砖加瓦,仔细研究了最小二乘情形下学习动态随模型大小的变化。我们给出了最小二乘目标梯度下降解的超额风险界。该界通过一个具有双下降行为的函数形式,依赖于输入特征协方差矩阵的最小非零特征值。这为文献报道的双下降曲线提供了一个新的视角。我们对超额风险的分析允许将优化和泛化误差的影响解耦。特别地,我们发现在无噪声回归的情况下,双下降仅由优化相关的量来解释,这在聚焦于Moore-Penrose伪逆解的研究中被忽略了。我们相信,我们的推导提供了一个与现有工作相比的另一种观点,揭示了这种现象的可能原因,至少在考虑的最小二乘设置中。我们从经验上探讨我们的预测是否适用于神经网络,特别是中间隐藏层激活的协方差是否与我们的推导所预测的行为相似。 摘要:Empirically it has been observed that the performance of deep neural networks steadily improves as we increase model size, contradicting the classical view on overfitting and generalization. Recently, the double descent phenomena has been proposed to reconcile this observation with theory, suggesting that the test error has a second descent when the model becomes sufficiently overparameterized, as the model size itself acts as an implicit regularizer. In this paper we add to the growing body of work in this space, providing a careful study of learning dynamics as a function of model size for the least squares scenario. We show an excess risk bound for the gradient descent solution of the least squares objective. The bound depends on the smallest non-zero eigenvalue of the covariance matrix of the input features, via a functional form that has the double descent behavior. This gives a new perspective on the double descent curves reported in the literature. Our analysis of the excess risk allows to decouple the effect of optimization and generalization error. In particular, we find that in case of noiseless regression, double descent is explained solely by optimization-related quantities, which was missed in studies focusing on the Moore-Penrose pseudoinverse solution. We believe that our derivation provides an alternative view compared to existing work, shedding some light on a possible cause of this phenomena, at least in the considered least squares setting. We empirically explore if our predictions hold for neural networks, in particular whether the covariance of intermediary hidden activations has a similar behavior as the one predicted by our derivations.
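下面的最小示意(特征数、样本量与随机种子均为假设)用最小范数最小二乘解(即从零初始化做梯度下降的极限解)复现双下降的典型形状:测试误差在插值点 d ≈ n 附近激增,随后在更强的过参数化下再次下降。

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, d_max = 40, 400, 120
w = rng.standard_normal(d_max) / np.sqrt(d_max)          # latent linear model
X_tr = rng.standard_normal((n_train, d_max))
X_te = rng.standard_normal((n_test, d_max))
y_tr, y_te = X_tr @ w, X_te @ w

test_err = {}
for d in (10, 38, 40, 42, 120):   # below, near and above the interpolation point d = n_train
    # minimum-norm least squares on the first d features: the limit of
    # gradient descent initialised at zero
    beta = np.linalg.pinv(X_tr[:, :d]) @ y_tr
    test_err[d] = float(np.mean((X_te[:, :d] @ beta - y_te) ** 2))

for d, e in test_err.items():
    print(d, round(e, 3))
```

在 d = n_train 处设计矩阵近乎奇异、最小非零奇异值趋于零,误差被放大,这与摘要中"界依赖于输入特征协方差矩阵的最小非零特征值"的机制一致;具体数值取决于随机种子。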

【23】 Detection of cybersecurity attacks through analysis of web browsing activities using principal component analysis 标题:基于主成分分析的网络浏览行为分析检测网络安全攻击

作者:Insha Ullah,Kerrie Mengersen,Rob J Hyndman,James McGree 链接:https://arxiv.org/abs/2107.12592 摘要:政府部门和金融机构等组织通过越来越多的互联网连接设备提供可访问的在线服务设施,使其运营环境容易受到网络攻击。因此,有必要建立能够及时检测网络安全攻击的机制。目前已提出多种网络入侵检测系统(NIDS),可分为基于特征码的NIDS和基于异常的NIDS。基于特征码的NIDS通过将活动特征码与已知攻击活动列表进行比对来识别误用行为,但因无法识别新攻击(以前从未见过的攻击)而受到批评。基于异常的NIDS在连接表现出与训练模型的偏差时将其判定为异常,其中的无监督学习算法由于具有识别新攻击的能力而避开了这一问题。在本研究中,我们使用一种基于主成分分析的无监督学习算法来检测网络攻击。在训练阶段,我们的方法还具有识别训练数据集中离群点的优点。在监控阶段,我们的方法首先识别受影响的维度,然后仅对受异常影响的分量进行聚合来计算异常得分。我们通过模拟和两个应用(澳大利亚网络安全中心最近发布的UNSW-NB15数据集和著名的KDD'99数据集)考察了算法的性能。该算法在训练和监控两个阶段都可扩展到大数据集,模拟和真实数据集的结果均表明,该方法在检测可疑网络活动方面很有前景。 摘要:Organizations such as government departments and financial institutions provide online service facilities accessible via an increasing number of internet connected devices which make their operational environment vulnerable to cyber attacks. Consequently, there is a need to have mechanisms in place to detect cyber security attacks in a timely manner. A variety of Network Intrusion Detection Systems (NIDS) have been proposed and can be categorized into signature-based NIDS and anomaly-based NIDS. The signature-based NIDS, which identify the misuse through scanning the activity signature against the list of known attack activities, are criticized for their inability to identify new attacks (never-before-seen attacks). Among anomaly-based NIDS, which declare a connection anomalous if it expresses deviation from a trained model, the unsupervised learning algorithms circumvent this issue since they have the ability to identify new attacks. In this study, we use an unsupervised learning algorithm based on principal component analysis to detect cyber attacks. In the training phase, our approach has the advantage of also identifying outliers in the training dataset. In the monitoring phase, our approach first identifies the affected dimensions and then calculates an anomaly score by aggregating across only those components that are affected by the anomalies. We explore the performance of the algorithm via simulations and through two applications, namely to the UNSW-NB15 dataset recently released by the Australian Centre for Cyber Security and to the well-known KDD'99 dataset. The algorithm is scalable to large datasets in both training and monitoring phases, and the results from both the simulated and real datasets show that the method has promise in detecting suspicious network activities.

【24】 Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization 标题:指针值检索:理解神经网络泛化极限的新基准

作者:Chiyuan Zhang,Maithra Raghu,Jon Kleinberg,Samy Bengio 机构:Google Research, Brain Team, Cornell University 链接:https://arxiv.org/abs/2107.12580 摘要:深度学习的成功在很大程度上依赖于神经网络对看不见的数据输出有意义的预测的能力——泛化。然而,尽管它的重要性,仍然有基本的开放性问题,如何神经网络推广。神经网络在多大程度上依赖于记忆——看到高度相似的训练实例——它们在多大程度上能够进行人类智能化的推理——识别数据背后的抽象规则?在本文中,我们介绍了一个新的基准,指针值检索(PVR)任务,探索神经网络泛化的局限性。虽然PVR任务可以由视觉输入和符号输入组成,每种输入都有不同的难度,但它们都有一个简单的基本规则。PVR任务输入的一部分充当指针,给出输入的另一部分的位置,该部分构成值(和输出)。我们证明了这种任务结构为理解泛化提供了一个丰富的测试平台,我们的实证研究表明,基于数据集大小、任务复杂性和模型结构的神经网络性能有很大的变化。位置、值和指针规则的交互作用还允许通过引入分布偏移和增加函数复杂性来开发细致入微的泛化测试。这些既揭示了微妙的失败,也揭示了令人惊讶的成功,表明了在这个基准上许多有希望的探索方向。 摘要:The successes of deep learning critically rely on the ability of neural networks to output meaningful predictions on unseen data -- generalization. Yet despite its criticality, there remain fundamental open questions on how neural networks generalize. How much do neural networks rely on memorization -- seeing highly similar training examples -- and how much are they capable of human-intelligence styled reasoning -- identifying abstract rules underlying the data? In this paper we introduce a novel benchmark, Pointer Value Retrieval (PVR) tasks, that explore the limits of neural network generalization. While PVR tasks can consist of visual as well as symbolic inputs, each with varying levels of difficulty, they all have a simple underlying rule. One part of the PVR task input acts as a pointer, giving the location of a different part of the input, which forms the value (and output). We demonstrate that this task structure provides a rich testbed for understanding generalization, with our empirical study showing large variations in neural network performance based on dataset size, task complexity and model architecture. The interaction of position, values and the pointer rule also allow the development of nuanced tests of generalization, by introducing distribution shift and increasing functional complexity. These reveal both subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.
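按摘要描述的任务结构,可以如下生成一个符号版PVR数据批次:首位数字充当指针,标签为其所指位置上的值。词表大小、序列长度与索引约定均为示意性假设,未必与论文基准的具体设定一致。

```python
import numpy as np

def make_pvr_batch(n, length=10, vocab=10, rng=None):
    """Symbolic Pointer Value Retrieval: digit at position 0 is the pointer,
    the label is the digit at the (1-indexed) position it points to."""
    rng = rng or np.random.default_rng()
    x = rng.integers(0, vocab, size=(n, length + 1))
    x[:, 0] = rng.integers(1, length + 1, size=n)   # pointer into positions 1..length
    y = x[np.arange(n), x[:, 0]]                    # value at the pointed position
    return x, y

x, y = make_pvr_batch(4, rng=np.random.default_rng(3))
for row, label in zip(x, y):
    print(row, "->", label)
```

分布偏移类测试可在此基础上构造,例如训练时不让指针取某些位置、测试时再出现这些位置;这正是摘要所说"位置、值和指针规则的交互"带来的泛化测试空间。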

【25】 Bi-Directional Grid Constrained Stochastic Processes' Link to Multi-Skew Brownian Motion 标题:双向网格约束随机过程与多斜布朗运动的联系

作者:Aldo Taranto,Ron Addie,Shahjahan Khan 机构:School of Sciences, University of Southern Queensland, Toowoomba, QLD, Australia 备注:Manuscript accepted for publication in the Journal of Applied Probability & Statistics and will appear in issue 1 of volume 17 to be published in April 2022 链接:https://arxiv.org/abs/2107.12554 摘要:双向网格约束(BGC)随机过程(BGCSP)随着路径偏离原点越远,对朝向原点的随机运动施加越来越强的约束,而不是像成熟的带反射壁的Itô扩散理论那样一次性施加反射屏障。我们发现BGCSP是多斜布朗运动(M-SBM)的一个变种而不是特例。这是因为它们有其自身的复杂性,例如屏障是隐藏的(事先未知)且不一定随时间保持恒定。我们提供了一个M-SBM理论框架和一个模拟框架来阐述BGCSP的深层性质。随后应用该模拟框架生成大量约束路径的模拟,并对结果进行了分析。BGCSP在金融以及其他许多需要从初始位置上下两侧进行渐进式约束的领域都有应用。 摘要:Bi-Directional Grid Constrained (BGC) stochastic processes (BGCSPs) constrain the random movement toward the origin steadily more and more, the further they deviate from the origin, rather than all at once imposing reflective barriers, as does the well-established theory of Itô diffusions with such reflective barriers. We identify that BGCSPs are a variant rather than a special case of the multi-skew Brownian motion (M-SBM). This is because they have their own complexities, such as the barriers being hidden (not known in advance) and not necessarily constant over time. We provide an M-SBM theoretical framework and also a simulation framework to elaborate deeper properties of BGCSPs. The simulation framework is then applied by generating numerous simulations of the constrained paths and the results are analysed. BGCSPs have applications in finance and indeed many other fields requiring graduated constraining, from both above and below the initial position.
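下面用Euler-Maruyama给出一个"渐进式约束"扩散的最小示意:向原点的拉力随偏离程度稳步增强(此处取三次方,系数为假设),而不是一次性的反射壁。这只是说明性草图,并非论文中带隐藏壁垒的BGCSP构造。

```python
import numpy as np

def bgc_path(T=1.0, n=10_000, sigma=1.0, lam=4.0, seed=4):
    """Euler-Maruyama sketch of a graduated-constraining diffusion: the
    pull toward the origin grows with the deviation (here cubically),
    unlike a hard reflective barrier imposed all at once."""
    rng = np.random.default_rng(seed)
    dt = T / n
    x = np.empty(n + 1)
    x[0] = 0.0
    for i in range(n):
        drift = -lam * x[i] ** 3                 # steadily stronger pull toward 0
        x[i + 1] = x[i] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

path = bgc_path()
print(np.max(np.abs(path)))   # stays moderate despite unit-volatility noise
```

与带反射壁的扩散不同,这里的路径在任何水平上都可能被穿越,只是越远越难;约束函数的具体形式(三次方)纯属演示用的假设。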

【26】 Restricted Boltzmann Machine and Deep Belief Network: Tutorial and Survey 标题:受限玻尔兹曼机与深度信念网络:教程与综述

作者:Benyamin Ghojogh,Ali Ghodsi,Fakhri Karray,Mark Crowley 机构:Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada, Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science 备注:To appear as a part of an upcoming textbook on dimensionality reduction and manifold learning 链接:https://arxiv.org/abs/2107.12521 摘要:这是一篇关于Boltzmann机器(BM)、受限Boltzmann机器(RBM)和深度信念网络(DBN)的教程和调查论文。我们从概率图形模型,马尔可夫随机场,吉布斯抽样,统计物理,伊辛模型和霍普菲尔德网络的背景知识开始。然后介绍了BM和RBM的结构。解释了显隐变量的条件分布、RBM中Gibbs抽样生成变量、最大似然估计训练BM和RBM以及对比散度。然后,我们讨论了变量可能的离散分布和连续分布。我们介绍了条件RBM及其训练方法。最后,我们将深层信念网络解释为一系列RBM模型。这篇关于玻耳兹曼机器的论文可以应用于数据科学、统计学、神经计算和统计物理学等各个领域。 摘要:This is a tutorial and survey paper on Boltzmann Machine (BM), Restricted Boltzmann Machine (RBM), and Deep Belief Network (DBN). We start with the required background on probabilistic graphical models, Markov random field, Gibbs sampling, statistical physics, Ising model, and the Hopfield network. Then, we introduce the structures of BM and RBM. The conditional distributions of visible and hidden variables, Gibbs sampling in RBM for generating variables, training BM and RBM by maximum likelihood estimation, and contrastive divergence are explained. Then, we discuss different possible discrete and continuous distributions for the variables. We introduce conditional RBM and how it is trained. Finally, we explain deep belief network as a stack of RBM models. This paper on Boltzmann machines can be useful in various fields including data science, statistics, neural computation, and statistical physics.
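下面是伯努利RBM用CD-1(对比散度)训练的最小可运行草图:正相取数据统计量,负相取一步Gibbs重构的统计量,对应教程中"条件分布 + Gibbs采样 + 对比散度"的组合。玩具数据、隐单元数与学习率均为假设。

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy binary data: two prototypes plus 5% bit-flip noise
proto = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = proto[rng.integers(0, 2, 200)]
flips = rng.random(data.shape) < 0.05
data[flips] = 1 - data[flips]

nv, nh, lr = 6, 3, 0.1
W = 0.01 * rng.standard_normal((nv, nh))
b, c = np.zeros(nv), np.zeros(nh)

errors = []
for epoch in range(2000):
    v0 = data
    ph0 = sigmoid(v0 @ W + c)                      # positive phase: p(h=1|v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                    # one Gibbs step back to v
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)                      # negative phase: p(h=1|v1)
    # CD-1 update: data statistics minus one-step reconstruction statistics
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(data)
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    errors.append(float(np.mean((pv1 - v0) ** 2)))

print(errors[0], "->", errors[-1])   # reconstruction error should drop
```

将训练好的RBM逐层堆叠、再做贪心逐层预训练,即得到教程末尾所述的深度信念网络;此处仅演示单层训练。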

【27】 A Unifying Framework for Testing Shape Restrictions 标题:一种用于形状约束测试的统一框架

作者:Zheng Fang 机构:Department of Economics, Texas A&M University 链接:https://arxiv.org/abs/2107.12494 摘要:本文的计量经济学贡献如下。首先,我们开发了一个基于Wald原理的检验形状限制的统一框架。其次,我们考察了一些重要的形状强制算子在实现我们的检验中的适用性和有用性,包括重排(rearrangement)和最大凸劣化(greatest convex minorization,或最小凹优化)。特别地,有影响力的重排算子由于缺乏凸性而不适用,而最大凸劣化被证明具有使用我们的框架所需的分析性质。凸性对于建立检验水平(size)控制的重要性已在文献的其他地方被指出。第三,我们证明,尽管投影算子在一般的非Hilbert参数空间(例如由一致范数定义的空间)中可能没有良好的定义或性质,我们仍然可以通过应用我们的框架设计一个有功效的基于距离的检验。通过Monte Carlo模拟评估了我们检验的有限样本性能,并通过考察高端劳动力市场中每周工作时间与年度工资增长之间的关系展示了其经验相关性。 摘要:This paper makes the following econometric contributions. First, we develop a unifying framework for testing shape restrictions based on the Wald principle. Second, we examine the applicability and usefulness of some prominent shape enforcing operators in implementing our test, including rearrangement and the greatest convex minorization (or the least concave majorization). In particular, the influential rearrangement operator is inapplicable due to a lack of convexity, while the greatest convex minorization is shown to enjoy the analytic properties required to employ our framework. The importance of convexity in establishing size control has been noted elsewhere in the literature. Third, we show that, despite that the projection operator may not be well-defined/behaved in general non-Hilbert parameter spaces (e.g., ones defined by uniform norms), one may nonetheless devise a powerful distance-based test by applying our framework. The finite sample performance of our test is evaluated through Monte Carlo simulations, and its empirical relevance is showcased by investigating the relationship between weekly working hours and the annual wage growth in the high-end labor market.
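文中用到的最大凸劣函数(greatest convex minorant)可以通过对散点做一遍下凸包扫描得到。下面是一个假设性的numpy草图(假定横坐标互不相同),其输出在数据点处不超过数据,且分段斜率单调不减。

```python
import numpy as np

def greatest_convex_minorant(x, y):
    """Greatest convex minorant of the points (x_i, y_i), evaluated at x:
    the lower convex hull, found with a monotone-chain scan (distinct x)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    hull = []                                       # indices on the lower hull
    for i in range(len(xs)):
        while len(hull) >= 2:
            x1, y1 = xs[hull[-2]], ys[hull[-2]]
            x2, y2 = xs[hull[-1]], ys[hull[-1]]
            # pop the middle point if it lies on or above the chord to point i
            if (y2 - y1) * (xs[i] - x1) >= (ys[i] - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(i)
    gcm = np.interp(xs, xs[hull], ys[hull])         # piecewise-linear minorant
    out = np.empty_like(gcm)
    out[order] = gcm
    return out

x = np.linspace(0, 1, 9)
y = x ** 2 + np.array([0, .3, 0, .2, 0, .4, 0, .1, 0])   # convex signal + bumps
g = greatest_convex_minorant(x, y)
print(np.all(g <= y + 1e-12))   # True: the minorant never exceeds the data
```

将非约束估计量经此算子"凸化"后再构造Wald型距离统计量,正是摘要所述框架的一个组成步骤;具体检验的临界值构造见论文。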

【28】 Robustness and sensitivity analyses for rough Volterra stochastic volatility models 标题:粗糙Volterra随机波动模型的稳健性和灵敏度分析

作者:Jan Matas,Jan Pospíšil 机构:NTIS - New Technologies for the Information Society, University of West Bohemia, Univerzitní ,, Plzeň, Czech Republic 链接:https://arxiv.org/abs/2107.12462 摘要:本文对几种连续时间粗糙Volterra随机波动模型进行了关于市场校准过程的鲁棒性和敏感性分析。稳健性是指对期权数据结构变化的敏感性。后一种分析应验证波动过程动力学中粗糙度重要性的假设。对苹果公司2015年4月和5月四个不同交易日的股票期权交易数据集进行了实证研究,特别给出了RFSV、rBergomi和aRFSV模型的结果。 摘要:In this paper we perform robustness and sensitivity analysis of several continuous-time rough Volterra stochastic volatility models with respect to the process of market calibration. Robustness is understood in the sense of sensitivity to changes in the option data structure. The latter analyses then should validate the hypothesis on importance of the roughness in the volatility process dynamics. Empirical study is performed on a data set of Apple Inc. equity options traded in four different days in April and May 2015. In particular, the results for RFSV, rBergomi and aRFSV models are provided.
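为直观起见,下面用左端点Riemann和粗略离散Riemann-Liouville核过程 X_t = ∫_0^t (t-s)^(H-1/2) dW_s,它是rBergomi等粗糙波动率模型的驱动项(H < 1/2 时路径粗糙)。这是示意性的朴素格式,并非校准实践中常用的高精度hybrid scheme,参数均为假设。

```python
import numpy as np

def rough_volterra(H=0.1, T=1.0, n=500, seed=6):
    """Crude left-point discretisation of the Riemann-Liouville process
    X_t = int_0^t (t - s)^(H - 1/2) dW_s; the singular kernel is avoided
    at s = t by using left endpoints only."""
    rng = np.random.default_rng(seed)
    dt = T / n
    dW = np.sqrt(dt) * rng.standard_normal(n)
    t = dt * np.arange(1, n + 1)
    X = np.empty(n)
    for i in range(n):
        s = dt * np.arange(i + 1)                 # left endpoints s_j < t_i
        X[i] = np.sum((t[i] - s) ** (H - 0.5) * dW[: i + 1])
    return t, X

t, X = rough_volterra()
# in rBergomi-type models the variance process would then be built as
# v_t proportional to exp(eta * X_t minus a deterministic correction)
print(X[:3])
```

改变 H 即可直观看到路径粗糙度的变化,这正是文中稳健性与敏感性分析所关注的参数之一。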

【29】 Debiasing In-Sample Policy Performance for Small-Data, Large-Scale Optimization 标题:针对小数据、大规模优化消除样本内策略性能偏差

作者:Vishal Gupta,Michael Huang,Paat Rusmevichientong 机构:Data Science and Operations, USC Marshall School of Business, Los Angeles, CA 链接:https://arxiv.org/abs/2107.12438 摘要:鉴于交叉验证在数据匮乏环境下表现不佳,我们提出了一种新的数据驱动优化中策略样本外性能的估计方法。该方法利用优化问题的灵敏度分析来估计最优目标值相对于数据中噪声量的梯度,并利用估计出的梯度对策略的样本内性能进行去偏。与交叉验证技术不同,我们的方法避免了为测试集牺牲数据,训练时利用全部数据,因此非常适合数据稀缺的环境。对于目标为不确定线性函数、但可行域已知(可能非凸)的优化问题,我们证明了估计量的偏差和方差的界。对于可行域在某种意义上"弱耦合"的更特殊的优化问题,我们证明了更强的结果。具体地说,我们给出了估计误差的显式高概率界,该界在整个策略类上一致成立,并依赖于问题的维数和策略类的复杂度。我们的界表明,在温和的条件下,即使可用数据量保持较小且恒定,估计误差也会随着优化问题维数的增加而消失。换言之,我们证明了我们的估计量在小数据、大规模情形下表现良好。最后,我们通过一个使用真实数据调度紧急医疗响应服务的案例研究,将我们提出的方法与最新方法进行了数值比较。我们的方法提供了对样本外性能更准确的估计,并学到了性能更好的策略。 摘要:Motivated by the poor performance of cross-validation in settings where data are scarce, we propose a novel estimator of the out-of-sample performance of a policy in data-driven optimization. Our approach exploits the optimization problem's sensitivity analysis to estimate the gradient of the optimal objective value with respect to the amount of noise in the data and uses the estimated gradient to debias the policy's in-sample performance. Unlike cross-validation techniques, our approach avoids sacrificing data for a test set, utilizes all data when training and, hence, is well-suited to settings where data are scarce. We prove bounds on the bias and variance of our estimator for optimization problems with uncertain linear objectives but known, potentially non-convex, feasible regions. For more specialized optimization problems where the feasible region is "weakly-coupled" in a certain sense, we prove stronger results. Specifically, we provide explicit high-probability bounds on the error of our estimator that hold uniformly over a policy class and depend on the problem's dimension and policy class's complexity. Our bounds show that under mild conditions, the error of our estimator vanishes as the dimension of the optimization problem grows, even if the amount of available data remains small and constant. Said differently, we prove our estimator performs well in the small-data, large-scale regime. Finally, we numerically compare our proposed method to state-of-the-art approaches through a case-study on dispatching emergency medical response services using real data. Our method provides more accurate estimates of out-of-sample performance and learns better-performing policies.

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2021-07-28,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 arXiv每日学术速递 微信公众号。
