
Statistics Academic Digest [6.28]

Author: 公众号-arXiv每日学术速递 (WeChat official account "arXiv Daily Academic Digest")
Published: 2021-07-02 17:23:32

Visit www.arxivdaily.com for the full digest with abstracts, covering CS | Physics | Mathematics | Economics | Statistics | Finance | Biology | Electrical Engineering, with search, bookmarking, posting, and more!

stat (Statistics): 40 papers in total

【1】 Active Learning with Multifidelity Modeling for Efficient Rare Event Simulation

Authors: S. L. N. Dhulipala, M. D. Shields, B. W. Spencer, C. Bolisetti, A. E. Slaughter, V. M. Laboure, P. Chakroborty
Affiliations: Department of Civil and Systems Engineering, Johns Hopkins University
Link: https://arxiv.org/abs/2106.13790
Abstract: While multifidelity modeling provides a cost-effective way to conduct uncertainty quantification with computationally expensive models, much greater efficiency can be achieved by adaptively deciding the number of required high-fidelity (HF) simulations, depending on the type and complexity of the problem and the desired accuracy in the results. We propose a framework for active learning with multifidelity modeling emphasizing the efficient estimation of rare events. Our framework works by fusing a low-fidelity (LF) prediction with an HF-inferred correction, filtering the corrected LF prediction to decide whether to call the high-fidelity model, and for enhanced subsequent accuracy, adapting the correction for the LF prediction after every HF model call. The framework does not make any assumptions as to the LF model type or its correlations with the HF model. In addition, for improved robustness when estimating smaller failure probabilities, we propose using dynamic active learning functions that decide when to call the HF model. We demonstrate our framework using several academic case studies and two finite element (FE) model case studies: estimating Navier-Stokes velocities using the Stokes approximation and estimating stresses in a transversely isotropic model subjected to displacements via a coarsely meshed isotropic model. Across these case studies, not only did the proposed framework estimate the failure probabilities accurately, but compared with either Monte Carlo or a standard variance reduction method, it also required only a small fraction of the calls to the HF model.

【2】 Tighter Analysis of Alternating Stochastic Gradient Method for Stochastic Nested Problems

Authors: Tianyi Chen, Yuejiao Sun, Wotao Yin
Affiliations: Rensselaer Polytechnic Institute; University of California, Los Angeles
Note: Submitted for publication
Link: https://arxiv.org/abs/2106.13781
Abstract: Stochastic nested optimization, including stochastic compositional, min-max and bilevel optimization, is gaining popularity in many machine learning applications. While the three problems share the nested structure, existing works often treat them separately, and thus develop problem-specific algorithms and their analyses. Among various exciting developments, simple SGD-type updates (potentially on multiple variables) are still prevalent in solving this class of nested problems, but they are believed to have slower convergence rate compared to that of the non-nested problems. This paper unifies several SGD-type updates for stochastic nested problems into a single SGD approach that we term ALternating Stochastic gradient dEscenT (ALSET) method. By leveraging the hidden smoothness of the problem, this paper presents a tighter analysis of ALSET for stochastic nested problems. Under the new analysis, to achieve an $\epsilon$-stationary point of the nested problem, it requires ${\cal O}(\epsilon^{-2})$ samples. Under certain regularity conditions, applying our results to stochastic compositional, min-max and reinforcement learning problems either improves or matches the best-known sample complexity in the respective cases. Our results explain why simple SGD-type algorithms in stochastic nested problems all work very well in practice without the need for further modifications.
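To make the flavor of such SGD-type alternating updates concrete, here is a minimal sketch (ours, not the authors' ALSET code) for a toy stochastic compositional problem min_x f(E[g_w(x)]): a tracking variable follows the inner expectation while the outer variable takes stochastic gradient steps. Model, step sizes, and constants are illustrative assumptions.

```python
# Toy alternating SGD for min_x f(E[g_w(x)]) with f(y) = 0.5*||y||^2 and
# g_w(x) = A_w x + b_w, E[A_w] = I, E[b_w] = 0 (so the minimizer is x = 0).
# Illustrative sketch only, not the authors' ALSET implementation.
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)        # outer variable
y = np.zeros(d)               # tracking estimate of the inner mean g(x)
alpha, beta = 0.01, 0.1       # outer / inner step sizes (illustrative)

for k in range(20000):
    A_w = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # stochastic sample
    b_w = 0.1 * rng.normal(size=d)
    y = (1 - beta) * y + beta * (A_w @ x + b_w)      # inner (tracking) step
    x -= alpha * (A_w.T @ y)                         # outer SGD step
print("||x||:", np.linalg.norm(x))                   # should be small
```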

【3】 Parameter Estimation for the McKean-Vlasov Stochastic Differential Equation

Authors: Louis Sharrock, Nikolas Kantas, Panos Parpas, Grigorios A. Pavliotis
Affiliations: Department of Mathematics
Link: https://arxiv.org/abs/2106.13751
Abstract: In this paper, we consider the problem of parameter estimation for a stochastic McKean-Vlasov equation, and the associated system of weakly interacting particles. We first establish consistency and asymptotic normality of the offline maximum likelihood estimator for the interacting particle system in the limit as the number of particles $N\rightarrow\infty$. We then propose an online estimator for the parameters of the McKean-Vlasov SDE, which evolves according to a continuous-time stochastic gradient descent algorithm on the asymptotic log-likelihood of the interacting particle system. We prove that this estimator converges in $\mathbb{L}^1$ to the stationary points of the asymptotic log-likelihood of the McKean-Vlasov SDE in the joint limit as $N\rightarrow\infty$ and $t\rightarrow\infty$, under suitable assumptions which guarantee ergodicity and uniform-in-time propagation of chaos. We then demonstrate, under the additional assumption of global strong concavity, that our estimator converges in $\mathbb{L}^2$ to the unique maximiser of this asymptotic log-likelihood function, and establish an $\mathbb{L}^2$ convergence rate. We also obtain analogous results under the assumption that, rather than observing multiple trajectories of the interacting particle system, we instead observe multiple independent replicates of the McKean-Vlasov SDE itself or, less realistically, a single sample path of the McKean-Vlasov SDE and its law. Our theoretical results are demonstrated via two numerical examples, a linear mean field model and a stochastic opinion dynamics model.
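A rough illustration of the online estimator's idea, under strong simplifications: recursive maximum likelihood for the drift parameter of a linear mean-field model, using an Euler-Maruyama discretisation of the Girsanov log-likelihood gradient. The model, step-size schedule, and constants are assumptions made for the sketch, not taken from the paper.

```python
# Euler-Maruyama simulation of dX^i = -theta*(X^i - mean(X)) dt + sigma dW^i
# with an online (recursive MLE) update for theta driven by the gradient of
# the log-likelihood increment. All tuning constants are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, dt, sigma, theta_true = 200, 1e-2, 0.5, 1.5
X = rng.normal(size=N)
theta = 0.1                                    # initial guess
for t in range(100_000):
    m = X.mean()
    dX = -theta_true * (X - m) * dt + sigma * np.sqrt(dt) * rng.normal(size=N)
    b = -theta * (X - m)                       # drift under current theta
    grad = np.sum(-(X - m) * (dX - b * dt)) / sigma**2   # d/dtheta log-lik
    theta += (10.0 / (1.0 + t * dt)) * grad / N          # decaying step
    X = X + dX
print("estimated theta:", theta)               # should end near 1.5
```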

【4】 InteL-VAEs: Adding Inductive Biases to Variational Auto-Encoders via Intermediary Latents

Authors: Ning Miao, Emile Mathieu, N. Siddharth, Yee Whye Teh, Tom Rainforth
Affiliations: Department of Statistics, University of Oxford; University of Edinburgh and the Alan Turing Institute
Link: https://arxiv.org/abs/2106.13746
Abstract: We introduce a simple and effective method for learning VAEs with controllable inductive biases by using an intermediary set of latent variables. This allows us to overcome the limitations of the standard Gaussian prior assumption. In particular, it allows us to impose desired properties like sparsity or clustering on learned representations, and incorporate prior information into the learned model. Our approach, which we refer to as the Intermediary Latent Space VAE (InteL-VAE), is based around controlling the stochasticity of the encoding process with the intermediary latent variables, before deterministically mapping them forward to our target latent representation, from which reconstruction is performed. This allows us to maintain all the advantages of the traditional VAE framework, while incorporating desired prior information, inductive biases, and even topological information through the latent mapping. We show that this, in turn, allows InteL-VAEs to learn both better generative models and representations.

【5】 Accelerated Computation of a High Dimensional Kolmogorov-Smirnov Distance

Authors: Alex Hagen, Shane Jackson, James Kahn, Jan Strube, Isabel Haide, Karl Pazdernik, Connor Hainje
Affiliations: Pacific Northwest National Laboratory, Richland, WA, USA; Karlsruhe Institute of Technology, Karlsruhe, Germany
Note: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence
Link: https://arxiv.org/abs/2106.13706
Abstract: Statistical testing is widespread and critical for a variety of scientific disciplines. The advent of machine learning and the increase of computing power has increased the interest in the analysis and statistical testing of multidimensional data. We extend the powerful Kolmogorov-Smirnov two sample test to a high dimensional form in a similar manner to Fasano (Fasano, 1987). We call our result the d-dimensional Kolmogorov-Smirnov test (ddKS) and provide three novel contributions therewith: we develop an analytical equation for the significance of a given ddKS score, we provide an algorithm for computation of ddKS on modern computing hardware that is of constant time complexity for small sample sizes and dimensions, and we provide two approximate calculations of ddKS: one that reduces the time complexity to linear at larger sample sizes, and another that reduces the time complexity to linear with increasing dimension. We perform power analysis of ddKS and its approximations on a corpus of datasets and compare to other common high dimensional two sample tests and distances: Hotelling's T^2 test and Kullback-Leibler divergence. Our ddKS test performs well for all datasets, dimensions, and sizes tested, whereas the other tests and distances fail to reject the null hypothesis on at least one dataset. We therefore conclude that ddKS is a powerful multidimensional two sample test for general use, and can be calculated in a fast and efficient manner using our parallel or approximate methods. Open source implementations of all methods described in this work are located at https://github.com/pnnl/ddks.
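As a point of reference for what ddKS computes, here is a naive from-scratch implementation of a Fasano-style d-dimensional two-sample KS statistic; it is quadratic in the sample size and is not the authors' optimized ddks package (see the GitHub link above for that).

```python
# Naive d-dimensional two-sample KS statistic in the spirit of Fasano (1987):
# for every sample point taken as an origin, compare the fractions of each
# sample falling into each of the 2^d orthants. O(n^2 * 2^d * d).
import numpy as np

def ks_d(a, b):
    d = a.shape[1]
    signs = np.array(np.meshgrid(*[[-1, 1]] * d)).T.reshape(-1, d)
    stat = 0.0
    for origin in np.vstack([a, b]):
        for s in signs:                     # one orthant per sign pattern
            frac_a = np.all((a - origin) * s > 0, axis=1).mean()
            frac_b = np.all((b - origin) * s > 0, axis=1).mean()
            stat = max(stat, abs(frac_a - frac_b))
    return stat

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = rng.normal(loc=0.5, size=(100, 2))
print(ks_d(x, rng.normal(size=(100, 2))), ks_d(x, y))  # null vs shifted
```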

【6】 Posterior Covariance Information Criterion

Authors: Yukito Iba, Keisuke Yano
Link: https://arxiv.org/abs/2106.13694
Abstract: We introduce an information criterion, PCIC, for predictive evaluation based on quasi-posterior distributions. It is regarded as a natural generalization of widely applicable information criterion (WAIC) and can be computed via a single Markov Chain Monte Carlo run. PCIC is useful in a variety of predictive settings that are not well dealt with in WAIC, including weighted likelihood inference and quasi-Bayesian prediction.
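Since PCIC generalises WAIC, a short sketch of the standard WAIC computation from posterior draws may help fix ideas; the toy model and draw counts below are made up.

```python
# Standard WAIC from an S x n matrix of pointwise log-likelihoods
# log p(y_i | theta_s) evaluated at S posterior draws; toy normal model.
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def waic(loglik):                          # loglik: shape (S, n)
    S = loglik.shape[0]
    lppd = np.sum(logsumexp(loglik, axis=0) - np.log(S))
    p_waic = np.sum(np.var(loglik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

rng = np.random.default_rng(0)
y = rng.normal(size=50)
draws = rng.normal(scale=0.2, size=300)    # mock posterior draws of the mean
loglik = stats.norm.logpdf(y[None, :], loc=draws[:, None])
print("WAIC:", waic(loglik))
```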

【7】 Feature Grouping and Sparse Principal Component Analysis

Authors: Haiyan Jiang, Shanshan Qin, Dejing Dou
Affiliations: Baidu Research, Baidu Inc., China; School of Statistics, Tianjin University of Finance and Economics, Tianjin, China
Note: 21 pages, 5 figures, 2 tables
Link: https://arxiv.org/abs/2106.13685
Abstract: Sparse Principal Component Analysis (SPCA) is widely used in data processing and dimension reduction; it uses the lasso to produce modified principal components with sparse loadings for better interpretability. However, sparse PCA never considers an additional grouping structure where the loadings share similar coefficients (i.e., feature grouping), besides a special group with all coefficients being zero (i.e., feature selection). In this paper, we propose a novel method called Feature Grouping and Sparse Principal Component Analysis (FGSPCA) which allows the loadings to belong to disjoint homogeneous groups, with sparsity as a special case. The proposed FGSPCA is a subspace learning method designed to simultaneously perform grouping pursuit and feature selection, by imposing a non-convex regularization with naturally adjustable sparsity and grouping effect. To solve the resulting non-convex optimization problem, we propose an alternating algorithm that incorporates the difference-of-convex programming, augmented Lagrange and coordinate descent methods. Additionally, the experimental results on real data sets show that the proposed FGSPCA benefits from the grouping effect compared with methods without grouping effect.

【8】 Prediction of Hereditary Cancers Using Neural Networks

Authors: Zoe Guan, Giovanni Parmigiani, Danielle Braun, Lorenzo Trippa
Link: https://arxiv.org/abs/2106.13682
Abstract: Family history is a major risk factor for many types of cancer. Mendelian risk prediction models translate family histories into cancer risk predictions based on knowledge of cancer susceptibility genes. These models are widely used in clinical practice to help identify high-risk individuals. Mendelian models leverage the entire family history, but they rely on many assumptions about cancer susceptibility genes that are either unrealistic or challenging to validate due to low mutation prevalence. Training more flexible models, such as neural networks, on large databases of pedigrees can potentially lead to accuracy gains. In this paper, we develop a framework to apply neural networks to family history data and investigate their ability to learn inherited susceptibility to cancer. While there is an extensive literature on neural networks and their state-of-the-art performance in many tasks, there is little work applying them to family history data. We propose adaptations of fully-connected neural networks and convolutional neural networks to pedigrees. In data simulated under Mendelian inheritance, we demonstrate that our proposed neural network models are able to achieve nearly optimal prediction performance. Moreover, when the observed family history includes misreported cancer diagnoses, neural networks are able to outperform the Mendelian BRCAPRO model embedding the correct inheritance laws. Using a large dataset of over 200,000 family histories, the Risk Service cohort, we train prediction models for future risk of breast cancer. We validate the models using data from the Cancer Genetics Network.

【9】 Uncertainty-aware Validation Benchmarks for Coupling Free Flow and Porous-Medium Flow

Authors: Farid Mohammadi, Elissa Eggenweiler, Bernd Flemisch, Sergey Oladyshkin, Iryna Rybak, Martin Schneider, Kilian Weishaupt
Affiliations: University of Stuttgart, Institute for Modelling Hydraulic and Environmental Systems; University of Stuttgart, Institute of Applied Analysis and Numerical Simulation
Link: https://arxiv.org/abs/2106.13639
Abstract: A correct choice of interface conditions and useful model parameters for coupled free-flow and porous-medium systems is vital for physically consistent modeling and accurate numerical simulations of applications. We consider the Stokes-Darcy problem with different models for the porous-medium compartment and corresponding coupling strategies: the standard averaged model based on Darcy's law with classical or generalized interface conditions, as well as the pore-network model. We study the coupled flow problems' behaviors considering a benchmark case where a pore-scale resolved model provides the reference solution and quantify the uncertainties in the models' parameters and the reference data. To achieve this, we apply a statistical framework that incorporates a probabilistic modeling technique using a fully Bayesian approach. A Bayesian perspective on a validation task yields an optimal bias-variance trade-off against the reference data. It provides an integrative metric for model validation that incorporates parameter and conceptual uncertainty. Additionally, a model reduction technique, namely Bayesian Sparse Polynomial Chaos Expansion, is employed to accelerate the calibration and validation processes for computationally demanding Stokes-Darcy models with different coupling strategies. We perform uncertainty-aware validation, demonstrate each model's predictive capabilities, and make a model comparison using a Bayesian validation metric.

【10】 Multi-scale Poisson process approaches for differential expression analysis of high-throughput sequencing data

Authors: Heejung Shim, Zhengrong Xing, Ester Pantaleo, Francesca Luca, Roger Pique-Regi, Matthew Stephens
Affiliations: School of Mathematics and Statistics and Melbourne Integrative Genomics, University of Melbourne; Department of Statistics, University of Chicago, Chicago, IL, USA; Department of Obstetrics and Gynecology and Center for Molecular Medicine
Note: 24 pages, 4 figures
Link: https://arxiv.org/abs/2106.13634
Abstract: Estimating and testing for differences in molecular phenotypes (e.g. gene expression, chromatin accessibility, transcription factor binding) across conditions is an important part of understanding the molecular basis of gene regulation. These phenotypes are commonly measured using high-throughput sequencing assays (e.g., RNA-seq, ATAC-seq, ChIP-seq), which provide high-resolution count data that reflect how the phenotypes vary along the genome. Multiple methods have been proposed to help exploit these high-resolution measurements for differential expression analysis. However, they ignore the count nature of the data, instead using normal approximations that work well only for data with large sample sizes or high counts. Here we develop count-based methods to address this problem. We model the data for each sample using an inhomogeneous Poisson process with spatially structured underlying intensity function, and then, building on multi-scale models for the Poisson process, estimate and test for differences in the underlying intensity function across samples (or groups of samples). Using both simulation and real ATAC-seq data we show that our method outperforms previous normal-based methods, especially in situations with small sample sizes or low counts.

【11】 Graph model selection by edge probability sequential inference

Authors: Louis Duvivier, Rémy Cazabet, Céline Robardet
Link: https://arxiv.org/abs/2106.13579
Abstract: Graphs are widely used for describing systems made up of many interacting components and for understanding the structure of their interactions. Various statistical models exist, which describe this structure as the result of a combination of constraints and randomness. Model selection techniques need to automatically identify the best model, and the best set of parameters, for a given graph. To do so, most authors rely on the minimum description length paradigm, and apply it to graphs by considering the entropy of probability distributions defined on graph ensembles. In this paper, we introduce edge probability sequential inference, a new approach to perform model selection, which relies on probability distributions on edge ensembles. From a theoretical point of view, we show that this methodology provides a more consistent ground for statistical inference with respect to existing techniques, due to the fact that it relies on multiple realizations of the random variable. It also provides better guarantees against overfitting, by making it possible to lower the number of parameters of the model below the number of observations. Experimentally, we illustrate the benefits of this methodology in two situations: to infer the partition of a stochastic blockmodel, and to identify the most relevant model for a given graph between the stochastic blockmodel and the configuration model.

【12】 Robust Real-Time Delay Predictions in a Network of High-Frequency Urban Buses

Authors: Hector Rodriguez-Deniz, Mattias Villani
Note: Villani is also at STIMA (Linköping University)
Link: https://arxiv.org/abs/2106.13576
Abstract: Providing transport users and operators with accurate forecasts on travel times is challenging due to a highly stochastic traffic environment. Public transport users are particularly sensitive to unexpected waiting times, which negatively affect their perception on the system's reliability. In this paper we develop a robust model for real-time bus travel time prediction that departs from Gaussian assumptions by using Student-$t$ errors. The proposed approach uses spatiotemporal characteristics from the route and previous bus trips to model short-term effects, and date/time variables and Gaussian processes for long-run forecasts. The model allows for flexible modeling of mean, variance and kurtosis spaces. We propose algorithms for Bayesian inference and for computing probabilistic forecast distributions. Experiments are performed using data from high-frequency buses in Stockholm, Sweden. Results show that Student-$t$ models outperform Gaussian ones in terms of log-posterior predictive power to forecast bus delays at specific stops, which reveals the importance of accounting for predictive uncertainty in model selection. Estimated Student-$t$ regressions capture typical temporal variability between within-day hours and different weekdays. Strong spatiotemporal effects are detected for incoming buses from immediately previous stops, which is in line with many recently developed models. We finally show how Bayesian inference naturally allows for predictive uncertainty quantification, e.g. by returning the predictive probability that the delay of an incoming bus exceeds a given threshold.
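A stripped-down illustration of why Student-t errors help: maximum-likelihood linear regression with t errors versus OLS on heavy-tailed data. The fixed degrees of freedom and the optimiser are our assumptions; the paper's model is Bayesian and far richer.

```python
# Linear regression with Student-t errors by maximum likelihood vs OLS on
# heavy-tailed data; fixed df and optimiser are illustrative choices.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-2, 2, n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=2, size=n)

def negloglik(params):                     # params = (beta0, beta1, log s)
    beta, log_s = params[:2], params[2]
    z = (y - X @ beta) / np.exp(log_s)
    return -np.sum(stats.t.logpdf(z, df=4) - log_s)

fit = optimize.minimize(negloglik, x0=np.zeros(3), method="Nelder-Mead")
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("Student-t MLE:", fit.x[:2], "OLS:", beta_ols)
```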

【13】 Extreme event propagation using counterfactual theory and vine copulas

Authors: Valentin Courgeau, Almut E. D. Veraart
Link: https://arxiv.org/abs/2106.13564
Abstract: Understanding multivariate extreme events plays a crucial role in managing the risks of complex systems since extremes are governed by their own mechanisms. Conditional on a given variable exceeding a high threshold (e.g., traffic intensity), knowing which high-impact quantities (e.g., air pollutant levels) are the most likely to be extreme in the future is key. This article investigates the contribution of marginal extreme events on future extreme events of related quantities. We propose an Extreme Event Propagation framework to maximise counterfactual causation probabilities between a known cause and future high-impact quantities. Extreme value theory provides a tool for modelling upper tails whilst vine copulas are a flexible device for capturing a large variety of joint extremal behaviours. We optimise for the probabilities of causation and apply our framework to a London road traffic and air pollutants dataset. We replicate documented atmospheric mechanisms beyond linear relationships. This provides a new tool for quantifying the propagation of extremes in a large variety of applications.

【14】 Optimal Accelerated Degradation Testing Based on Bivariate Gamma Process with Dependent Components

Authors: Helmi Shat, Norbert Gaffke
Affiliations: Institute for Mathematical Stochastics, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
Link: https://arxiv.org/abs/2106.13540
Abstract: Accelerated degradation testing (ADT) is one of the major approaches in reliability engineering which allows accurate estimation of reliability characteristics of highly reliable systems within a relatively short time. The testing data are extrapolated through a physically reasonable statistical model to obtain estimates of lifetime quantiles at normal use conditions. The Gamma process is a natural model for degradation, which exhibits a monotone and strictly increasing degradation path. In this work, optimal experimental designs are derived for ADT with two response components. We consider the situations of independent as well as dependent marginal responses where the observational times are assumed to be fixed and known. The marginal degradation paths are assumed to follow a Gamma process where a copula function is utilized to express the dependence between both components. For the case of independent response components the optimal design minimizes the asymptotic variance of an estimated quantile of the failure time distribution at the normal use conditions. For the case of dependent response components the $D$-criterion is adopted to derive $D$-optimal designs. Further, $D$- and $c$-optimal designs are developed when the copula-based models are reduced to bivariate binary outcomes.
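A minimal sketch of the modeling ingredient described here: two dependent Gamma degradation paths whose increments are coupled through a Gaussian copula. Shapes, scales, the correlation, and the copula family are illustrative assumptions, not the paper's choices.

```python
# Two dependent Gamma degradation paths: increments coupled via a Gaussian
# copula, so each marginal path remains a monotone Gamma process.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_steps, rho = 100, 0.7
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n_steps)
u = stats.norm.cdf(z)                              # correlated uniforms
inc1 = stats.gamma.ppf(u[:, 0], a=0.5, scale=1.0)  # marginal Gamma increments
inc2 = stats.gamma.ppf(u[:, 1], a=0.8, scale=1.0)
path1, path2 = np.cumsum(inc1), np.cumsum(inc2)    # monotone, increasing
print(path1[-1], path2[-1], np.corrcoef(inc1, inc2)[0, 1])
```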

【15】 MARS: A second-order reduction algorithm for high-dimensional sparse precision matrices estimation

Authors: Qian Li, Binyan Jiang, Defeng Sun
Affiliations: Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Link: https://arxiv.org/abs/2106.13508
Abstract: Estimation of the precision matrix (or inverse covariance matrix) is of great importance in statistical data analysis. However, as the number of parameters scales quadratically with the dimension p, computation becomes very challenging when p is large. In this paper, we propose an adaptive sieving reduction algorithm to generate a solution path for the estimation of precision matrices under the $\ell_1$ penalized D-trace loss, with each subproblem being solved by a second-order algorithm. In each iteration of our algorithm, we are able to greatly reduce the number of variables in the problem based on the Karush-Kuhn-Tucker (KKT) conditions and the sparse structure of the estimated precision matrix in the previous iteration. As a result, our algorithm is capable of handling datasets with very high dimensions that may go beyond the capacity of the existing methods. Moreover, for the sub-problem in each iteration, other than solving the primal problem directly, we develop a semismooth Newton augmented Lagrangian algorithm with global linear convergence on the dual problem to improve the efficiency. Theoretical properties of our proposed algorithm have been established. In particular, we show that the convergence rate of our algorithm is asymptotically superlinear. The high efficiency and promising performance of our algorithm are illustrated via extensive simulation studies and real data applications, with comparison to several state-of-the-art solvers.

【16】 Semi-supervised multiple testing

Authors: David Mary, Etienne Roquain
Affiliations: Université Côte d'Azur, Observatoire de la Côte d'Azur, CNRS, Laboratoire Lagrange, Nice, France; Sorbonne Université (Université Pierre et Marie Curie), LPSM, Paris, France
Link: https://arxiv.org/abs/2106.13501
Abstract: An important limitation of standard multiple testing procedures is that the null distribution should be known. Here, we consider a null distribution-free approach for multiple testing in the following semi-supervised setting: the user does not know the null distribution, but has at hand a single sample drawn from this null distribution. In practical situations, this null training sample (NTS) can come from previous experiments, from a part of the data under test, from specific simulations, or from a sampling process. In this work, we present theoretical results that handle such a framework, with a focus on the false discovery rate (FDR) control and the Benjamini-Hochberg (BH) procedure. First, we introduce a procedure providing strong FDR control. Second, we also give a power analysis for that procedure suggesting that the price to pay for ignoring the null distribution is low when the NTS sample size $n$ is sufficiently large in front of the number of tests $m$; namely $n\gtrsim m/(\max(1,k))$, where $k$ denotes the number of "detectable" alternatives. Third, to complete the picture, we also present a negative result that evidences an intrinsic transition phase to the general semi-supervised multiple testing problem, and shows that the proposed method is optimal in the sense that its performance boundary follows this transition phase. Our theoretical properties are supported by numerical experiments, which also show that the delineated boundary is of correct order without further tuning any constant. Finally, we demonstrate that our approach provides a theoretical ground for standard practice in astronomical data analysis, and in particular for the procedure proposed in [Origin2020] for galaxy detection.
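The basic construction can be sketched in a few lines: empirical p-values computed against the null training sample, followed by the BH step-up procedure. The (1 + #exceedances)/(n + 1) form below is the standard way to get valid p-values from a null sample; the paper's exact procedure and analysis go well beyond this sketch.

```python
# Empirical p-values against a null training sample (NTS), then BH.
import numpy as np

rng = np.random.default_rng(0)
nts = rng.normal(size=5000)                       # null training sample
test_stats = np.concatenate([rng.normal(size=900),        # true nulls
                             rng.normal(3.0, 1.0, 100)])  # alternatives
pvals = (1 + (nts[None, :] >= test_stats[:, None]).sum(1)) / (len(nts) + 1)

def bh(p, alpha=0.1):                             # Benjamini-Hochberg step-up
    order = np.argsort(p)
    ok = p[order] <= alpha * np.arange(1, len(p) + 1) / len(p)
    k = ok.nonzero()[0].max() + 1 if ok.any() else 0
    rej = np.zeros(len(p), bool)
    rej[order[:k]] = True
    return rej

print("rejections at FDR 0.1:", bh(pvals).sum())
```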

【17】 A flexible Bayesian framework for individualized inference via dynamic borrowing

Authors: Ziyu Ji, Julian Wolfson
Affiliations: Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, USA
Note: Submitted to Biostatistics
Link: https://arxiv.org/abs/2106.13431
Abstract: The explosion in high-resolution data capture technologies in health has increased interest in making inference about individual-level parameters. While technology may provide substantial data on a single individual, how best to use multisource population data to improve individualized inference remains an open research question. One possible approach, the multisource exchangeability model (MEM), is a Bayesian method for integrating data from supplementary sources into the analysis of a primary source. MEM was originally developed to improve inference for a single study by borrowing information from similar previous studies; however, its computational burden grows exponentially with the number of supplementary sources, making it unsuitable for applications where hundreds or thousands of supplementary sources (i.e., individuals) could contribute to inference on a given individual. In this paper, we propose the data-driven MEM (dMEM), a two-stage approach that includes both source selection and clustering to enable the inclusion of an arbitrary number of sources to contribute to individualized inference in a computationally tractable and data-efficient way. We illustrate the application of dMEM to individual-level human behavior and mental well-being data collected via smartphones, where our approach increases individual-level estimation precision by 84% compared with a standard no-borrowing method and outperforms recently-proposed competing methods in 80% of individuals.

【18】 Implementation of an alternative method for assessing competing risks: restricted mean time lost

Authors: Hongji Wu, Hao Yuan, Zijing Yang, Yawen Hou, Zheng Chen
Affiliations: Department of Biostatistics, Southern Medical University, Guangzhou, China; Department of Statistics, Jinan University, Guangzhou, China
Note: American Journal of Epidemiology, 2021
Link: https://arxiv.org/abs/2106.13390
Abstract: In clinical and epidemiological studies, hazard ratios are often applied to compare treatment effects between two groups for survival data. For competing risks data, the corresponding quantities of interest are cause-specific hazard ratios (CHRs) and subdistribution hazard ratios (SHRs). However, they all have some limitations related to model assumptions and clinical interpretation. Therefore, we introduce restricted mean time lost (RMTL) as an alternative that is easy to interpret in a competing risks framework. We propose a hypothesis test and sample size estimator based on the difference in RMTL (RMTLd). The simulation results show that the RMTLd test has robust statistical performance (both type I error and power). Meanwhile, the RMTLd-based sample size can approximately achieve the predefined power level. The results of two example analyses also verify the performance of the RMTLd test. From the perspectives of clinical interpretation, application conditions and statistical performance, we recommend that the RMTLd be reported with the HR when analyzing competing risks data and that the RMTLd even be regarded as the primary outcome when the proportional hazard assumption fails.
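For intuition, the RMTL for cause k up to a horizon tau is the area under the cause-k cumulative incidence function, RMTL_k(tau) = ∫_0^tau F_k(t) dt. A minimal sketch, ignoring censoring for brevity (the paper's methods of course handle it):

```python
# Restricted mean time lost for cause k, computed as the integral of the
# empirical (uncensored) cumulative incidence step function over [0, tau].
import numpy as np

def rmtl(times, causes, k, tau):
    times, causes = np.asarray(times), np.asarray(causes)
    grid = np.sort(np.unique(np.append(times[times <= tau], tau)))
    cif = np.array([np.mean((times <= t) & (causes == k)) for t in grid])
    widths = np.diff(np.append(grid, tau))     # step widths, last is zero
    return float(np.sum(cif * widths))

rng = np.random.default_rng(0)
t1, t2 = rng.exponential(5, 500), rng.exponential(8, 500)
times, causes = np.minimum(t1, t2), np.where(t1 <= t2, 1, 2)
print("RMTL for cause 1 at tau=10:", rmtl(times, causes, k=1, tau=10.0))
```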

【19】 A numerically stable online implementation and exploration of WAIC through variations of the predictive density, using NIMBLE

Authors: Joshua E. Hug, Christopher J. Paciorek
Note: 27 pages, 9 tables. This is a preprint of the MA in Statistics thesis of Joshua Hug at the University of California, Berkeley, submitted May 2021
Link: https://arxiv.org/abs/2106.13359
Abstract: We go through the process of crafting a robust and numerically stable online algorithm for the computation of the Watanabe-Akaike information criterion (WAIC). We implement this algorithm in the NIMBLE software. The implementation is performed in an online manner and does not require the storage in memory of the complete samples from the posterior distribution. This algorithm allows the user to specify a specific form of the predictive density to be used in the computation of WAIC, in order to cater to specific prediction goals. We then comment and explore via simulations the use of different forms of the predictive density in the context of different predictive goals. We find that when using marginalized predictive densities, WAIC is sensitive to the grouping of the observations into a joint density.
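One way such an online computation can be organised, as a hedged sketch rather than NIMBLE's actual implementation: stream posterior draws one at a time, keeping a running log-sum-exp per observation for the lppd term and Welford accumulators for the variance (penalty) term.

```python
# Numerically stable *online* WAIC accumulator: no draws are stored.
import numpy as np

class OnlineWAIC:
    def __init__(self, n_obs):
        self.S = 0
        self.m = np.full(n_obs, -np.inf)   # running max of log-lik
        self.r = np.zeros(n_obs)           # running sum of exp(loglik - m)
        self.mean = np.zeros(n_obs)        # Welford running mean
        self.M2 = np.zeros(n_obs)          # Welford sum of squared deviations

    def update(self, loglik):              # loglik: length-n_obs vector
        self.S += 1
        new_m = np.maximum(self.m, loglik)
        self.r = self.r * np.exp(self.m - new_m) + np.exp(loglik - new_m)
        self.m = new_m
        delta = loglik - self.mean
        self.mean += delta / self.S
        self.M2 += delta * (loglik - self.mean)

    def waic(self):
        lppd = np.sum(self.m + np.log(self.r) - np.log(self.S))
        p_waic = np.sum(self.M2 / (self.S - 1))
        return -2.0 * (lppd - p_waic)

# usage: acc = OnlineWAIC(n); call acc.update(loglik_vec) once per posterior
# draw as the MCMC runs; read off acc.waic() at the end.
```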

【20】 On a Projection Estimator of the Regression Function Derivative

Authors: Fabienne Comte, Nicolas Marie
Note: 29 pages, 4 figures
Link: https://arxiv.org/abs/2106.13293
Abstract: In this paper, we study the estimation of the derivative of a regression function in a standard univariate regression model. The estimators are defined either by differentiating nonparametric least-squares estimators of the regression function or by estimating the projection of the derivative. We prove two simple risk bounds allowing to compare our estimators. More elaborate bounds under a stability assumption are then provided. Bases and spaces on which we can illustrate our assumptions and first results are of both compact and non-compact type, and we discuss the rates reached by our estimators. They turn out to be optimal in the compact case. Lastly, we propose a model selection procedure and prove the associated risk bound. Considering bases with non-compact support makes the problem difficult.

【21】 Self-training Converts Weak Learners to Strong Learners in Mixture Models

Authors: Spencer Frei, Difan Zou, Zixiang Chen, Quanquan Gu
Note: 21 pages
Link: https://arxiv.org/abs/2106.13805
Abstract: We consider a binary classification problem when the data comes from a mixture of two isotropic distributions satisfying concentration and anti-concentration properties enjoyed by log-concave distributions among others. We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$, then for any $\varepsilon>0$, an iterative self-training algorithm initialized at $\boldsymbol{\beta}_0 := \boldsymbol{\beta}_{\mathrm{pl}}$ using pseudolabels $\hat y = \mathrm{sgn}(\langle \boldsymbol{\beta}_t, \mathbf{x}\rangle)$ and using at most $\tilde O(d/\varepsilon^2)$ unlabeled examples suffices to learn the Bayes-optimal classifier up to $\varepsilon$ error, where $d$ is the ambient dimension. That is, self-training converts weak learners to strong learners using only unlabeled examples. We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples (i.e., independent of $\varepsilon$). Together our results imply that mixture models can be learned to within $\varepsilon$ of the Bayes-optimal accuracy using at most $O(d)$ labeled examples and $\tilde O(d/\varepsilon^2)$ unlabeled examples by way of a semi-supervised self-training algorithm.
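The self-training loop analysed here is conceptually simple; a toy sketch on a Gaussian mixture, with made-up sizes and sklearn's logistic regression standing in for the paper's gradient-descent iterates:

```python
# Iterative self-training on a two-component Gaussian mixture: a weak
# classifier fit on few labels is refined using only its own pseudolabels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 20
mu = rng.normal(size=d)
mu *= 2.0 / np.linalg.norm(mu)          # class mean, separation ||mu|| = 2

def sample(n):
    y = rng.choice([-1, 1], n)
    return y[:, None] * mu + rng.normal(size=(n, d)), y

X_lab, y_lab = sample(40)               # few labeled examples: weak learner
X_unlab, _ = sample(5000)               # labels discarded: unlabeled pool
X_test, y_test = sample(2000)

clf = LogisticRegression().fit(X_lab, y_lab)
print("pseudolabeler error:", np.mean(clf.predict(X_test) != y_test))
for _ in range(5):                      # self-training iterations
    clf = LogisticRegression().fit(X_unlab, clf.predict(X_unlab))
print("after self-training:", np.mean(clf.predict(X_test) != y_test))
```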

【22】 Assessing Generalization of SGD via Disagreement

Authors: Yiding Jiang, Vaishnavh Nagarajan, Christina Baek, J. Zico Kolter
Affiliations: Carnegie Mellon University; Bosch Center for AI, Pittsburgh
Link: https://arxiv.org/abs/2106.13799
Abstract: We empirically show that the test error of deep networks can be estimated by simply training the same architecture on the same training set but with a different run of Stochastic Gradient Descent (SGD), and measuring the disagreement rate between the two networks on unlabeled test data. This builds on, and is a stronger version of, the observation in Nakkiran & Bansal '20, which requires the second run to be on an altogether fresh training set. We further theoretically show that this peculiar phenomenon arises from the well-calibrated nature of ensembles of SGD-trained models. This finding not only provides a simple empirical measure to directly predict the test error using unlabeled test data, but also establishes a new conceptual connection between generalization and calibration.
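The estimator itself is easy to state: train twice with different SGD randomness, then measure disagreement on unlabeled test inputs. A toy sketch with plain logistic-regression SGD standing in for the deep networks studied in the paper (all constants illustrative):

```python
# Train the same model twice with different SGD randomness; use the
# disagreement rate on unlabeled test inputs as a test-error estimate.
import numpy as np

rng = np.random.default_rng(0)
d = 50
mu = rng.normal(size=d)
mu /= np.linalg.norm(mu)

def sample(n):
    y = rng.choice([-1, 1], n)
    return y[:, None] * mu + rng.normal(size=(n, d)), y

def sgd_logistic(X, y, seed, epochs=5, lr=0.1):
    order_rng = np.random.default_rng(seed)      # only the SGD run differs
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in order_rng.permutation(len(y)):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w -= lr * (p - (y[i] + 1) / 2) * X[i]
    return w

X, y = sample(1000)
X_test, y_test = sample(5000)
w1, w2 = sgd_logistic(X, y, 1), sgd_logistic(X, y, 2)
pred1, pred2 = np.sign(X_test @ w1), np.sign(X_test @ w2)
print("disagreement (no labels):", np.mean(pred1 != pred2))
print("actual test error:      ", np.mean(pred1 != y_test))
```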

【23】 Conjugate Energy-Based Models

Authors: Hao Wu, Babak Esmaeili, Michael Wick, Jean-Baptiste Tristan, Jan-Willem van de Meent
Affiliations: Northeastern University
Link: https://arxiv.org/abs/2106.13798
Abstract: In this paper, we propose conjugate energy-based models (CEBMs), a new class of energy-based models that define a joint density over data and latent variables. The joint density of a CEBM decomposes into an intractable distribution over data and a tractable posterior over latent variables. CEBMs have similar use cases as variational autoencoders, in the sense that they learn an unsupervised mapping from data to latent variables. However, these models omit a generator network, which allows them to learn more flexible notions of similarity between data points. Our experiments demonstrate that conjugate EBMs achieve competitive results in terms of image modelling, predictive power of latent space, and out-of-domain detection on a variety of datasets.

【24】 Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent

Authors: Spencer Frei, Quanquan Gu
Note: 14 pages
Link: https://arxiv.org/abs/2106.13792
Abstract: Although the optimization objectives for learning neural networks are highly non-convex, gradient-based methods have been wildly successful at learning neural networks in practice. This juxtaposition has led to a number of recent studies on provable guarantees for neural networks trained by gradient descent. Unfortunately, the techniques in these works are often highly specific to the problem studied in each setting, relying on different assumptions on the distribution, optimization parameters, and network architectures, making it difficult to generalize across different settings. In this work, we propose a unified non-convex optimization framework for the analysis of neural network training. We introduce the notions of proxy convexity and proxy Polyak-Lojasiewicz (PL) inequalities, which are satisfied if the original objective function induces a proxy objective function that is implicitly minimized when using gradient methods. We show that stochastic gradient descent (SGD) on objectives satisfying proxy convexity or the proxy PL inequality leads to efficient guarantees for proxy objective functions. We further show that many existing guarantees for neural networks trained by gradient descent can be unified through proxy convexity and proxy PL inequalities.
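For orientation, the classical Polyak-Lojasiewicz inequality that these proxy notions relax states that, for some $\mu > 0$ and minimum value $f^*$,

$$\frac{1}{2}\|\nabla f(w)\|^2 \;\ge\; \mu\,\bigl(f(w) - f^*\bigr) \quad \text{for all } w,$$

which yields linear convergence of gradient descent without convexity. Loosely, the proxy variants let the gradient of the training objective control a different (proxy) objective instead; see the paper for the precise definitions.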

【25】 Private Adaptive Gradient Methods for Convex Optimization

Authors: Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, Kunal Talwar
Affiliations: Department of Electrical Engineering & Computer Science, Massachusetts Institute of Technology
Note: To appear in the 38th International Conference on Machine Learning (ICML 2021)
Link: https://arxiv.org/abs/2106.13756
Abstract: We study adaptive methods for differentially private convex optimization, proposing and analyzing differentially private variants of a Stochastic Gradient Descent (SGD) algorithm with adaptive stepsizes, as well as the AdaGrad algorithm. We provide upper bounds on the regret of both algorithms and show that the bounds are (worst-case) optimal. As a consequence of our development, we show that our private versions of AdaGrad outperform adaptive SGD, which in turn outperforms traditional SGD in scenarios with non-isotropic gradients where (non-private) Adagrad provably outperforms SGD. The major challenge is that the isotropic noise typically added for privacy dominates the signal in gradient geometry for high-dimensional problems; approaches to this that effectively optimize over lower-dimensional subspaces simply ignore the actual problems that varying gradient geometries introduce. In contrast, we study non-isotropic clipping and noise addition, developing a principled theoretical approach; the consequent procedures also enjoy significantly stronger empirical performance than prior approaches.
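A bare-bones sketch of a differentially private AdaGrad step, with the usual isotropic per-example clipping and Gaussian noise; the paper's point is precisely that non-isotropic clipping and noise can do better, so treat this only as the baseline idea. Privacy accounting is omitted and all constants are illustrative.

```python
# DP-AdaGrad sketch: clip per-example gradients, add Gaussian noise, then
# feed the privatized gradient into an AdaGrad preconditioner.
import numpy as np

def dp_adagrad(X, y, epochs=10, lr=0.5, clip=1.0, noise_mult=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, G = np.zeros(d), 1e-8 * np.ones(d)     # AdaGrad accumulator
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))    # logistic loss, y in {0, 1}
        g = (p - y)[:, None] * X              # per-example gradients (n, d)
        scale = np.maximum(np.linalg.norm(g, axis=1) / clip, 1.0)
        g_clipped = g / scale[:, None]        # clip to norm <= clip
        noise = noise_mult * clip * rng.normal(size=d)
        g_priv = (g_clipped.sum(axis=0) + noise) / n
        G += g_priv**2                        # AdaGrad accumulation
        w -= lr * g_priv / np.sqrt(G)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = (X @ rng.normal(size=10) + rng.normal(size=500) > 0).astype(float)
print(dp_adagrad(X, y)[:3])
```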

【26】 Black Box Probabilistic Numerics

Authors: Onur Teymur, Christopher N. Foley, Philip G. Breen, Toni Karvonen, Chris. J. Oates
Affiliations: Newcastle University; Alan Turing Institute; University of Cambridge; Optima Partners UK; Roar AI
Link: https://arxiv.org/abs/2106.13718
Abstract: Probabilistic numerics casts numerical tasks, such as the numerical solution of differential equations, as inference problems to be solved. One approach is to model the unknown quantity of interest as a random variable, and to constrain this variable using data generated during the course of a traditional numerical method. However, data may be nonlinearly related to the quantity of interest, rendering the proper conditioning of random variables difficult and limiting the range of numerical tasks that can be addressed. Instead, this paper proposes to construct probabilistic numerical methods based only on the final output from a traditional method. A convergent sequence of approximations to the quantity of interest constitute a dataset, from which the limiting quantity of interest can be extrapolated, in a probabilistic analogue of Richardson's deferred approach to the limit. This black box approach (1) massively expands the range of tasks to which probabilistic numerics can be applied, (2) inherits the features and performance of state-of-the-art numerical methods, and (3) enables provably higher orders of convergence to be achieved. Applications are presented for nonlinear ordinary and partial differential equations, as well as for eigenvalue problems, a setting for which no probabilistic numerical methods have yet been developed.
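The classical, non-probabilistic version of "Richardson's deferred approach to the limit" that the method builds a probabilistic analogue of can be sketched as follows, using trapezoid-rule outputs at successively halved step sizes as the "dataset":

```python
# Classical Richardson/Romberg extrapolation: combine trapezoid-rule outputs
# at h, h/2, h/4, ... to cancel the leading error terms.
import numpy as np

def trapezoid(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    fx = f(x)
    return (b - a) / n * (fx[0] / 2 + fx[1:-1].sum() + fx[-1] / 2)

f, exact = np.sin, 1.0 - np.cos(1.0)
A = [trapezoid(f, 0.0, 1.0, 2**k) for k in range(1, 6)]   # the "dataset"

T = list(A)                # Richardson table, built column by column
for j in range(1, len(A)):
    T = [(4**j * T[i + 1] - T[i]) / (4**j - 1) for i in range(len(T) - 1)]

print("trapezoid error:   ", abs(A[-1] - exact))
print("extrapolated error:", abs(T[0] - exact))
```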

【27】 Task-Driven Out-of-Distribution Detection with Statistical Guarantees for Robot Learning

Authors: Alec Farid, Sushant Veer, Anirudha Majumdar
Affiliations: Department of Mechanical and Aerospace Engineering, Princeton University
Link: https://arxiv.org/abs/2106.13703
Abstract: Our goal is to perform out-of-distribution (OOD) detection, i.e., to detect when a robot is operating in environments that are drawn from a different distribution than the environments used to train the robot. We leverage Probably Approximately Correct (PAC)-Bayes theory in order to train a policy with a guaranteed bound on performance on the training distribution. Our key idea for OOD detection then relies on the following intuition: violation of the performance bound on test environments provides evidence that the robot is operating OOD. We formalize this via statistical techniques based on p-values and concentration inequalities. The resulting approach (i) provides guaranteed confidence bounds on OOD detection, and (ii) is task-driven and sensitive only to changes that impact the robot's performance. We demonstrate our approach on a simulated example of grasping objects with unfamiliar poses or shapes. We also present both simulation and hardware experiments for a drone performing vision-based obstacle avoidance in unfamiliar environments (including wind disturbances and different obstacle densities). Our examples demonstrate that we can perform task-driven OOD detection within just a handful of trials. Comparisons with baselines also demonstrate the advantages of our approach in terms of providing statistical guarantees and being insensitive to task-irrelevant distribution shifts.
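The p-value construction alluded to can be sketched with a Hoeffding bound: if training guarantees expected cost at most B on in-distribution environments (costs in [0, 1]), the probability that the average of m in-distribution trials exceeds B + t is at most exp(-2mt^2). The bound value and trial costs below are hypothetical.

```python
# Hoeffding-based p-value for "the robot is operating OOD", given a
# PAC-style bound on expected in-distribution cost.
import numpy as np

def ood_p_value(costs, bound):
    m = len(costs)
    t = max(0.0, float(np.mean(costs)) - bound)
    return float(np.exp(-2.0 * m * t**2))   # P(mean >= E + t) <= exp(-2mt^2)

rng = np.random.default_rng(0)
B = 0.25                                    # hypothetical PAC-Bayes bound
print(ood_p_value(rng.uniform(0.0, 0.4, 10), B))   # consistent with bound
print(ood_p_value(rng.uniform(0.5, 1.0, 10), B))   # strong evidence of OOD
```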

【28】 A proximal-proximal majorization-minimization algorithm for nonconvex tuning-free robust regression problems

Authors: Peipei Tang, Chengjing Wang, Bo Jiang
Note: 31 pages, 7 tables
Link: https://arxiv.org/abs/2106.13683
Abstract: In this paper, we introduce a proximal-proximal majorization-minimization (PPMM) algorithm for nonconvex tuning-free robust regression problems. The basic idea is to apply the proximal majorization-minimization algorithm to solve the nonconvex problem with the inner subproblems solved by a sparse semismooth Newton (SSN) method based proximal point algorithm (PPA). We must emphasize that the main difficulty in the design of the algorithm lies in how to overcome the singular difficulty of the inner subproblem. Furthermore, we also prove that the PPMM algorithm converges to a d-stationary point. Due to the Kurdyka-Lojasiewicz (KL) property of the problem, we present the convergence rate of the PPMM algorithm. Numerical experiments demonstrate that our proposed algorithm outperforms the existing state-of-the-art algorithms.

【29】 Robust Matrix Factorization with Grouping Effect

Authors: Haiyan Jiang, Shuyu Li, Luwei Zhang, Haoyi Xiong, Dejing Dou
Affiliations: Baidu Research, Baidu Inc., China; Columbia University, New York, NY, USA
Note: 22 pages, 5 figures, 4 tables
Link: https://arxiv.org/abs/2106.13681
Abstract: Although many techniques have been applied to matrix factorization (MF), they may not fully exploit the feature structure. In this paper, we incorporate the grouping effect into MF and propose a novel method called Robust Matrix Factorization with Grouping effect (GRMF). The grouping effect is a generalization of the sparsity effect, which conducts denoising by clustering similar values around multiple centers instead of just around 0. Compared with existing algorithms, the proposed GRMF can automatically learn the grouping structure and sparsity in MF without prior knowledge, by introducing a naturally adjustable non-convex regularization to achieve simultaneous sparsity and grouping effect. Specifically, GRMF uses an efficient alternating minimization framework to perform MF, in which the original non-convex problem is first converted into a convex problem through Difference-of-Convex (DC) programming, and then solved by Alternating Direction Method of Multipliers (ADMM). In addition, GRMF can be easily extended to the Non-negative Matrix Factorization (NMF) settings. Extensive experiments have been conducted using real-world data sets with outliers and contaminated noise, where the experimental results show that GRMF has promoted performance and robustness, compared to five benchmark algorithms.

【30】 Multi-player Multi-armed Bandits with Collision-Dependent Reward Distributions

Authors: Chengshuai Shi, Cong Shen
Affiliations: Brown Department of Electrical and Computer Engineering, University of Virginia
Note: 17 pages, 14 figures. Accepted to IEEE Transactions on Signal Processing
Link: https://arxiv.org/abs/2106.13669
Abstract: We study a new stochastic multi-player multi-armed bandits (MP-MAB) problem, where the reward distribution changes if a collision occurs on the arm. Existing literature always assumes a zero reward for involved players if collision happens, but for applications such as cognitive radio, the more realistic scenario is that collision reduces the mean reward but not necessarily to zero. We focus on the more practical no-sensing setting where players do not perceive collisions directly, and propose the Error-Correction Collision Communication (EC3) algorithm that models implicit communication as a reliable communication over noisy channel problem, for which random coding error exponent is used to establish the optimal regret that no communication protocol can beat. Finally, optimizing the tradeoff between code length and decoding error rate leads to a regret that approaches the centralized MP-MAB regret, which represents a natural lower bound. Experiments with practical error-correction codes on both synthetic and real-world datasets demonstrate the superiority of EC3. In particular, the results show that the choice of coding schemes has a profound impact on the regret performance.

【31】 VEGN: Variant Effect Prediction with Graph Neural Networks 标题:VEGN:基于图神经网络的变异效应预测

作者:Jun Cheng,Carolin Lawrence,Mathias Niepert 备注:Accepted at Workshop on Computational Biology, co-located with the 38th International Conference on Machine Learning 链接:https://arxiv.org/abs/2106.13642 摘要:基因突变可以破坏正常的基因功能而导致疾病。从单个患者体内数百万个遗传变异中识别致病突变是一个具有挑战性的问题,因此能够对致病突变进行优先排序的计算方法具有巨大的应用价值。众所周知,基因通过一个复杂的调控网络发挥作用,然而现有的变异效应预测模型只孤立地考虑单个变异。与此相反,我们提出VEGN,它使用在包含基因和变异的异质图上运行的图神经网络(GNN)来建模变异效应预测。该图通过将变异归属到基因、并用基因-基因相互作用网络连接基因来构建。在此设定下,我们探索了两种方案:一种给定基因-基因图,另一种由VEGN学习基因-基因图,从而同时在给定的边和学习到的边上运行。图神经网络经训练后在基因之间以及基因与变异之间聚合信息,变异可以通过其连接的基因交换信息。这种方法提高了现有最先进模型的性能。 摘要:Genetic mutations can cause disease by disrupting normal gene function. Identifying the disease-causing mutations from millions of genetic variants within an individual patient is a challenging problem. Computational methods which can prioritize disease-causing mutations have, therefore, enormous applications. It is well-known that genes function through a complex regulatory network. However, existing variant effect prediction models only consider a variant in isolation. In contrast, we propose VEGN, which models variant effect prediction using a graph neural network (GNN) that operates on a heterogeneous graph with genes and variants. The graph is created by assigning variants to genes and connecting genes with a gene-gene interaction network. In this context, we explore an approach where a gene-gene graph is given and another where VEGN learns the gene-gene graph and therefore operates both on given and learnt edges. The graph neural network is trained to aggregate information between genes, and between genes and variants. Variants can exchange information via the genes they connect to. This approach improves the performance of existing state-of-the-art models.
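下面用一个自包含的玩具实现(随机生成的特征、边和权重矩阵均为假设)示意摘要所述的异质图消息传递:基因聚合相邻基因与所属变异的信息,变异经由其连接的基因交换信息。这并非论文代码,仅演示一层传播的结构。

```python
# 玩具示意:基因-变异异质图上的一层消息传递(非论文实现)
import numpy as np

n_genes, n_vars, d = 4, 6, 8
rng = np.random.default_rng(0)
H_gene = rng.normal(size=(n_genes, d))                        # 基因节点特征
H_var = rng.normal(size=(n_vars, d))                          # 变异节点特征
A_gg = (rng.random((n_genes, n_genes)) < 0.4).astype(float)   # 基因-基因边
A_gv = (rng.random((n_genes, n_vars)) < 0.3).astype(float)    # 变异归属边
W_gg, W_gv, W_vg = (rng.normal(size=(d, d)) for _ in range(3))

def norm_agg(A, H):
    # 度归一化的邻居聚合
    return (A @ H) / (A.sum(1, keepdims=True) + 1e-8)

# 基因同时聚合相邻基因与所属变异的信息;变异从其连接的基因收消息
H_gene_new = np.tanh(norm_agg(A_gg, H_gene) @ W_gg + norm_agg(A_gv, H_var) @ W_gv)
H_var_new = np.tanh(norm_agg(A_gv.T, H_gene) @ W_vg)
```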

【32】 Chebyshev-Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote 标题:加权多数票的Chebyshev-Cantelli PAC-Bayes-Bennett不等式

作者:Yi-Shan Wu,Andrés R. Masegosa,Stephan S. Lorenzen,Christian Igel,Yevgeny Seldin 机构:University of Copenhagen, University of Almería 备注:arXiv admin note: text overlap with arXiv:2007.13532 链接:https://arxiv.org/abs/2106.13624 摘要:我们为加权多数票的期望风险提出一个新的二阶oracle界。该界基于Chebyshev-Cantelli不等式(又称单侧Chebyshev不等式)的一种新的参数化形式,该形式便于高效极小化。新形式解决了先前基于Chebyshev-Cantelli不等式的oracle界(即C-界 [Germain et al., 2015])所面临的优化难题,同时改进了Masegosa et al. [2020] 引入的基于二阶Markov不等式的oracle界。我们还推导了PAC-Bayes-Bennett不等式,并将其用于oracle界的经验估计;该不等式改进了Seldin et al. [2012] 的PAC-Bayes-Bernstein不等式。我们提供的实证评估表明,新的界可以改进Masegosa et al. [2020] 的工作。Chebyshev-Cantelli不等式的参数化形式和PAC-Bayes-Bennett不等式对于其他领域中测度集中的研究也可能具有独立的意义。 摘要:We present a new second-order oracle bound for the expected risk of a weighted majority vote. The bound is based on a novel parametric form of the Chebyshev-Cantelli inequality (a.k.a. one-sided Chebyshev's), which is amenable to efficient minimization. The new form resolves the optimization challenge faced by prior oracle bounds based on the Chebyshev-Cantelli inequality, the C-bounds [Germain et al., 2015], and, at the same time, it improves on the oracle bound based on second order Markov's inequality introduced by Masegosa et al. [2020]. We also derive the PAC-Bayes-Bennett inequality, which we use for empirical estimation of the oracle bound. The PAC-Bayes-Bennett inequality improves on the PAC-Bayes-Bernstein inequality by Seldin et al. [2012]. We provide an empirical evaluation demonstrating that the new bounds can improve on the work by Masegosa et al. [2020]. Both the parametric form of the Chebyshev-Cantelli inequality and the PAC-Bayes-Bennett inequality may be of independent interest for the study of concentration of measure in other domains.
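作为背景,下面给出经典的单侧Chebyshev(Cantelli)不等式;论文的贡献是其便于高效极小化的参数化形式,具体形式以原文为准。

```latex
\[
  \Pr\bigl(X - \mathbb{E}[X] \ge t\bigr)
  \;\le\; \frac{\operatorname{Var}[X]}{\operatorname{Var}[X] + t^{2}},
  \qquad t > 0 .
\]
```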

【33】 Nonparametric monitoring of sunspot number observations: a case study 标题:太阳黑子数观测的非参数监测:一个案例研究

作者:Sophie Mathieu,Laure Lefèvre,Rainer von Sachs,Véronique Delouille,Christian Ritter,Frédéric Clette 机构:ISBA/LIDAM, UCLouvain; Solar Physics and Space Weather department, Royal Observatory of Belgium 备注:27 pages (without appendices), 6 figures 链接:https://arxiv.org/abs/2106.13535 摘要:太阳活动是长期气候趋势的重要驱动因素,必须在气候模型中加以考虑。不幸的是,这一量缺乏长期的直接测量。唯一与太阳活动相关、记录可追溯到十七世纪的观测就是太阳黑子。令人惊讶的是,随时间一致地确定太阳黑子数至今仍是一个具有挑战性的统计问题:它源于需要在低信噪比、非平稳性、数据缺失、非标准分布和多种误差并存的背景下,整合来自世界各地多个观测站的数据。因此,一些台站的数据随时间出现了严重而多样的偏差。在本文中,我们提出了第一个系统而彻底的统计方法来监测这些复杂而重要的序列。它由成功处理数据所必需的三个步骤组成:多时间尺度上的平滑、使用分块自助法(block bootstrap)校准的CUSUM控制图进行监测,以及用支持向量技术对失控情形进行分类。这一方法使我们能够检测到以往分析未能发现的大范围异常(如突然跳变或更渐进的漂移),并帮助我们查明主要偏差的原因,这些偏差通常与观察者或设备有关。发现并识别这些偏差将有助于改进今后的观测;在历史数据中剔除或修正它们,将使太阳活动的世界参考指数(国际太阳黑子数)得到更精确的重建。 摘要:Solar activity is an important driver of long-term climate trends and must be accounted for in climate models. Unfortunately, direct measurements of this quantity over long periods do not exist. The only observation related to solar activity whose records reach back to the seventeenth century are sunspots. Surprisingly, determining the number of sunspots consistently over time has remained until today a challenging statistical problem. It arises from the need of consolidating data from multiple observing stations around the world in a context of low signal-to-noise ratios, non-stationarity, missing data, non-standard distributions and many kinds of errors. The data from some stations experience therefore severe and various deviations over time. In this paper, we propose the first systematic and thorough statistical approach for monitoring these complex and important series. It consists of three steps essential for successful treatment of the data: smoothing on multiple timescales, monitoring using block bootstrap calibrated CUSUM charts and classifying of out-of-control situations by support vector techniques. This approach allows us to detect a wide range of anomalies (such as sudden jumps or more progressive drifts), unseen in previous analyses. It helps us to identify the causes of major deviations, which are often observer or equipment related. Their detection and identification will contribute to improve future observations. Their elimination or correction in past data will lead to a more precise reconstruction of the world reference index for solar activity: the International Sunspot Number.
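下面是监测步骤中单侧CUSUM统计量的一个极简示意(目标值、容许漂移k与控制限h均为假设;论文中h由分块自助法校准):

```python
# 极简示意:上侧 CUSUM 控制图(参数为假设;h 在论文中用块自助法校准)
import numpy as np

def cusum_alarms(x, target, k, h):
    """返回报警时刻;k 为容许漂移,h 为控制限。"""
    s, alarms = 0.0, []
    for i, xi in enumerate(x):
        s = max(0.0, s + (xi - target - k))   # 上侧 CUSUM 递推
        if s > h:
            alarms.append(i)
            s = 0.0                            # 报警后重置统计量
    return alarms

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(0.8, 1.0, 100)])   # 第 200 点后出现均值漂移
print(cusum_alarms(x, target=0.0, k=0.25, h=5.0))
```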

【34】 Federated Graph Classification over Non-IID Graphs 标题:非IID图上的联邦图分类

作者:Han Xie,Jing Ma,Li Xiong,Carl Yang 机构:Department of Computer Science, Emory University 链接:https://arxiv.org/abs/2106.13423 摘要:联邦学习已成为在不同领域训练机器学习模型的一个重要范式。对于图分类等图级任务,图也可以看作一种特殊类型的数据样本,可在相互独立的本地系统中收集和存储。与其他领域类似,多个各自持有少量图的本地系统可以从协作训练一个强大的图挖掘模型(如流行的图神经网络GNN)中获益。为了给这类努力提供更多依据,我们分析了来自不同领域的真实图,证实它们确实共享某些与随机图相比具有统计显著性的图属性;但我们也发现,不同的图集即使来自同一领域或同一数据集,在图结构和节点特征上也都是非IID的。为此,我们提出图聚类联邦学习(GCFL)框架,基于GNN的梯度动态发现本地系统的聚类,并从理论上论证这种聚类可以降低各本地系统所持有图之间的结构与特征异质性。此外,我们观察到GCFL中GNN的梯度波动较大,阻碍了高质量聚类,因此设计了一种基于动态时间规整(DTW)的梯度序列聚类机制(GCFL+)。大量实验结果和深入分析证明了所提框架的有效性。 摘要:Federated learning has emerged as an important paradigm for training machine learning models in different domains. For graph-level tasks such as graph classification, graphs can also be regarded as a special type of data samples, which can be collected and stored in separate local systems. Similar to other domains, multiple local systems, each holding a small set of graphs, may benefit from collaboratively training a powerful graph mining model, such as the popular graph neural networks (GNNs). To provide more motivation towards such endeavors, we analyze real-world graphs from different domains to confirm that they indeed share certain graph properties that are statistically significant compared with random graphs. However, we also find that different sets of graphs, even from the same domain or same dataset, are non-IID regarding both graph structures and node features. To handle this, we propose a graph clustering federated learning (GCFL) framework that dynamically finds clusters of local systems based on the gradients of GNNs, and theoretically justify that such clusters can reduce the structure and feature heterogeneity among graphs owned by the local systems. Moreover, we observe the gradients of GNNs to be rather fluctuating in GCFL which impedes high-quality clustering, and design a gradient sequence-based clustering mechanism based on dynamic time warping (GCFL+). Extensive experimental results and in-depth analysis demonstrate the effectiveness of our proposed frameworks.
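下面用一个玩具函数(阈值与相似度度量均为假设)示意"按客户端GNN梯度动态分簇"的想法;论文的拆分准则基于梯度的动态指标,GCFL+进一步对梯度序列做动态时间规整,这里仅以一次余弦相似度二分作结构示意。

```python
# 玩具示意:按客户端梯度的余弦相似度做一次二分聚类(非论文准则)
import numpy as np

def split_by_gradient(grads, thresh=0.0):
    """grads: [n_clients, dim];当最不相似的一对低于阈值时拆成两簇。"""
    g = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12)
    sim = g @ g.T                                         # 两两余弦相似度
    c0, c1 = np.unravel_index(np.argmin(sim), sim.shape)  # 最不相似的一对
    if sim[c0, c1] >= thresh:
        return np.zeros(len(grads), dtype=int)            # 足够相似,不拆分
    return np.where(sim[:, c0] >= sim[:, c1], 0, 1)       # 按更像谁归簇

rng = np.random.default_rng(0)
grads = np.vstack([rng.normal(1, 0.1, (3, 5)),            # 两组异质客户端
                   rng.normal(-1, 0.1, (3, 5))])
print(split_by_gradient(grads))                           # 期望得到两簇
```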

【35】 The Price of Tolerance in Distribution Testing 标题:分布测试中的容差代价

作者:Clément L. Canonne,Ayush Jain,Gautam Kamath,Jerry Li 链接:https://arxiv.org/abs/2106.13414 摘要:我们重新审视带容差的分布测试问题:给定来自$\{1,\dots,n\}$上未知分布$p$的样本,判断它与参考分布$q$在总变差距离下是$\varepsilon_1$-接近还是$\varepsilon_2$-远离?尽管过去十年人们对该问题兴趣浓厚,但只有在两个极端情形下它才被较好地理解。在无噪声设置中(即$\varepsilon_1=0$),样本复杂度为$\Theta(\sqrt{n})$,对域大小是强次线性的;在谱的另一端,当$\varepsilon_1=\varepsilon_2/2$时,样本复杂度跳升到勉强次线性的$\Theta(n/\log n)$。然而,人们对中间区域知之甚少。我们将分布测试中的容差代价完全刻画为$n$、$\varepsilon_1$、$\varepsilon_2$的函数,至多相差一个$\log n$因子。具体而言,我们证明样本复杂度为 \[\tilde\Theta\left(\frac{\sqrt{n}}{\varepsilon_2^{2}}+\frac{n}{\log n}\cdot\max\left\{\frac{\varepsilon_1}{\varepsilon_2^2},\left(\frac{\varepsilon_1}{\varepsilon_2^2}\right)^{2}\right\}\right),\] 在先前已知的两种情形之间给出平滑的过渡。对于$p$和$q$均未知的带容差等价性测试问题,我们也给出了类似的刻画。令人惊讶的是,在这两种情形中,决定样本复杂度的关键量是比率$\varepsilon_1/\varepsilon_2^2$,而非更直观的$\varepsilon_1/\varepsilon_2$。特别具有技术价值的是我们的下界框架,它引入了处理$\varepsilon_1$与$\varepsilon_2$之间不对称性所需的新型逼近论工具,而这一挑战在以往工作中并不存在。 摘要:We revisit the problem of tolerant distribution testing. That is, given samples from an unknown distribution $p$ over $\{1, \dots, n\}$, is it $\varepsilon_1$-close to or $\varepsilon_2$-far from a reference distribution $q$ (in total variation distance)? Despite significant interest over the past decade, this problem is well understood only in the extreme cases. In the noiseless setting (i.e., $\varepsilon_1 = 0$) the sample complexity is $\Theta(\sqrt{n})$, strongly sublinear in the domain size. At the other end of the spectrum, when $\varepsilon_1 = \varepsilon_2/2$, the sample complexity jumps to the barely sublinear $\Theta(n/\log n)$. However, very little is known about the intermediate regime. We fully characterize the price of tolerance in distribution testing as a function of $n$, $\varepsilon_1$, $\varepsilon_2$, up to a single $\log n$ factor. Specifically, we show the sample complexity to be \[\tilde \Theta\left(\frac{\sqrt{n}}{\varepsilon_2^{2}} + \frac{n}{\log n} \cdot \max \left\{\frac{\varepsilon_1}{\varepsilon_2^2},\left(\frac{\varepsilon_1}{\varepsilon_2^2}\right)^{\!\!2}\right\}\right),\] providing a smooth tradeoff between the two previously known cases. We also provide a similar characterization for the problem of tolerant equivalence testing, where both $p$ and $q$ are unknown. Surprisingly, in both cases, the main quantity dictating the sample complexity is the ratio $\varepsilon_1/\varepsilon_2^2$, and not the more intuitive $\varepsilon_1/\varepsilon_2$. Of particular technical interest is our lower bound framework, which involves novel approximation-theoretic tools required to handle the asymmetry between $\varepsilon_1$ and $\varepsilon_2$, a challenge absent from previous works.
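由上式可直接验证摘要中的两个已知极端情形(把 $\varepsilon_2$ 视为常数):

```latex
\[
  \varepsilon_1 = 0 \;\Rightarrow\;
  \tilde\Theta\!\left(\frac{\sqrt{n}}{\varepsilon_2^{2}}\right),
  \qquad
  \varepsilon_1 = \frac{\varepsilon_2}{2} \;\Rightarrow\;
  \frac{\varepsilon_1}{\varepsilon_2^2} = \frac{1}{2\varepsilon_2}
  \;\Rightarrow\;
  \tilde\Theta\!\left(\frac{n}{\log n}\right).
\]
```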

【36】 Bayesian Inference in High-Dimensional Time-Series with the Orthogonal Stochastic Linear Mixing Model 标题:正交随机线性混合模型在高维时间序列中的贝叶斯推断

作者:Rui Meng,Kristofer Bouchard 机构:Lawrence Berkeley National Laboratory, University of California, Berkeley. 链接:https://arxiv.org/abs/2106.13379 摘要:许多现代时间序列数据集包含大量长时间采样的输出响应变量。例如,在神经科学中,人们会在行为过程中以及对感觉刺激的反应中记录数百至数千个神经元的活动。多输出高斯过程模型利用高斯过程的非参数特性来捕获多个输出之间的结构,然而这类模型通常假设输出响应变量之间的相关性在输入空间中保持不变。随机线性混合模型(SLMM)假设混合系数依赖于输入,因而更灵活,能有效捕捉复杂的输出依赖性;但目前SLMM在大数据集上的推断难以处理,使其无法应用于若干现代时间序列问题。在本文中,我们提出一个新的回归框架,即正交随机线性混合模型(OSLMM),它在混合系数之间引入正交约束。该约束降低了推断的计算负担,同时保留了处理复杂输出依赖性的能力。我们为SLMM和OSLMM提供了马尔可夫链蒙特卡罗推断流程,并在多个实际应用中证明,与最新方法相比,OSLMM具有更好的模型可扩展性和更低的预测误差。在神经生理学记录中,我们使用推断出的潜在函数对群体神经元对听觉刺激的反应进行紧凑的可视化,其结果优于竞争方法(GPFA)。总之,这些结果表明OSLMM将有助于分析多样化的大规模时间序列数据集。 摘要:Many modern time-series datasets contain large numbers of output response variables sampled for prolonged periods of time. For example, in neuroscience, the activities of 100s-1000's of neurons are recorded during behaviors and in response to sensory stimuli. Multi-output Gaussian process models leverage the nonparametric nature of Gaussian processes to capture structure across multiple outputs. However, this class of models typically assumes that the correlations between the output response variables are invariant in the input space. Stochastic linear mixing models (SLMM) assume the mixture coefficients depend on input, making them more flexible and effective to capture complex output dependence. However, currently, the inference for SLMMs is intractable for large datasets, making them inapplicable to several modern time-series problems. In this paper, we propose a new regression framework, the orthogonal stochastic linear mixing model (OSLMM) that introduces an orthogonal constraint amongst the mixing coefficients. This constraint reduces the computational burden of inference while retaining the capability to handle complex output dependence. We provide Markov chain Monte Carlo inference procedures for both SLMM and OSLMM and demonstrate superior model scalability and reduced prediction error of OSLMM compared with state-of-the-art methods on several real-world applications. In neurophysiology recordings, we use the inferred latent functions for compact visualization of population responses to auditory stimuli, and demonstrate superior results compared to a competing method (GPFA). Together, these results demonstrate that OSLMM will be useful for the analysis of diverse, large-scale time-series datasets.
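按摘要的描述,SLMM/OSLMM 的结构可粗略地写成下式(记号为假设,正交约束的确切形式以原文为准):混合矩阵 $A(x)$ 随输入变化,OSLMM 对混合系数施加正交约束以降低推断开销。

```latex
\[
  y(x) \;=\; A(x)\, g(x) + \varepsilon,
  \qquad g_j \sim \mathcal{GP}(0, k_j),
  \qquad A(x)^{\top} A(x)\ \text{为对角阵(假设的形式化)} .
\]
```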

【37】 On the (Un-)Avoidability of Adversarial Examples 标题:论对抗样本的(不)可避免性

作者:Sadia Chowdhury,Ruth Urner 机构:Lassonde School of Engineering, York University 备注:ICML 2021 Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI 链接:https://arxiv.org/abs/2106.13326 摘要:深度学习模型中的对抗样本现象引起了人们对其可靠性的极大关注。虽然许多深度神经网络在预测精度方面表现出色,但已有研究表明,在许多情况下,一个不可察觉的扰动就能错误地翻转网络的预测。此后大多数研究集中于开发针对对抗攻击的防御方法,或在最坏情况的对抗损失下进行学习。在这项工作中,我们退后一步,旨在提供一个框架来判定模型在小扰动下的标签变化何时是合理的(以及何时不合理)。我们审慎地论证:对抗鲁棒性应被定义为一种与底层分布相符的局部自适应度量。随后我们给出自适应鲁棒损失的定义,推导其经验版本,并由此发展出一个数据增广框架。我们证明了该自适应数据增广在确定性标签下保持1-最近邻分类的一致性,并提供了说明性的经验评估。 摘要:The phenomenon of adversarial examples in deep learning models has caused substantial concern over their reliability. While many deep neural networks have shown impressive performance in terms of predictive accuracy, it has been shown that in many instances an imperceptible perturbation can falsely flip the network's prediction. Most research has then focused on developing defenses against adversarial attacks or learning under a worst-case adversarial loss. In this work, we take a step back and aim to provide a framework for determining whether a model's label change under small perturbation is justified (and when it is not). We carefully argue that adversarial robustness should be defined as a locally adaptive measure complying with the underlying distribution. We then suggest a definition for an adaptive robust loss, derive an empirical version of it, and develop a resulting data-augmentation framework. We prove that our adaptive data-augmentation maintains consistency of 1-nearest neighbor classification under deterministic labels and provide illustrative empirical evaluations.
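作为"局部自适应"的一个假设性实例(并非论文的正式定义),可以把每个训练点的增广半径取为其到最近异类点距离的一半:在该半径内做数据增广,不会在1-最近邻意义下跨越类别边界。

```python
# 玩具示意:局部自适应的数据增广半径(假设的实例化,非论文定义)
import numpy as np

def adaptive_radii(X, y):
    # 每个点的半径 = 到最近异类训练点距离的一半
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.array([0.5 * D[i][y != y[i]].min() for i in range(len(X))])

def augment(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    r = adaptive_radii(X, y)
    noise = rng.normal(size=(len(X), k, X.shape[1]))
    noise /= np.linalg.norm(noise, axis=-1, keepdims=True)   # 单位方向
    scale = rng.uniform(0.0, 1.0, size=(len(X), k, 1)) * r[:, None, None]
    X_aug = (X[:, None, :] + scale * noise).reshape(-1, X.shape[1])
    return X_aug, np.repeat(y, k)

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y = np.array([0, 0, 1])
Xa, ya = augment(X, y)
print(Xa.shape, ya)
```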

【38】 A variational autoencoder approach for choice set generation and implicit perception of alternatives in choice modeling 标题:选择建模中选择集生成与备选方案隐式感知的变分自编码器方法

作者:Rui Yao,Shlomo Bekhor 机构:Department of Civil and Environmental Engineering, Technion – Israel Institute of Technology, Haifa, Israel 链接:https://arxiv.org/abs/2106.13319 摘要:本文推导了带有备选方案隐式可用性/感知(IAP)的广义极值(GEV)模型,并提出一种用于选择集生成和备选方案隐式感知的变分自编码器(VAE)方法。具体而言,作为IAP-GEV模型的一个实例,推导了带IAP的交叉嵌套logit(CNL)模型。VAE方法被用于建模选择集生成过程,使所选方案在选择集中被感知的似然最大化。我们用一个真实数据集示例说明了该VAE方法在路径选择集生成中的应用。与多项logit模型和传统选择集生成方法相比,所估计的IAP-CNL模型在拟合优度和预测性能方面均表现最佳。 摘要:This paper derives the generalized extreme value (GEV) model with implicit availability/perception (IAP) of alternatives and proposes a variational autoencoder (VAE) approach for choice set generation and implicit perception of alternatives. Specifically, the cross-nested logit (CNL) model with IAP is derived as an example of IAP-GEV models. The VAE approach is adapted to model the choice set generation process, in which the likelihood of perceiving chosen alternatives in the choice set is maximized. The VAE approach for route choice set generation is exemplified using a real dataset. IAP-CNL model estimated has the best performance in terms of goodness-of-fit and prediction performance, compared to multinomial logit models and conventional choice set generation methods.
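作为背景,VAE训练所最大化的变分下界(ELBO)如下;按摘要,此处可把 $x$ 理解为观测到的选择集、$z$ 为潜变量,目标是使所选方案在生成的选择集中被感知的似然最大(记号为通用写法,非论文原式):

```latex
\[
  \log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\bigl[\log p_\theta(x \mid z)\bigr]
  \;-\; \mathrm{KL}\!\bigl(q_\phi(z \mid x)\,\big\|\,p(z)\bigr).
\]
```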

【39】 Stochastic modeling of scientific impact 标题:科学影响的随机建模

作者:M. V. Simkin 机构:Department of Electrical and Computer Engineering, University of California 备注:Accepted for publication by Europhysics Letters 链接:https://arxiv.org/abs/2106.13295 摘要:最近的研究发现,少数科学家在高被引论文中所占份额不成比例地高。研究人员据此推断,如果科学上的成功是随机的,这种情况就不可能发生,并引入了一个隐藏参数Q(即才能)来解释这一发现:才华横溢的高Q科学家拥有许多高影响力论文。在这里,我证明对一个旧的随机引文复制模型加以升级同样可以解释这一发现。在新模型中,各论文被"引文复制"的概率并不相同,而是与其作者全部论文被引总数的对数成正比。该模型的数值模拟给出了与Q因子论文的实证发现相似的结果。 摘要:Recent research has found that select scientists have a disproportional share of highly cited papers. Researchers reasoned that this could not have happened if success in science was random and introduced a hidden parameter Q, or talent, to explain this finding. So, the talented high-Q scientists have many high impact papers. Here I show that an upgrade of an old random citation copying model could also explain this finding. In the new model the probability of citation copying is not the same for all papers but is proportional to the logarithm of the total number of citations to all papers of its author. Numerical simulations of the model give results similar to the empirical findings of the Q-factor article.
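摘要中的升级模型可以用几行代码模拟(作者数、步数等参数为假设,与论文的具体模拟设置可能不同,仅演示机制):每条新引文按与 log(作者被引总数) 成正比的权重"复制"给某位作者。

```python
# 玩具模拟:引文复制概率 ∝ log(作者被引总数)(参数为假设)
import numpy as np

rng = np.random.default_rng(0)
n_authors, n_steps = 50, 20000
cites = np.ones(n_authors)                  # 每位作者的被引总数,初始为 1

for _ in range(n_steps):
    w = np.log(cites + 1.0)                 # 复制权重:被引总数的对数
    cites[rng.choice(n_authors, p=w / w.sum())] += 1

print(np.sort(cites)[::-1][:5] / n_steps)   # 头部作者的被引份额(近似)
```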

【40】 Multitask Learning for Citation Purpose Classification 标题:面向引文目的分类的多任务学习

作者:Alex Oesterling,Angikar Ghosal,Haoyang Yu,Rui Xin,Yasa Baig,Lesia Semenova,Cynthia Rudin 机构:Duke University 备注:Second Workshop on Scholarly Document Processing 链接:https://arxiv.org/abs/2106.13275 摘要:本文介绍了我们参加2021年3C共享任务"基于目的的引文上下文分类"竞赛的参赛方案。竞赛的目标是根据引用目的对科学文章中的引文进行分类。这项任务很重要,因为它有望带来对科学文章的目的和用途更全面的总结方式;但它也很困难,主要原因是对每条引文的目的进行人工标注的可用训练数据有限,且这些标注带有很强的主观性。我们的参赛方案是一个多任务模型,它组合了从不同角度处理该问题的多个模块,包括手工构造的语言学特征、TF-IDF特征,以及一个带注意力机制的LSTM模型。我们还提供了消融研究和特征分析,其洞见可为未来工作提供方向。 摘要:We present our entry into the 2021 3C Shared Task Citation Context Classification based on Purpose competition. The goal of the competition is to classify a citation in a scientific article based on its purpose. This task is important because it could potentially lead to more comprehensive ways of summarizing the purpose and uses of scientific articles, but it is also difficult, mainly due to the limited amount of available training data in which the purposes of each citation have been hand-labeled, along with the subjectivity of these labels. Our entry in the competition is a multi-task model that combines multiple modules designed to handle the problem from different perspectives, including hand-generated linguistic features, TF-IDF features, and an LSTM-with-attention model. We also provide an ablation study and feature analysis whose insights could lead to future work.
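下面用一个极简流水线(示例文本与标签均为虚构)示意其中TF-IDF特征模块的用法;论文的完整模型还融合了手工语言学特征与带注意力的LSTM,这里不作展示。

```python
# 极简示意:多模块之一的 TF-IDF + 线性分类器(示例数据为虚构)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["... cited as background for the field ...",
         "... we use the method proposed in ..."]
labels = ["background", "uses"]             # 假设的引用目的标签

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["... following the method of ..."]))
```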
