
Statistics arXiv Daily Digest [7.9]

Author: WeChat official account arXiv每日学术速递 (arXiv Daily Academic Digest)
Published 2021-07-27 10:42:41
This article is included in the column arXiv每日学术速递.

stat (Statistics): 41 papers in total

【1】 Compressibility Analysis of Asymptotically Mean Stationary Processes

Authors: Jorge F. Silva
Affiliation: Information and Decision System Group, Department of Electrical Engineering, University of Chile, Av. Tupper, Room, Santiago, Chile
Link: https://arxiv.org/abs/2107.03975
Abstract: This work provides new results for the analysis of random sequences in terms of $\ell_p$-compressibility. The results characterize the degree in which a random sequence can be approximated by its best $k$-sparse version under different rates of significant coefficients (compressibility analysis). In particular, the notion of strong $\ell_p$-characterization is introduced to denote a random sequence that has a well-defined asymptotic limit (sample-wise) of its best $k$-term approximation error when a fixed rate of significant coefficients is considered (fixed-rate analysis). The main theorem of this work shows that the rich family of asymptotically mean stationary (AMS) processes has a strong $\ell_p$-characterization. Furthermore, we present results that characterize and analyze the $\ell_p$-approximation error function for this family of processes. Adding ergodicity in the analysis of AMS processes, we introduce a theorem demonstrating that the approximation error function is constant and determined in closed-form by the stationary mean of the process. Our results and analyses contribute to the theory and understanding of discrete-time sparse processes and, on the technical side, confirm how instrumental the point-wise ergodic theorem is to determine the compressibility expression of discrete-time processes even when stationarity and ergodicity assumptions are relaxed.
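The best $k$-term approximation error that the abstract analyzes is straightforward to compute sample-wise: keep the $k$ largest-magnitude entries and measure the $\ell_p$ norm of what remains. A minimal numpy sketch (the function name and interface are mine, not the paper's):

```python
import numpy as np

def best_k_term_error(x, k, p=2):
    """l_p error of the best k-sparse approximation of x:
    keep the k largest-magnitude entries, zero out the rest."""
    x = np.asarray(x, dtype=float)
    if k >= x.size:
        return 0.0
    # the entries NOT among the k largest in magnitude form the residual
    residual = np.sort(np.abs(x))[:x.size - k]
    return float(np.sum(residual ** p) ** (1.0 / p))
```

The fixed-rate analysis in the paper tracks this quantity along a sequence with $k$ growing proportionally to the sample length.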

【2】 Locally differentially private estimation of nonlinear functionals of discrete distributions

Authors: Cristina Butucea, Yann Issartel
Link: https://arxiv.org/abs/2107.03940
Abstract: We study the problem of estimating non-linear functionals of discrete distributions in the context of local differential privacy. The initial data $x_1,\ldots,x_n \in [K]$ are supposed i.i.d. and distributed according to an unknown discrete distribution $p = (p_1,\ldots,p_K)$. Only $\alpha$-locally differentially private (LDP) samples $z_1,...,z_n$ are publicly available, where the term 'local' means that each $z_i$ is produced using one individual attribute $x_i$. We exhibit privacy mechanisms (PM) that are interactive (i.e. they are allowed to use already published confidential data) or non-interactive. We describe the behavior of the quadratic risk for estimating the power sum functional $F_{\gamma} = \sum_{k=1}^K p_k^{\gamma}$, $\gamma >0$ as a function of $K, \, n$ and $\alpha$. In the non-interactive case, we study two plug-in type estimators of $F_{\gamma}$, for all $\gamma >0$, that are similar to the MLE analyzed by Jiao et al. (2017) in the multinomial model. However, due to the privacy constraint the rates we attain are slower and similar to those obtained in the Gaussian model by Collier et al. (2020). In the interactive case, we introduce for all $\gamma >1$ a two-step procedure which attains the faster parametric rate $(n \alpha^2)^{-1/2}$ when $\gamma \geq 2$. We give lower bound results over all $\alpha$-LDP mechanisms and all estimators using the private samples.
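To illustrate the plug-in idea, here is a sketch using $K$-ary randomized response, a standard $\alpha$-LDP channel (not necessarily the exact mechanism of the paper): debias the observed private frequencies by inverting the channel matrix, then plug the debiased distribution into $F_\gamma$.

```python
import numpy as np

def rr_channel(K, alpha):
    """K-ary randomized response channel Q[z, x] = P(report z | truth x):
    report the truth with probability e^alpha / (e^alpha + K - 1)."""
    e = np.exp(alpha)
    Q = np.full((K, K), 1.0 / (e + K - 1))
    np.fill_diagonal(Q, e / (e + K - 1))
    return Q

def plug_in_F_gamma(z_freq, K, alpha, gamma):
    """Debias private frequencies by inverting the channel, then plug in.
    The clipping is a crude projection; a real estimator would project
    onto the probability simplex."""
    Q = rr_channel(K, alpha)
    p_hat = np.linalg.solve(Q, np.asarray(z_freq, float))
    p_hat = np.clip(p_hat, 0.0, None)
    return float(np.sum(p_hat ** gamma))
```

With the exact channel output as input, the debiasing recovers $p$ and hence $F_\gamma$ exactly; with empirical frequencies it is unbiased up to the clipping step.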

【3】 Balancing Higher Moments Matters for Causal Estimation: Further Context for the Results of Setodji et al. (2017)

Authors: Melody Y. Huang, Brian G. Vegetabile, Lane F. Burgette, Claude Setodji, Beth Ann Griffin
Affiliation: University of California, Los Angeles; RAND Corporation
Note: 9 pages, 1 table
Link: https://arxiv.org/abs/2107.03922
Abstract: We expand upon the simulation study of Setodji et al. (2017) which compared three promising balancing methods when assessing the average treatment effect on the treated for binary treatments: generalized boosted models (GBM), covariate-balancing propensity scores (CBPS), and entropy balance (EB). The study showed that GBM can outperform CBPS and EB when there are likely to be non-linear associations in both the treatment assignment and outcome models and CBPS and EB are fine-tuned to obtain balance only on first order moments. We explore the potential benefit of using higher-order moments in the balancing conditions for CBPS and EB. Our findings showcase that CBPS and EB should, by default, include higher order moments and that focusing only on first moments can result in substantial bias in the CBPS and EB treatment effect estimates that could be avoided by the use of higher moments.
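Entropy balancing with higher-order moment constraints can be sketched in a few lines of numpy: solve the dual of the KL projection by Newton's method so that the weighted raw moments of the control group match the treated-group targets. This is an illustrative one-covariate sketch under my own naming, not the paper's implementation:

```python
import numpy as np

def entropy_balance(x_ctrl, target_moments, orders=(1, 2), iters=50):
    """Entropy-balancing weights on control units so that the weighted raw
    moments x^r (r in orders) match the treated-group targets. Newton's
    method on the convex dual; assumes the targets lie strictly inside the
    convex hull of the control moment vectors."""
    x_ctrl = np.asarray(x_ctrl, float)
    g = np.column_stack([x_ctrl ** r for r in orders])   # moment functions
    m = np.asarray(target_moments, float)
    lam = np.zeros(g.shape[1])
    for _ in range(iters):
        w = np.exp(g @ lam)
        w /= w.sum()
        grad = g.T @ w - m                               # moment mismatch
        hess = (g * w[:, None]).T @ g - np.outer(g.T @ w, g.T @ w)
        lam -= np.linalg.solve(hess, grad)               # Newton step
    w = np.exp(g @ lam)
    return w / w.sum()
```

Matching `orders=(1, 2, 3)` instead of `(1,)` is exactly the "higher moments" refinement the abstract argues for.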

【4】 Likelihood-Free Frequentist Inference: Bridging Classical Statistics and Machine Learning in Simulation and Uncertainty Quantification

Authors: Niccolò Dalmasso, David Zhao, Rafael Izbicki, Ann B. Lee
Affiliation: Department of Statistics and Data Science, Carnegie Mellon University; Federal University of Sao Carlos
Note: 49 pages, 12 figures, code available at this https URL
Link: https://arxiv.org/abs/2107.03920
Abstract: Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions for complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, outside the asymptotic and low-dimensional regimes. Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce reliable measures of uncertainty. In this paper, we present a statistical framework for LFI that unifies classical statistics with modern machine learning to: (1) construct frequentist confidence sets and hypothesis tests with finite-sample guarantees of nominal coverage (type I error control) and power, and (2) provide rigorous diagnostics for assessing empirical coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that estimates a test statistic, such as the likelihood ratio, can be plugged into our framework to create powerful tests and confidence sets with correct coverage. In this work, we specifically study two test statistics (ACORE and BFF), which, respectively, maximize versus integrate an odds function over the parameter space. Our theoretical and empirical results offer multifaceted perspectives on error sources and challenges in likelihood-free frequentist inference.

【5】 Moment-based density and risk estimation from grouped summary statistics

Authors: Philippe Lambert
Affiliation: Institut de Recherche en Sciences Sociales (IRSS), Méthodes Quantitatives en Sciences Sociales, Université de Liège, Place des Orateurs, B-, Liège, Belgium; Institut de Statistique, Biostatistique et Sciences Actuarielles (ISBA)
Link: https://arxiv.org/abs/2107.03883
Abstract: Data on a continuous variable are often summarized by means of histograms or displayed in tabular format: the range of data is partitioned into consecutive interval classes and the number of observations falling within each class is provided to the analyst. Computations can then be carried out in a nonparametric way by assuming a uniform distribution of the variable within each partitioning class, by concentrating all the observed values in the center, or by spreading them to the extremities. Smoothing methods can also be applied to estimate the underlying density or a parametric model can be fitted to these grouped data. For insurance loss data, some additional information is often provided about the observed values contained in each class, typically class-specific sample moments such as the mean, the variance or even the skewness and the kurtosis. The question is then how to include this additional information in the estimation procedure. The present paper proposes a method for performing density and quantile estimation based on such augmented information with an illustration on car insurance data.
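The baseline the paper improves on, a quantile computed from grouped counts under the uniform-within-class assumption, is a few lines of numpy (function name and interface are mine):

```python
import numpy as np

def grouped_quantile(breaks, counts, q):
    """Quantile from grouped counts, assuming observations are uniformly
    distributed within each class interval (linear interpolation of the
    empirical CDF between class boundaries)."""
    breaks = np.asarray(breaks, float)
    counts = np.asarray(counts, float)
    cum = np.concatenate([[0.0], np.cumsum(counts)]) / counts.sum()
    j = int(np.searchsorted(cum, q, side="right")) - 1
    j = min(j, len(counts) - 1)                 # q == 1 falls in the last class
    frac = (q - cum[j]) / (cum[j + 1] - cum[j])
    return float(breaks[j] + frac * (breaks[j + 1] - breaks[j]))
```

The paper's contribution is to sharpen such estimates by also using the class-specific sample moments.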

【6】 A Robust Approach to ARMA Factor Modeling

Authors: Lucia Falconi, Augusto Ferrante, Mattia Zorzi
Affiliation: Department of Information Engineering, University of Padova
Link: https://arxiv.org/abs/2107.03873
Abstract: This paper deals with the dynamic factor analysis problem for an ARMA process. To robustly estimate the number of factors, we construct a confidence region centered in a finite sample estimate of the underlying model which contains the true model with a prescribed probability. In this confidence region, the problem, formulated as a rank minimization of a suitable spectral density, is efficiently approximated via a trace norm convex relaxation. The latter is addressed by resorting to the Lagrange duality theory, which allows us to prove the existence of solutions. Finally, a numerical algorithm to solve the dual problem is presented. The effectiveness of the proposed estimator is assessed through simulation studies both with synthetic and real data.
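The trace (nuclear) norm relaxation replaces the nonconvex rank objective with the sum of singular values; its proximal operator is singular value soft-thresholding, which is the workhorse of generic trace-norm solvers. A generic numpy sketch (not the paper's dual algorithm):

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of tau * ||.||_* (trace/nuclear norm):
    soft-threshold the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

Shrinking small singular values to zero is what lets the convex relaxation act as a surrogate for rank minimization.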

【7】 Benchpress: a scalable and platform-independent workflow for benchmarking structure learning algorithms for graphical models

Authors: Felix L. Rios, Giusi Moffa, Jack Kuipers
Affiliation: Department of Mathematics and Computer Science, University of Basel, Spiegelgasse, Basel, Switzerland; Department of Biosystems Science and Engineering, ETH Zürich, Mattenstrasse, Basel, Switzerland
Note: 30 pages, 1 figure
Link: https://arxiv.org/abs/2107.03863
Abstract: Describing the relationship between the variables in a study domain and modelling the data generating mechanism is a fundamental problem in many empirical sciences. Probabilistic graphical models are one common approach to tackle the problem. Learning the graphical structure is computationally challenging and a fervent area of current research with a plethora of algorithms being developed. To facilitate the benchmarking of different methods, we present a novel automated workflow, called benchpress for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms for probabilistic graphical models. Benchpress is interfaced via a simple JSON-file, which makes it accessible for all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. Benchpress currently provides an interface to a large number of state-of-the-art algorithms from libraries such as BiDAG, bnlearn, GOBNILP, pcalg, r.blip, scikit-learn, TETRAD, and trilearn as well as a variety of methods for data generating models and performance evaluation. Alongside user-defined models and randomly generated datasets, the software tool also includes a number of standard datasets and graphical models from the literature, which may be included in a benchmarking workflow. We demonstrate the applicability of this workflow for learning Bayesian networks in four typical data scenarios. The source code and documentation are publicly available from http://github.com/felixleopoldo/benchpress.

【8】 Inadmissibility Results for the Selected Hazard Rates

Authors: Brijesh Kumar Jha, Ajaya Kumar Mahapatra, Suchandan Kayal
Affiliation: a. Department of Mathematics, Siksha 'O' Anusandhan University, Bhubaneswar; b. Centre for Applied Mathematics and Computing, Siksha 'O' Anusandhan University, Bhubaneswar, India
Link: https://arxiv.org/abs/2107.03848
Abstract: Let us consider $k ~(\ge 2)$ independent populations $\Pi_1, \ldots,\Pi_k$, where $\Pi_i$ follows exponential distribution with hazard rate ${\sigma_i},$ ($i = 1,\ldots,k$). Let $Y_{i1},\ldots, Y_{in}$ be a random sample of size $n$ drawn from the $i$th population $\Pi_i$, where $i = 1,\ldots,k.$ For $i = 1,\ldots,k$, consider $Y_i=\sum_{j=1}^nY_{ij}$. The natural selection rule is to select a population associated with the largest sample mean. That is, $\Pi_i$, ($i = 1,\ldots,k$) is selected if $Y_i=\max(Y_1,\ldots,Y_k)$. Based on this selection rule, a population is chosen. Then, we consider the estimation of the hazard rate of the selected population with respect to the entropy loss function. Some natural estimators are proposed. The minimaxity of a natural estimator is established. Estimators improving upon the natural estimators are derived. Finally, a numerical study is carried out in order to compare the proposed estimators in terms of the risk values.
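The selection rule plus one natural estimator is easy to state in code: pick the population with the largest total $Y_i$, and estimate its hazard rate by the exponential MLE $n/Y_i$ computed from the selected sample. A sketch under my own naming (the paper's improved estimators shrink this further):

```python
import numpy as np

def select_and_estimate(Y_sums, n):
    """Select the population with the largest total Y_i (equivalently the
    largest sample mean) and return the natural hazard-rate estimate n / Y_i
    for the selected exponential population."""
    Y_sums = np.asarray(Y_sums, float)
    i = int(np.argmax(Y_sums))
    return i, n / Y_sums[i]
```

The selection step is what introduces the bias that makes the natural estimator inadmissible.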

【9】 Consistency of the Maximal Information Coefficient Estimator

Authors: John Lazarsfeld, Aaron Johnson
Affiliation: Yale University; U.S. Naval Research Laboratory
Link: https://arxiv.org/abs/2107.03836
Abstract: The Maximal Information Coefficient (MIC) of Reshef et al. (Science, 2011) is a statistic for measuring dependence between variable pairs in large datasets. In this note, we prove that MIC is a consistent estimator of the corresponding population statistic MIC$_*$. This corrects an error in an argument of Reshef et al. (JMLR, 2016), which we describe.
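One cell of the MIC characteristic matrix, mutual information on a fixed grid normalized by $\log\min(n_x, n_y)$, can be sketched directly; MIC itself then maximizes over grids of bounded size, which this sketch does not do:

```python
import numpy as np

def grid_normalized_mi(x, y, nx, ny):
    """Mutual information of (x, y) discretized on an nx-by-ny quantile grid,
    normalized by log(min(nx, ny)): one entry of the MIC characteristic
    matrix. MIC maximizes this over grids; this computes a single cell."""
    ex = np.quantile(x, np.linspace(0, 1, nx + 1))
    ey = np.quantile(y, np.linspace(0, 1, ny + 1))
    counts, _, _ = np.histogram2d(x, y, bins=[ex, ey])
    p = counts / counts.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / np.outer(px, py)[nz]))
    return float(mi / np.log(min(nx, ny)))
```

A perfectly dependent pair attains the maximal value 1 on a matched grid, which is the normalization the consistency result concerns.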

【10】 Asymptotic normality of robust M-estimators with convex penalty

Authors: Pierre C. Bellec, Yiwei Shen, Cun-Hui Zhang
Affiliation: Department of Statistics, Rutgers University
Link: https://arxiv.org/abs/2107.03826
Abstract: This paper develops asymptotic normality results for individual coordinates of robust M-estimators with convex penalty in high-dimensions, where the dimension $p$ is at most of the same order as the sample size $n$, i.e, $p/n\le\gamma$ for some fixed constant $\gamma>0$. The asymptotic normality requires a bias correction and holds for most coordinates of the M-estimator for a large class of loss functions including the Huber loss and its smoothed versions regularized with a strongly convex penalty. The asymptotic variance that characterizes the width of the resulting confidence intervals is estimated with data-driven quantities. This estimate of the variance adapts automatically to low ($p/n\to0)$ or high ($p/n \le \gamma$) dimensions and does not involve the proximal operators seen in previous works on asymptotic normality of M-estimators. For the Huber loss, the estimated variance has a simple expression involving an effective degrees-of-freedom as well as an effective sample size. The case of the Huber loss with Elastic-Net penalty is studied in detail and a simulation study confirms the theoretical findings. The asymptotic normality results follow from Stein formulae for high-dimensional random vectors on the sphere developed in the paper, which are of independent interest.
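To see what the Huber loss does, here is a toy one-dimensional M-estimate of location via iteratively reweighted least squares; residuals beyond the threshold get weight $\delta/|r|$, so outliers are downweighted. This is an illustrative sketch, not the paper's high-dimensional penalized estimator:

```python
import numpy as np

def huber_location(y, delta=1.345, iters=100):
    """Huber M-estimate of location via IRLS: the weight min(1, delta/|r|)
    is quadratic-loss weight 1 inside [-delta, delta] and decays outside,
    downweighting outliers."""
    y = np.asarray(y, float)
    mu = float(np.median(y))                  # robust starting point
    for _ in range(iters):
        r = y - mu
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
        mu = float(np.sum(w * y) / np.sum(w))
    return mu
```

With one gross outlier the estimate stays near the bulk of the data, unlike the sample mean.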

【11】 Federated Learning as a Mean-Field Game

Authors: Arash Mehrjou
Affiliation: ETH Zürich & Max Planck Institute for Intelligent Systems
Link: https://arxiv.org/abs/2107.03770
Abstract: We establish a connection between federated learning, a concept from machine learning, and mean-field games, a concept from game theory and control theory. In this analogy, the local federated learners are considered as the players and the aggregation of the gradients in a central server is the mean-field effect. We present federated learning as a differential game and discuss the properties of the equilibrium of this game. We hope this novel view to federated learning brings together researchers from these two distinct areas to work on fundamental problems of large-scale distributed and privacy-preserving learning algorithms.

【12】 Encoding Domain Information with Sparse Priors for Inferring Explainable Latent Variables

Authors: Arber Qoku, Florian Buettner
Affiliation: German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Germany; Goethe University Frankfurt
Note: 5 pages, 6 figures, Joint KDD 2021 Health Day and 2021 KDD Workshop on Applied Data Science for Healthcare
Link: https://arxiv.org/abs/2107.03730
Abstract: Latent variable models are powerful statistical tools that can uncover relevant variation between patients or cells, by inferring unobserved hidden states from observable high-dimensional data. A major shortcoming of current methods, however, is their inability to learn sparse and interpretable hidden states. Additionally, in settings where partial knowledge on the latent structure of the data is readily available, a statistically sound integration of prior information into current methods is challenging. To address these issues, we propose spex-LVM, a factorial latent variable model with sparse priors to encourage the inference of explainable factors driven by domain-relevant information. spex-LVM utilizes existing knowledge of curated biomedical pathways to automatically assign annotated attributes to latent factors, yielding interpretable results tailored to the corresponding domain of interest. Evaluations on simulated and real single-cell RNA-seq datasets demonstrate that our model robustly identifies relevant structure in an inherently explainable manner, distinguishes technical noise from sources of biomedical variation, and provides dataset-specific adaptations of existing pathway annotations. Implementation is available at https://github.com/MLO-lab/spexlvm.

【13】 A Bayesian reanalysis of the phase III aducanumab (ADU) trial

Authors: Tommaso Costa, Franco Cauda
Affiliation: GCS-fMRI, Koelliker Hospital and Department of Psychology, University of Turin, Turin, Italy; FOCUS Lab, Department of Psychology, University of Turin, Turin, Italy
Link: https://arxiv.org/abs/2107.03686
Abstract: In this article we have conducted a reanalysis of the phase III aducanumab (ADU) summary statistics announced by Biogen, in particular the result of the Clinical Dementia Rating-Sum of Boxes (CDR-SB). The results showed that the evidence on the efficacy of the drug is very low, and a clearer view of the results of the clinical trials is presented in the Bayesian framework that can be useful for future development and research in the field.

【14】 Assigning Topics to Documents by Successive Projections

Authors: Olga Klopp, Maxim Panov, Suzanne Sigalla, Alexandre Tsybakov
Affiliation: ESSEC Business School, France; Skoltech, Russia; CREST, ENSAE, France
Link: https://arxiv.org/abs/2107.03684
Abstract: Topic models provide a useful tool to organize and understand the structure of large corpora of text documents, in particular, to discover hidden thematic structure. Clustering documents from big unstructured corpora into topics is an important task in various areas, such as image analysis, e-commerce, social networks, population genetics. A common approach to topic modeling is to associate each topic with a probability distribution on the dictionary of words and to consider each document as a mixture of topics. Since the number of topics is typically substantially smaller than the size of the corpus and of the dictionary, the methods of topic modeling can lead to a dramatic dimension reduction. In this paper, we study the problem of estimating topics distribution for each document in the given corpus, that is, we focus on the clustering aspect of the problem. We introduce an algorithm that we call Successive Projection Overlapping Clustering (SPOC) inspired by the Successive Projection Algorithm for separable matrix factorization. This algorithm is simple to implement and computationally fast. We establish theoretical guarantees on the performance of the SPOC algorithm, in particular, near matching minimax upper and lower bounds on its estimation risk. We also propose a new method that estimates the number of topics. We complement our theoretical results with a numerical study on synthetic and semi-synthetic data to analyze the performance of this new algorithm in practice. One of the conclusions is that the error of the algorithm grows at most logarithmically with the size of the dictionary, in contrast to what one observes for Latent Dirichlet Allocation.
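The Successive Projection Algorithm that inspires SPOC is a short greedy loop: repeatedly pick the row of largest norm as a vertex and project all rows onto the orthogonal complement of its direction. A minimal numpy sketch of that building block (not SPOC itself, which adds the overlapping-clustering machinery):

```python
import numpy as np

def successive_projection(X, r):
    """Successive Projection Algorithm: greedily pick r rows of X as
    approximate vertices (extreme points) of the data's convex hull.
    After each pick, project the rows onto the orthogonal complement of
    the chosen direction so the next pick is a new vertex."""
    R = np.asarray(X, float).copy()
    idx = []
    for _ in range(r):
        j = int(np.argmax((R ** 2).sum(axis=1)))
        idx.append(j)
        u = R[j] / np.linalg.norm(R[j])      # direction of the chosen vertex
        R = R - np.outer(R @ u, u)           # deflate that direction
    return idx
```

Mixture rows (interior points) are never selected, since their projected norm is always dominated by some vertex.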

【15】 Inference and forecasting for continuous-time integer-valued trawl processes and their use in financial economics

Authors: Mikkel Bennedsen, Asger Lunde, Neil Shephard, Almut E. D. Veraart
Link: https://arxiv.org/abs/2107.03674
Abstract: This paper develops likelihood-based methods for estimation, inference, model selection, and forecasting of continuous-time integer-valued trawl processes. The full likelihood of integer-valued trawl processes is, in general, highly intractable, motivating the use of composite likelihood methods, where we consider the pairwise likelihood in lieu of the full likelihood. Maximizing the pairwise likelihood of the data yields an estimator of the parameter vector of the model, and we prove consistency and asymptotic normality of this estimator. The same methods allow us to develop probabilistic forecasting methods, which can be used to construct the predictive distribution of integer-valued time series. In a simulation study, we document good finite sample performance of the likelihood-based estimator and the associated model selection procedure. Lastly, the methods are illustrated in an application to modelling and forecasting financial bid-ask spread data, where we find that it is beneficial to carefully model both the marginal distribution and the autocorrelation structure of the data. We argue that integer-valued trawl processes are especially well-suited in such situations.
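The pairwise likelihood construction itself is generic: sum the log of the bivariate density over nearby pairs of observations instead of evaluating the intractable joint. A skeleton in plain Python, with the model-specific pair density left as a callback (names and the lag cutoff are mine):

```python
def pairwise_loglik(y, pair_logpmf, max_lag=2):
    """Composite (pairwise) log-likelihood: sum of log p(y_s, y_t; lag = t-s)
    over all pairs of observations at most max_lag apart. The bivariate
    log-pmf pair_logpmf(a, b, lag) is supplied by the model."""
    n = len(y)
    return sum(pair_logpmf(y[s], y[t], t - s)
               for s in range(n)
               for t in range(s + 1, min(s + max_lag + 1, n)))
```

Maximizing this over the model parameters gives the composite-likelihood estimator whose consistency and asymptotic normality the paper establishes.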

【16】 Traffic prediction at signalised intersections using Integrated Nested Laplace Approximation

Authors: D. Townsend, C. Nel
Link: https://arxiv.org/abs/2107.03617
Abstract: A Bayesian approach to predicting traffic flows at signalised intersections is considered using the INLA framework. INLA is a deterministic, computationally efficient alternative to MCMC for estimating a posterior distribution. It is designed for latent Gaussian models where the parameters follow a joint Gaussian distribution. An assumption which naturally evolves from an LGM is that of a Gaussian Markov Random Field (GMRF). It can be shown that a traffic prediction model based in both space and time satisfies this assumption, and as such the INLA algorithm provides accurate prediction when space, time, and other relevant covariates are included in the model.
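The GMRF assumption means the latent field has a sparse precision matrix encoding its conditional-independence (Markov) structure; that sparsity is what INLA exploits computationally. The textbook example is a stationary AR(1) process, whose precision matrix is tridiagonal (this sketch is generic background, not the paper's traffic model):

```python
import numpy as np

def ar1_precision(n, phi, sigma2=1.0):
    """Tridiagonal precision matrix of a stationary AR(1) process
    y_t = phi * y_{t-1} + eps_t, eps_t ~ N(0, sigma2): the canonical
    Gaussian Markov random field. Zeros in Q beyond the first off-diagonal
    encode conditional independence given the neighbours."""
    Q = np.zeros((n, n))
    i = np.arange(n)
    Q[i, i] = 1.0 + phi ** 2
    Q[0, 0] = Q[-1, -1] = 1.0            # boundary corrections
    Q[i[:-1], i[:-1] + 1] = -phi
    Q[i[:-1] + 1, i[:-1]] = -phi
    return Q / sigma2
```

Inverting this sparse Q recovers the dense AR(1) covariance $\phi^{|i-j|}\sigma^2/(1-\phi^2)$, which illustrates why working with precisions is so much cheaper.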

【17】 Causal Structural Learning Via Local Graphs

Authors: Wenyu Chen, Mathias Drton, Ali Shojaie
Affiliation: Department of Statistics, University of Washington, Seattle, WA, USA; Department of Mathematics, Technical University of Munich, München, Germany; Department of Biostatistics, University of Washington, Seattle, WA, USA
Link: https://arxiv.org/abs/2107.03597
Abstract: We consider the problem of learning causal structures in sparse high-dimensional settings that may be subject to the presence of (potentially many) unmeasured confounders, as well as selection bias. Based on the structure found in common families of large random networks and examining the representation of local structures in linear structural equation models (SEM), we propose a new local notion of sparsity for consistent structure learning in the presence of latent and selection variables, and develop a new version of the Fast Causal Inference (FCI) algorithm with reduced computational and sample complexity, which we refer to as local FCI (lFCI). The new notion of sparsity allows the presence of highly connected hub nodes, which are common in real-world networks, but problematic for existing methods. Our numerical experiments indicate that the lFCI algorithm achieves state-of-the-art performance across many classes of large random networks, and its performance is superior to that of existing methods for networks containing hub nodes.

【18】 Evaluating Sensitivity to the Stick-Breaking Prior in Bayesian Nonparametrics

Authors: Ryan Giordano, Runjing Liu, Tamara Broderick, Michael I. Jordan
Note: 65 pages, 20 figures
Link: https://arxiv.org/abs/2107.03584
Abstract: Bayesian models based on the Dirichlet process and other stick-breaking priors have been proposed as core ingredients for clustering, topic modeling, and other unsupervised learning tasks. Prior specification is, however, relatively difficult for such models, given that their flexibility implies that the consequences of prior choices are often relatively opaque. Moreover, these choices can have a substantial effect on posterior inferences. Thus, considerations of robustness need to go hand in hand with nonparametric modeling. In the current paper, we tackle this challenge by exploiting the fact that variational Bayesian methods, in addition to having computational advantages in fitting complex nonparametric models, also yield sensitivities with respect to parametric and nonparametric aspects of Bayesian models. In particular, we demonstrate how to assess the sensitivity of conclusions to the choice of concentration parameter and stick-breaking distribution for inferences under Dirichlet process mixtures and related mixture models. We provide both theoretical and empirical support for our variational approach to Bayesian sensitivity analysis.
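The stick-breaking construction at the heart of these priors maps proportions $\nu_k \in (0,1)$ (Beta draws, for the Dirichlet process) to mixture weights $w_k = \nu_k \prod_{j<k}(1-\nu_j)$. A minimal numpy sketch of that map:

```python
import numpy as np

def stick_breaking(nu):
    """Stick-breaking weights from proportions nu_k in (0, 1]:
    w_k = nu_k * prod_{j<k} (1 - nu_j). Break off fraction nu_k of the
    remaining stick at each step."""
    nu = np.asarray(nu, float)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - nu[:-1])])
    return nu * remaining
```

The paper's sensitivity analysis asks how posterior conclusions change when the distribution generating the $\nu_k$ (and its concentration parameter) is perturbed.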

【19】 The Micro-Randomized Trial for Developing Digital Interventions: Experimental Design and Data Analysis Considerations 标题:发展数字干预的微随机试验:实验设计和数据分析考虑

作者:Tianchen Qian,Ashley E. Walton,Linda M. Collins,Predrag Klasnja,Stephanie T. Lanza,Inbal Nahum-Shani,Mashifiqui Rabbi,Michael A. Russell,Maureen A. Walton,Hyesun Yoo,Susan A. Murphy 机构:Mashfiqui Rabbi, Harvard University 备注:arXiv admin note: substantial text overlap with arXiv:2005.05880, arXiv:2004.10241 链接:https://arxiv.org/abs/2107.03544 摘要:即时适应性干预(JITAI)是一种时变的适应性干预,它利用频繁的机会对干预进行调整——每周、每天甚至一天多次。微随机试验(MRT)已经出现,用于告知JITAIs的建设。mrt可以用来解决JITAI组件是否有效以及在什么情况下有效的研究问题,最终目标是开发高效的JITAI。本文的目的是阐明为什么,何时,如何使用捷运;强调设计和实施地铁时必须考虑的因素;回顾mrt的一次和二次分析方法。我们简要回顾了JITAIs的关键要素,并讨论了规划和设计地铁时的各种考虑因素。我们提供了一个因果漂移效应的定义,适用于MRT数据的一次和二次分析,以通知JITAI的发展。我们回顾了加权和中心最小二乘(WCLS)估计,它提供了一致的因果漂移效应估计MRT数据。我们描述了如何使用标准统计软件(如R Core Team,2019)获得WCLS估计器以及相关的测试统计数据。自始至终,我们使用HeartSteps MRT说明了MRT的设计和分析,以开发一种JITAI,增加久坐者的体力活动。我们用另外两个MRT(SARA和BariFit)来补充HeartSteps MRT,每个MRT都强调了可以使用MRT和可能出现的实验设计考虑来解决的不同研究问题。 摘要:Just-in-time adaptive interventions (JITAIs) are time-varying adaptive interventions that use frequent opportunities for the intervention to be adapted--weekly, daily, or even many times a day. The micro-randomized trial (MRT) has emerged for use in informing the construction of JITAIs. MRTs can be used to address research questions about whether and under what circumstances JITAI components are effective, with the ultimate objective of developing effective and efficient JITAI. The purpose of this article is to clarify why, when, and how to use MRTs; to highlight elements that must be considered when designing and implementing an MRT; and to review primary and secondary analyses methods for MRTs. We briefly review key elements of JITAIs and discuss a variety of considerations that go into planning and designing an MRT. We provide a definition of causal excursion effects suitable for use in primary and secondary analyses of MRT data to inform JITAI development. We review the weighted and centered least-squares (WCLS) estimator which provides consistent causal excursion effect estimators from MRT data. 
We describe how the WCLS estimator along with associated test statistics can be obtained using standard statistical software such as R (R Core Team, 2019). Throughout we illustrate the MRT design and analyses using the HeartSteps MRT, for developing a JITAI to increase physical activity among sedentary individuals. We supplement the HeartSteps MRT with two other MRTs, SARA and BariFit, each of which highlights different research questions that can be addressed using the MRT and experimental design considerations that might arise.
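WCLS估计量的核心思想之一是对处理指示变量按随机化概率做中心化。下面给出一个极简的numpy示意(并非原文方法或其R实现;假设随机化概率恒定且效应恒定,此时权重退化为常数,对中心化处理指示变量做最小二乘即可一致估计游弋效应;数据生成过程与变量名均为示例假设):

```python
import numpy as np

def centered_effect_estimate(y, a, p):
    """Estimate a (constant) causal excursion effect by least squares
    with a centered treatment indicator, in the spirit of WCLS when the
    randomization probability p is constant (weights are then equal)."""
    x = np.column_stack([np.ones_like(y), a - p])
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    return beta[1]  # coefficient on the centered treatment indicator

rng = np.random.default_rng(0)
n, p_rand, true_effect = 5000, 0.5, 1.0
a = rng.binomial(1, p_rand, size=n)                    # micro-randomized treatments
y = 0.5 + true_effect * a + rng.normal(0, 1, size=n)   # proximal outcomes
est = centered_effect_estimate(y, a, p_rand)
print(round(est, 2))
```

在这一简化设定下,估计值应接近设定的真实效应1.0。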

【20】 The folded concave Laplacian spectral penalty learns block diagonal sparsity patterns with the strong oracle property 标题:折叠凹拉普拉斯谱罚学习块对角稀疏模式并具有强Oracle性质

作者:Iain Carmichael 备注:First draft! 60 pages, 6 figures 链接:https://arxiv.org/abs/2107.03494 摘要:结构化稀疏性是现代统计工具箱的重要组成部分。如果模型参数集的元素可以看作是一个具有多个连通分量的图的边,那么我们称它(在置换意义下)具有块对角稀疏性。例如,具有K个变量块的块对角相关矩阵对应于具有K个连通分量的图,其节点是变量,其边是相关。这种稀疏性捕获模型参数的簇。为了学习块对角稀疏模式,我们发展了折叠凹拉普拉斯谱惩罚,并为由此产生的非凸问题提供了一个主控-最小化(majorization-minimization)算法。我们证明,即使在高维环境下,该算法也能以高概率在两步后收敛到oracle估计。然后在协方差估计、线性回归和logistic回归等几个经典问题中证明了该理论。 摘要:Structured sparsity is an important part of the modern statistical toolkit. We say a set of model parameters has block diagonal sparsity up to permutations if its elements can be viewed as the edges of a graph that has multiple connected components. For example, a block diagonal correlation matrix with K blocks of variables corresponds to a graph with K connected components whose nodes are the variables and whose edges are the correlations. This type of sparsity captures clusters of model parameters. To learn block diagonal sparsity patterns we develop the folded concave Laplacian spectral penalty and provide a majorization-minimization algorithm for the resulting non-convex problem. We show this algorithm has the appealing computational and statistical guarantee of converging to the oracle estimator after two steps with high probability, even in high-dimensional settings. The theory is then demonstrated in several classical problems including covariance estimation, linear regression, and logistic regression.
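下面用一个小例子说明"块对角稀疏=图的多个连通分量"这一对应关系(仅为概念示意,并非论文的折叠凹拉普拉斯谱惩罚本身;依赖scipy):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# A correlation matrix with two blocks of variables: {0,1,2} and {3,4}.
R = np.eye(5)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    R[i, j] = R[j, i] = 0.7

# Edges of the graph are the nonzero correlations; block diagonal
# sparsity corresponds to the graph having multiple connected components.
adj = csr_matrix(np.abs(R) > 1e-8)
n_blocks, labels = connected_components(adj, directed=False)
print(n_blocks, labels)
```

此例中算法应报告2个连通分量,分别对应两个变量块。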

【21】 Uncertainty in Ranking 标题:排名的不确定性

作者:Justin Rising 机构:Kessel Run, Boston, MA 链接:https://arxiv.org/abs/2107.03459 摘要:根据数据估计的秩是不确定的,这在许多应用中是一个挑战。测量样本秩中不确定度的必要性已经被认识了一段时间,但以往考虑这一问题的文献主要集中在测量单个秩的不确定度,而不是测量联合不确定度。我们用偏序线性延拓的形式刻画了参数不确定性和秩不确定性之间的关系,并利用这一特征提出了样本秩中联合不确定性的度量。我们为几个感兴趣的问题提供了有效的算法,并推导了个体排名的有效同时置信区间。我们将我们的方法应用于模拟数据和真实数据,并通过R软件包rankUncertainty提供实现。 摘要:Ranks estimated from data are uncertain and this poses a challenge in many applications. The need to measure the uncertainty in sample ranks has been recognized for some time, but the previous literature considering this problem has been concentrated on measuring the uncertainty of individual ranks and not the joint uncertainty. We characterize the relationship between parameter uncertainty and rank uncertainty in terms of linear extensions of a partial order and use this characterization to propose a measure of the joint uncertainty in a sample ranking. We provide efficient algorithms for several questions of interest and also derive valid simultaneous confidence intervals for the individual ranks. We apply our methods to both simulated and real data and make them available through the R package rankUncertainty.
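作为背景,下面给出一个朴素的bootstrap排名区间示意(并非论文基于偏序线性延拓的方法,只用来说明"排名本身带有不确定性"这一问题;数据与参数均为示例假设):

```python
import numpy as np

def bootstrap_rank_intervals(x, se, n_boot=2000, level=0.95, seed=1):
    """Naive percentile bootstrap intervals for the ranks of K estimates
    x with standard errors se (rank 1 = smallest). A simplified sketch,
    not the partial-order-based method of the paper."""
    rng = np.random.default_rng(seed)
    K = len(x)
    ranks = np.empty((n_boot, K), dtype=int)
    for b in range(n_boot):
        draw = rng.normal(x, se)                 # perturb the estimates
        ranks[b] = draw.argsort().argsort() + 1  # rank of each entry
    lo = np.quantile(ranks, (1 - level) / 2, axis=0)
    hi = np.quantile(ranks, 1 - (1 - level) / 2, axis=0)
    return lo, hi

x = np.array([0.0, 0.1, 2.0])    # third estimate is clearly largest
se = np.array([0.1, 0.1, 0.1])
lo, hi = bootstrap_rank_intervals(x, se)
print(lo, hi)
```

前两个估计量相互接近,其秩区间会重叠;第三个远大于其余,秩区间应收缩为单点{3}。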

【22】 Model Selection for Generic Contextual Bandits 标题:通用上下文Bandits的模型选择

作者:Avishek Ghosh,Abishek Sankararaman,Kannan Ramchandran 机构:Department of Electrical Eng. and Computer Science, UC Berkeley, AWS AI Labs, Palo Alto, USA, Editor: 备注:40 pages, 5 figures. arXiv admin note: text overlap with arXiv:2006.02612 链接:https://arxiv.org/abs/2107.03455 摘要:在可实现性假设下,我们考虑一般随机上下文bandit的模型选择问题。我们提出了一种基于连续求精的自适应上下文Bandit({\ttfamily ACB})算法,该算法分阶段工作,连续地消除过于简单而无法拟合给定实例的模型类。我们证明了该算法是自适应的,即遗憾率在阶数上与{\ttfamily FALCON}相匹配;后者是Levi等人(2020)提出的最新上下文bandit算法,但需要知道真实模型类。不知道正确的模型类所付出的代价只是一个累加项,贡献于遗憾界中的二阶项。这种代价具有直观的特性:随着模型类变得更容易识别,它会变得更小,反之亦然。然后,我们证明了一个更简单的explore-then-commit(ETC)风格的算法在不知道真实模型类的情况下,也能获得与{\ttfamily FALCON}相匹配的遗憾率。然而,正如预期的那样,与{\ttfamily ACB}相比,ETC的模型选择成本更高。此外,将{\ttfamily ACB}应用于具有未知稀疏性的线性bandit设置,在阶数意义上恢复了先前由针对线性设置的算法建立的模型选择保证。 摘要:We consider the problem of model selection for the general stochastic contextual bandits under the realizability assumption. We propose a successive refinement based algorithm called Adaptive Contextual Bandit ({\ttfamily ACB}), that works in phases and successively eliminates model classes that are too simple to fit the given instance. We prove that this algorithm is adaptive, i.e., the regret rate order-wise matches that of {\ttfamily FALCON}, the state-of-art contextual bandit algorithm of Levi et al. '20, that needs knowledge of the true model class. The price of not knowing the correct model class is only an additive term contributing to the second order term in the regret bound. This cost possesses the intuitive property that it becomes smaller as the model class becomes easier to identify, and vice-versa. We then show that a much simpler explore-then-commit (ETC) style algorithm also obtains a regret rate matching that of {\ttfamily FALCON}, despite not knowing the true model class. However, the cost of model selection is higher in ETC as opposed to in {\ttfamily ACB}, as expected.
Furthermore, {\ttfamily ACB} applied to the linear bandit setting with unknown sparsity, order-wise recovers the model selection guarantees previously established by algorithms tailored to the linear setting.
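摘要中提到的explore-then-commit(ETC)思想可以用一个极简的两臂Bernoulli bandit来示意(仅为概念示例,并非论文的模型选择算法):

```python
import numpy as np

def explore_then_commit(means, m, horizon, seed=0):
    """Explore-then-commit for a K-armed Bernoulli bandit: pull each arm
    m times, then commit to the empirically best arm for the rest."""
    rng = np.random.default_rng(seed)
    K = len(means)
    rewards = [rng.binomial(1, means[k], size=m) for k in range(K)]
    best = int(np.argmax([r.mean() for r in rewards]))
    total = sum(r.sum() for r in rewards)
    total += rng.binomial(1, means[best], size=horizon - m * K).sum()
    return best, total

best, total = explore_then_commit([0.1, 0.9], m=100, horizon=10_000)
print(best, total)
```

在两臂均值相差较大时,探索阶段几乎必然识别出更优的第二个臂(索引1)。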

【23】 Identifying optimally cost-effective dynamic treatment regimes with a Q-learning approach 标题:用Q-学习方法确定最优成本效益动态治疗方案

作者:Nicholas Illenberger,Andrew J. Spieker,Nandita Mitra 机构:University of Pennsylvania, U.S.A, Vanderbilt University Medical Center, U.S.A 备注:18 pages, 2 tables 链接:https://arxiv.org/abs/2107.03441 摘要:有关患者治疗策略的卫生政策决策需要同时考虑治疗效果和成本。只针对有效性优化治疗规则可能会导致过于昂贵的策略;另一方面,只针对成本优化可能会导致较差的患者结局。我们提出了一个两步的方法来确定一个最优成本效益且可解释的动态治疗方案。首先,我们发展了一种结合Q-学习和策略搜索的方法,在期望治疗费用约束下估计基于列表的最优治疗方案。其次,我们提出了一个迭代过程,从一组对应于不同成本约束的候选方案中选择一个成本效益最优的方案。我们的方法可以在存在时变混杂和相关结局等常见挑战的情况下估计最优方案。通过模拟研究,我们验证了所估计的最优治疗方案的有效性,并考察了灵活建模方法下的运行特性。利用一个癌症观察数据库的数据,我们应用该方法评估了为子宫内膜癌患者分配辅助放疗和化疗的最优成本效益治疗策略。 摘要:Health policy decisions regarding patient treatment strategies require consideration of both treatment effectiveness and cost. Optimizing treatment rules with respect to effectiveness may result in prohibitively expensive strategies; on the other hand, optimizing with respect to costs may result in poor patient outcomes. We propose a two-step approach for identifying an optimally cost-effective and interpretable dynamic treatment regime. First, we develop a combined Q-learning and policy-search approach to estimate an optimal list-based regime under a constraint on expected treatment costs. Second, we propose an iterative procedure to select an optimally cost-effective regime from a set of candidate regimes corresponding to different cost constraints. Our approach can estimate optimal regimes in the presence of commonly encountered challenges including time-varying confounding and correlated outcomes. Through simulation studies, we illustrate the validity of estimated optimal treatment regimes and examine operating characteristics under flexible modeling approaches. Using data from an observational cancer database, we apply our methodology to evaluate optimally cost-effective treatment strategies for assigning adjuvant radiation and chemotherapy to endometrial cancer patients.
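两阶段Q-学习的后向归纳可以用一个简化的线性Q函数来示意(不含论文的成本约束与策略搜索部分;数据生成过程与模型形式均为示例假设):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
s1 = rng.normal(size=n)                      # stage-1 state
a1 = rng.binomial(1, 0.5, size=n)            # stage-1 treatment
s2 = s1 + a1 + rng.normal(size=n)            # stage-2 state
a2 = rng.binomial(1, 0.5, size=n)            # stage-2 treatment
y = s1 * a1 + s2 * a2 + rng.normal(size=n)   # final outcome

def fit_q(features, target):
    """Least-squares fit of a linear Q-function; returns coefficients."""
    X = np.column_stack([np.ones(len(target))] + features)
    return np.linalg.lstsq(X, target, rcond=None)[0]

# Stage 2: fit Q2 on the full history, then take the optimal action.
b2 = fit_q([s1, a1, s1 * a1, s2, a2, s2 * a2], y)
gain2 = b2[5] + b2[6] * s2                   # advantage of choosing a2 = 1
v2 = (b2[0] + b2[1] * s1 + b2[2] * a1 + b2[3] * s1 * a1 + b2[4] * s2
      + np.maximum(0.0, gain2))              # value under the optimal a2
# Stage 1: fit Q1 to the stage-2 optimal value (backward induction).
b1 = fit_q([s1, a1, s1 * a1], v2)
rule_a1 = (b1[2] + b1[3] * s1 > 0).astype(int)  # estimated stage-1 rule
print(round(b2[6], 2), round(b1[3], 2))
```

在该模拟中,第二阶段交互项系数应接近真实值1,且学到的第一阶段规则在状态为正时倾向于给予治疗。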

【24】 Bayesian model-based clustering for multiple network data 标题:基于贝叶斯模型的多网络数据聚类

作者:Anastasia Mantziou,Simon Lunagomez,Robin Mitra 机构:Sim´on Lunag´omez, Mathematics and Statistics, Lancaster University, School of Mathematics, Cardiff University 链接:https://arxiv.org/abs/2107.03431 摘要:人们对分析多种网络数据的兴趣与日俱增。这与分析传统的数据集不同,现在数据中的每个观测值都由一个网络组成。最近的技术进步允许在一系列不同的应用中收集这类数据。这激发了研究人员开发统计模型,最准确地描述产生网络人口的概率机制,并利用这一机制来推断网络数据的底层结构。只有很少的研究发展到目前为止考虑的异质性,可以存在于网络人口。我们提出了一种混合的测量误差模型,用于识别网络群体中的网络簇,该模型考虑到网络节点之间的连接模式中检测到的相似性。大量的仿真研究表明,该模型在聚类多个网络数据和推断模型参数方面都有很好的效果。我们进一步将我们的模型应用于两个来自计算机(人类跟踪系统)和神经科学领域的真实世界多网络数据集。 摘要:There is increasing appetite for analysing multiple network data. This is different to analysing traditional data sets, where now each observation in the data comprises a network. Recent technological advancements have allowed the collection of this type of data in a range of different applications. This has inspired researchers to develop statistical models that most accurately describe the probabilistic mechanism that generates a network population and use this to make inferences about the underlying structure of the network data. Only a few studies developed to date consider the heterogeneity that can exist in a network population. We propose a Mixture of Measurement Error Models for identifying clusters of networks in a network population, with respect to similarities detected in the connectivity patterns among the networks' nodes. Extensive simulation studies show our model performs well in both clustering multiple network data and inferring the model parameters. We further apply our model on two real world multiple network data sets resulting from the fields of Computing (Human Tracking Systems) and Neuroscience.

【25】 ENNS: Variable Selection, Regression, Classification and Deep Neural Network for High-Dimensional Data 标题:ENNS:高维数据的变量选择、回归、分类和深度神经网络

作者:Kaixu Yang,Tapabrata Maiti 机构: Red Cedar Rd., Department of Statistics and Probability, Michigan State University, East Lansing, MI 链接:https://arxiv.org/abs/2107.03430 摘要:高维、低样本(HDLSS)数据问题在过去几十年中一直是一个非常重要的话题。有大量的文献提出了各种各样的方法来处理这种情况,其中变量选择是一个引人注目的想法。另一方面,深度神经网络被用来模拟响应和特征之间的复杂关系和相互作用,这很难用线性或加性模型来捕捉。本文讨论了基于神经网络模型的变量选择技术的研究现状。结果表明,采用神经网络的阶段性算法存在着后期进入模型的变量不一致等缺点。在此基础上,我们提出了一种集合方法来实现更好的变量选择,并证明了选择错误变量的概率趋于零。然后,我们讨论了额外的正则化处理过拟合,使更好的回归和分类。我们研究了我们提出的方法的各种统计特性。文中给出了大量的仿真和实际数据实例,为理论和方法提供了支持。 摘要:High-dimensional, low sample-size (HDLSS) data problems have been a topic of immense importance for the last couple of decades. There is a vast literature that proposed a wide variety of approaches to deal with this situation, among which variable selection was a compelling idea. On the other hand, a deep neural network has been used to model complicated relationships and interactions among responses and features, which is hard to capture using a linear or an additive model. In this paper, we discuss the current status of variable selection techniques with the neural network models. We show that the stage-wise algorithm with neural network suffers from disadvantages such as the variables entering into the model later may not be consistent. We then propose an ensemble method to achieve better variable selection and prove that it has probability tending to zero that a false variable is selected. Then, we discuss additional regularization to deal with over-fitting and make better regression and classification. We study various statistical properties of our proposed method. Extensive simulations and real data examples are provided to support the theory and methodology.

【26】 The Atlas of Lane Changes: Investigating Location-dependent Lane Change Behaviors Using Measurement Data from a Customer Fleet 标题:车道改变地图集:使用客户车队的测量数据调查与位置相关的车道改变行为

作者:Florian Wirthmüller,Jochen Hipp,Christian Reichenbächer,Manfred Reichert 机构: Reichert are with the Institute of Databases andInformation Systems (DBIS) at Ulm University, Reichenb¨acher is with the Wilhelm-Schickard-Institute for Informaticsat Eberhard Karls University T¨ubingen 备注:the article has been accepted for publication during the 24th IEEE Intelligent Transportation Systems Conference (ITSC), 8 pages, 11 figures 链接:https://arxiv.org/abs/2107.04029 摘要:对周围交通参与者行为的预测是驾驶员辅助和自动驾驶系统的一项重要而富有挑战性的任务。目前的方法主要集中在对交通状况的动态方面进行建模,并试图在此基础上预测交通参与者的行为。在本文中,我们通过计算特定位置的先验车道变化概率,朝着扩展这一常见做法迈出了第一步。这背后的想法是直截了当的:人类的驾驶行为可能会在完全相同的交通状况下发生变化,这取决于各自的位置。例如,司机们可能会问自己:我是应该立即通过前面的卡车,还是应该等到到达前方几公里处弯度较小的路段?尽管这样的信息本身还远远不允许进行行为预测,但很明显,当将这种特定于位置的先验概率纳入预测时,今天的方法将大大受益。例如,我们的调查显示,高速公路立交桥往往会增强驾驶员进行车道变更的动机,而曲线似乎具有车道变更抑制效应。然而,对所有考虑的局部条件的调查表明,各种效应的叠加可能导致某些位置出现意外的概率。因此,我们建议动态构建和维护基于客户车队数据的车道变化概率图,以支持具有附加信息的车载预测系统。要获得可靠的车道变更概率,广泛的客户群是成功的关键。 摘要:The prediction of surrounding traffic participants behavior is a crucial and challenging task for driver assistance and autonomous driving systems. Today's approaches mainly focus on modeling dynamic aspects of the traffic situation and try to predict traffic participants behavior based on this. In this article we take a first step towards extending this common practice by calculating location-specific a-priori lane change probabilities. The idea behind this is straight forward: The driving behavior of humans may vary in exactly the same traffic situation depending on the respective location. E.g. drivers may ask themselves: Should I pass the truck in front of me immediately or should I wait until reaching the less curvy part of my route lying only a few kilometers ahead? Although, such information is far away from allowing behavior prediction on its own, it is obvious that today's approaches will greatly benefit when incorporating such location-specific a-priori probabilities into their predictions. 
For example, our investigations show that highway interchanges tend to enhance driver's motivation to perform lane changes, whereas curves seem to have lane change-dampening effects. Nevertheless, the investigation of all considered local conditions shows that superposition of various effects can lead to unexpected probabilities at some locations. We thus suggest dynamically constructing and maintaining a lane change probability map based on customer fleet data in order to support onboard prediction systems with additional information. For deriving reliable lane change probabilities a broad customer fleet is the key to success.
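按路段聚合车队观测、得到先验变道概率的思路,可以用如下示意代码说明(路段名与数据均为虚构示例):

```python
from collections import defaultdict

def lane_change_map(events):
    """Aggregate fleet observations into per-segment a-priori lane-change
    probabilities. `events` is a list of (segment_id, changed_lane) pairs."""
    counts = defaultdict(lambda: [0, 0])   # segment -> [changes, total]
    for seg, changed in events:
        counts[seg][0] += int(changed)
        counts[seg][1] += 1
    return {seg: c / n for seg, (c, n) in counts.items()}

events = [("interchange_A", True), ("interchange_A", True),
          ("interchange_A", False), ("curve_B", False),
          ("curve_B", False), ("curve_B", True), ("curve_B", False)]
probs = lane_change_map(events)
print(probs)
```

与文中观察一致的模式会体现为:立交桥路段的经验变道概率(此例为2/3)高于弯道路段(此例为1/4)。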

【27】 A Machine Learning Approach to Safer Airplane Landings: Predicting Runway Conditions using Weather and Flight Data 标题:飞机安全着陆的机器学习方法:利用天气和飞行数据预测跑道状况

作者:Alise Danielle Midtfjord,Riccardo De Bin,Arne Bang Huseby 机构:Department of Mathematics, University of Oslo, Norway 链接:https://arxiv.org/abs/2107.04010 摘要:跑道表面的冰雪减少了减速和方向控制所需的可用轮胎-路面摩擦,并在冬季对航空业造成潜在的经济和安全威胁。为了启动适当的安全程序,飞行员需要准确及时地了解跑道表面的实际情况。本研究利用XGBoost建立了一个综合跑道评估系统,该系统包括一个用于预测滑溜状况的分类模型和一个用于预测滑溜程度的回归模型。这些模型是根据天气数据和跑道报告数据训练的。跑道表面状况由轮胎-路面摩擦系数表示,该系数是根据着陆飞机的飞行传感器数据估计的。为了评估模型的性能,将其与几种最先进的跑道评估方法进行了比较。XGBoost模型以ROC AUC为0.95识别湿滑跑道,以MAE为0.0254预测摩擦系数,并优于以往的所有方法。结果表明,当领域知识用于变量提取时,机器学习方法对复杂物理现象的建模能力较强,具有较高的精度。XGBoost模型与SHAP(SHapley加法解释)近似相结合,为机场运营商和飞行员提供了一个可理解的决策支持系统,有助于提高机场跑道的安全性和经济性。 摘要:The presence of snow and ice on runway surfaces reduces the available tire-pavement friction needed for retardation and directional control and causes potential economic and safety threats for the aviation industry during the winter seasons. To activate appropriate safety procedures, pilots need accurate and timely information on the actual runway surface conditions. In this study, XGBoost is used to create a combined runway assessment system, which includes a classifcation model to predict slippery conditions and a regression model to predict the level of slipperiness. The models are trained on weather data and data from runway reports. The runway surface conditions are represented by the tire-pavement friction coefficient, which is estimated from flight sensor data from landing aircrafts. To evaluate the performance of the models, they are compared to several state-of-the-art runway assessment methods. The XGBoost models identify slippery runway conditions with a ROC AUC of 0.95, predict the friction coefficient with a MAE of 0.0254, and outperforms all the previous methods. The results show the strong abilities of machine learning methods to model complex, physical phenomena with a good accuracy when domain knowledge is used in the variable extraction. 
The XGBoost models are combined with SHAP (SHapley Additive exPlanations) approximations to provide a comprehensible decision support system for airport operators and pilots, which can contribute to safer and more economic operations of airport runways.
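论文使用XGBoost;下面用scikit-learn的梯度提升模型给出一个自包含的示意(合成数据与摩擦系数公式均为假设,仅说明"分类预测湿滑状态+回归预测摩擦系数"的组合思路):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 1000
temp = rng.uniform(-15, 10, n)     # air temperature [deg C]
precip = rng.uniform(0, 5, n)      # precipitation [mm/h]
# Hypothetical friction coefficient: lower when cold and wet.
friction = (0.8 - 0.05 * precip * (temp < 0)
            - 0.01 * np.maximum(0.0, -temp)
            + rng.normal(0, 0.02, n))
slippery = (friction < 0.6).astype(int)
X = np.column_stack([temp, precip])

# Classifier flags slippery conditions; regressor predicts the friction level.
clf = GradientBoostingClassifier(random_state=0).fit(X[:800], slippery[:800])
reg = GradientBoostingRegressor(random_state=0).fit(X[:800], friction[:800])
acc = clf.score(X[800:], slippery[800:])
mae = np.abs(reg.predict(X[800:]) - friction[800:]).mean()
print(round(acc, 3), round(mae, 4))
```

在这一干净的合成设定下,两个模型都应达到较高精度;真实跑道数据的难度当然远高于此。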

【28】 On Margins and Derandomisation in PAC-Bayes 标题:关于PAC-Bayes中的边距和去随机化

作者:Felix Biggs,Benjamin Guedj 机构:Centre for Artificial Intelligence, London, United Kingdom, University College London and Inria 链接:https://arxiv.org/abs/2107.03955 摘要:我们开发了一个框架去随机的PAC贝叶斯泛化边界,在训练数据上实现了一个余量,将这个过程与度量集中现象联系起来。我们将这些工具应用于线性预测、具有异常erf激活函数的单隐层神经网络和深ReLU网络,获得了新的界。该方法还推广到“部分去随机化”的思想,其中只有一些层是去随机的,而其他层是随机的。这允许在更复杂的数据集上对单隐层网络进行经验评估,并有助于弥合非随机深层网络和随机深层网络(如PAC-Bayes中通常检查的)的泛化界限之间的差距。 摘要:We develop a framework for derandomising PAC-Bayesian generalisation bounds achieving a margin on training data, relating this process to the concentration-of-measure phenomenon. We apply these tools to linear prediction, single-hidden-layer neural networks with an unusual erf activation function, and deep ReLU networks, obtaining new bounds. The approach is also extended to the idea of "partial-derandomisation" where only some layers are derandomised and the others are stochastic. This allows empirical evaluation of single-hidden-layer networks on more complex datasets, and helps bridge the gap between generalisation bounds for non-stochastic deep networks and those for randomised deep networks as generally examined in PAC-Bayes.

【29】 Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic Approach to Manifold Dimension Estimation 标题:数据分析中的流形假设:流形维数估计的双重几何-概率方法

作者:Alexander Ivanov,Gleb Nosovskiy,Alexey Chekunov,Denis Fedoseev,Vladislav Kibkalo,Mikhail Nikulin,Fedor Popelenskiy,Stepan Komkov,Ivan Mazurenko,Aleksandr Petiushko 链接:https://arxiv.org/abs/2107.03903 摘要:流形假说认为,高维空间中的数据点实际上位于低维流形的附近。在许多情况下,这一假设得到了实证验证,并用于增强无监督和半监督学习。在这里,我们提出了新的方法来流形假设检查和潜在的流形维数估计。为了做到这一点,我们同时使用两种截然不同的方法-一种是几何方法,另一种是概率方法-并检查它们是否给出相同的结果。我们的几何方法是对Minkowski维数计算中著名的盒计数算法的稀疏数据的一种改进。概率方法是一种新方法。虽然它利用标准的最近邻域距离,但它不同于以前在这种情况下使用的方法。该方法鲁棒性强,速度快,并包含特殊的初始数据转换。在实际数据集上的实验表明,基于两种方法结合的方法是有效的。 摘要:Manifold hypothesis states that data points in high-dimensional space actually lie in close vicinity of a manifold of much lower dimension. In many cases this hypothesis was empirically verified and used to enhance unsupervised and semi-supervised learning. Here we present new approach to manifold hypothesis checking and underlying manifold dimension estimation. In order to do it we use two very different methods simultaneously - one geometric, another probabilistic - and check whether they give the same result. Our geometrical method is a modification for sparse data of a well-known box-counting algorithm for Minkowski dimension calculation. The probabilistic method is new. Although it exploits standard nearest neighborhood distance, it is different from methods which were previously used in such situations. This method is robust, fast and includes special preliminary data transformation. Experiments on real datasets show that the suggested approach based on two methods combination is powerful and effective.
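文中提到的盒计数(box-counting)维数估计可以用如下代码示意(标准算法的一个简化版本,并非论文针对稀疏数据的改进版):

```python
import numpy as np

def box_counting_dimension(points, scales=(4, 8, 16, 32, 64)):
    """Estimate the Minkowski (box-counting) dimension of points in
    [0,1]^d: count occupied boxes at several scales and fit the slope of
    log N(eps) against log(1/eps)."""
    log_n, log_inv_eps = [], []
    for s in scales:
        boxes = {tuple(idx) for idx in np.floor(points * s).astype(int)}
        log_n.append(np.log(len(boxes)))
        log_inv_eps.append(np.log(s))
    slope, _ = np.polyfit(log_inv_eps, log_n, 1)
    return slope

rng = np.random.default_rng(0)
t = rng.uniform(size=20_000)
# A 1-dimensional manifold (a segment) embedded in the unit square.
curve = np.column_stack([t, 0.5 * t])
print(round(box_counting_dimension(curve), 2))
```

对于嵌入二维空间的一维线段,估计出的维数应接近1,这正是流形假设的含义:数据的内在维数可以远低于其所在空间的维数。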

【30】 The Price of Diversity 标题:多样性的代价

作者:Hari Bandi,Dimitris Bertsimas 机构:Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 链接:https://arxiv.org/abs/2107.03900 摘要:在涉及个人选择的数据集中,性别、种族和种族方面的系统性偏见通常是无意识的。因此,社会已经发现,在这样的环境下,以保持精英管理的方式来减轻偏见和实现多样性是一个挑战。我们提出(a)一种新的优化方法,基于同时优化翻转结果标签和训练分类模型,以发现选择过程中的变化,从而在不显著影响精英管理的情况下实现多样性,以及(b)一种新的实现工具,采用最佳分类树来提供关于个体的哪些属性导致其标签翻转的见解,并帮助以人类决策者可以理解的方式在当前选择过程中做出改变。我们提供了三个真实世界数据集的案例研究,包括假释、律师资格和贷款决定,并证明多样性的价格很低,有时是负面的,也就是说,我们可以修改我们的选择过程,以增强多样性,而不显着影响精英管理,有时改善它。 摘要:Systemic bias with respect to gender, race and ethnicity, often unconscious, is prevalent in datasets involving choices among individuals. Consequently, society has found it challenging to alleviate bias and achieve diversity in a way that maintains meritocracy in such settings. We propose (a) a novel optimization approach based on optimally flipping outcome labels and training classification models simultaneously to discover changes to be made in the selection process so as to achieve diversity without significantly affecting meritocracy, and (b) a novel implementation tool employing optimal classification trees to provide insights on which attributes of individuals lead to flipping of their labels, and to help make changes in the current selection processes in a manner understandable by human decision makers. We present case studies on three real-world datasets consisting of parole, admissions to the bar and lending decisions, and demonstrate that the price of diversity is low and sometimes negative, that is we can modify our selection processes in a way that enhances diversity without affecting meritocracy significantly, and sometimes improving it.

【31】 SSSE: Efficiently Erasing Samples from Trained Machine Learning Models 标题:SSSE:从训练好的机器学习模型中高效地擦除样本

作者:Alexandra Peste,Dan Alistarh,Christoph H. Lampert 机构:IST Austria 链接:https://arxiv.org/abs/2107.03860 摘要:大量用户提供的数据的可用性是机器学习在许多实际任务中取得成功的关键。最近,人们越来越意识到,应该让用户更多地控制他们的数据的使用方式。特别是,用户应有权禁止将其数据用于训练机器学习系统,并有权将其从已经训练过的系统中删除。虽然已经提出了几种样本擦除方法,但它们都存在一些缺点,阻碍了它们的广泛应用。大多数方法要么只适用于非常特定的模型族,要么牺牲了太多原始模型的精度,要么对内存或计算要求过高。本文提出了一种高效的样本擦除算法SSSE,该算法适用于一类广泛的机器学习模型。通过对模型损失景观的二阶分析,我们导出了模型参数的一个封闭形式更新步骤,该步骤只需要访问要删除的数据,而不需要访问原始的训练集。在CelebFaces attributes(CelebA)、Animals with attributes 2(AwA2)和CIFAR10三个数据集上的实验表明,在某些情况下,SSSE消除样本的效果几乎可以媲美最佳但不切实际的金标准,即仅使用允许的数据从头开始训练新模型。 摘要:The availability of large amounts of user-provided data has been key to the success of machine learning for many real-world tasks. Recently, an increasing awareness has emerged that users should be given more control about how their data is used. In particular, users should have the right to prohibit the use of their data for training machine learning systems, and to have it erased from already trained systems. While several sample erasure methods have been proposed, all of them have drawbacks which have prevented them from gaining widespread adoption. Most methods are either only applicable to very specific families of models, sacrifice too much of the original model's accuracy, or they have prohibitive memory or computational requirements. In this paper, we propose an efficient and effective algorithm, SSSE, for samples erasure, that is applicable to a wide class of machine learning models. From a second-order analysis of the model's loss landscape we derive a closed-form update step of the model parameters that only requires access to the data to be erased, not to the original training set. Experiments on three datasets, CelebFaces attributes (CelebA), Animals with Attributes 2 (AwA2) and CIFAR10, show that in certain cases SSSE can erase samples almost as well as the optimal, yet impractical, gold standard of training a new model from scratch with only the permitted data.
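"封闭形式样本删除"这一思想可以在线性最小二乘模型上得到一个可精确验证的小例子:用Sherman-Morrison公式删除单个样本(这不是论文的SSSE算法,但同样只需访问被删除的样本,而不需要其余训练数据):

```python
import numpy as np

def remove_sample(XtX_inv, Xty, x, y):
    """Closed-form removal of one sample (x, y) from a least-squares fit
    via the Sherman-Morrison update of (X^T X)^{-1}. Only the erased
    sample is needed, not the remaining training data."""
    v = XtX_inv @ x
    XtX_inv_new = XtX_inv + np.outer(v, v) / (1.0 - x @ v)
    Xty_new = Xty - y * x
    return XtX_inv_new, Xty_new

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)

XtX_inv = np.linalg.inv(X.T @ X)
Xty = X.T @ y
# Erase the last sample with the closed-form update ...
XtX_inv_new, Xty_new = remove_sample(XtX_inv, Xty, X[-1], y[-1])
beta_erased = XtX_inv_new @ Xty_new
# ... and compare with the gold standard: retraining from scratch.
beta_retrained = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)[0]
print(np.allclose(beta_erased, beta_retrained))
```

对线性模型,封闭形式更新与"只用剩余数据重新训练"的金标准完全一致;SSSE将类似的二阶更新思想推广到更一般的模型。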

【32】 Short-term Renewable Energy Forecasting in Greece using Prophet Decomposition and Tree-based Ensembles 标题:基于PROPHET分解和基于树的集成的希腊短期可再生能源预测

作者:Argyrios Vartholomaios,Stamatis Karlos,Eleftherios Kouloumpris,Grigorios Tsoumakas 机构: School of Informatics, Aristotle University, Thessaloniki, Greece, Medoid AI, Egnatia St., Thessaloniki, Greece 备注:11 pages, 7 figures 链接:https://arxiv.org/abs/2107.03825 摘要:利用可再生能源的能源生产由于其间歇性的性质而表现出固有的不确定性。然而,统一的欧洲能源市场促进了区域能源系统运营商对可再生能源(RES)的不断渗透。因此,RES预测可以帮助整合这些不稳定的能源,因为它可以提高电力系统的可靠性和降低辅助运行成本。本文提出了一个新的希腊太阳能和风能发电量预测数据集,并介绍了一个特征工程管道,丰富了数据集的维数空间。此外,我们提出了一种新的方法,利用创新的Prophet模型,一种端到端的预测工具,在基于树的集成提供短期预测之前,在分解能量时间序列时考虑多种非线性趋势。系统的性能通过具有代表性的评估指标来衡量,并通过估计模型在行业提供的绝对误差阈值方案下的泛化程度来衡量。所提出的混合模型与基线持久性模型、基于树的回归集成和Prophet模型相竞争,成功地超越了它们,呈现出更低的错误率和更有利的误差分布。 摘要:Energy production using renewable sources exhibits inherent uncertainties due to their intermittent nature. Nevertheless, the unified European energy market promotes the increasing penetration of renewable energy sources (RES) by the regional energy system operators. Consequently, RES forecasting can assist in the integration of these volatile energy sources, since it leads to higher reliability and reduced ancillary operational costs for power systems. This paper presents a new dataset for solar and wind energy generation forecast in Greece and introduces a feature engineering pipeline that enriches the dimensional space of the dataset. In addition, we propose a novel method that utilizes the innovative Prophet model, an end-to-end forecasting tool that considers several kinds of nonlinear trends in decomposing the energy time series before a tree-based ensemble provides short-term predictions. The performance of the system is measured through representative evaluation metrics, and by estimating the model's generalization under an industry-provided scheme of absolute error thresholds. The proposed hybrid model competes with baseline persistence models, tree-based regression ensembles, and the Prophet model, managing to outperform them, presenting both lower error rates and more favorable error distribution.
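"先分解趋势、再用树集成预测残差"的流程可以用如下简化示意(以线性趋势拟合代替Prophet分解,数据为合成序列,仅说明整体流程):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
t = np.arange(500)
# Synthetic "energy" series: trend + daily-like cycle + noise.
y = 0.01 * t + np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.1, len(t))

# Step 1: remove a fitted linear trend (a stand-in for the Prophet decomposition).
trend_coef = np.polyfit(t, y, 1)
detrended = y - np.polyval(trend_coef, t)

# Step 2: a tree ensemble predicts the detrended series from lag features.
lags = 24
X = np.column_stack([detrended[i:len(t) - lags + i] for i in range(lags)])
target = detrended[lags:]
model = GradientBoostingRegressor(random_state=0).fit(X[:-100], target[:-100])
pred = model.predict(X[-100:]) + np.polyval(trend_coef, t[-100:])
mae = np.abs(pred - y[-100:]).mean()
print(round(mae, 3))
```

由于周期成分可以从24步滞后特征中恢复,留出的最后100个点上的MAE应接近噪声水平。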

【33】 Degrees of riskiness, falsifiability, and truthlikeness. A neo-Popperian account applicable to probabilistic theories 标题:风险、可证伪性和真实性的程度。一种适用于概率论的新波普尔学说

作者:Leander Vignero,Sylvia Wenmackers 机构:KU Leuven, Institute of Philosophy, Centre for Logic and Philosophy of Science, Kardinaal Mercierplein , – bus , Leuven, Belgium., (Forthcoming in Synthese.) 备注:41 pages; 3 figures; accepted for publication in Synthese 链接:https://arxiv.org/abs/2107.03772 摘要:在本文中,我们重新审视了波普尔的三个概念:科学假设或理论的风险性、可证伪性和真实性。首先,我们明确了风险概念的基本维度。其次,我们检验了可证伪程度是否可以定义以及如何定义,以及它们如何与风险概念的各个维度以及实验背景相关。第三,我们考虑风险的关系(预期程度)的真实性。自始至终,我们特别关注概率理论,并为概率理论的逼真性提供了一个尝试性的、定量的解释。 摘要:In this paper, we take a fresh look at three Popperian concepts: riskiness, falsifiability, and truthlikeness (or verisimilitude) of scientific hypotheses or theories. First, we make explicit the dimensions that underlie the notion of riskiness. Secondly, we examine if and how degrees of falsifiability can be defined, and how they are related to various dimensions of the concept of riskiness as well as the experimental context. Thirdly, we consider the relation of riskiness to (expected degrees of) truthlikeness. Throughout, we pay special attention to probabilistic theories and we offer a tentative, quantitative account of verisimilitude for probabilistic theories.

【34】 Analytically Tractable Hidden-States Inference in Bayesian Neural Networks 标题:贝叶斯神经网络中可解析处理的隐态推理

作者:Luong-Ha Nguyen,James-A. Goulet 机构:Department of Civil, Geologic and Mining Engineering, Polytechnique Montréal, CANADA 备注:37 pages, 13 figures 链接:https://arxiv.org/abs/2107.03759 摘要:除了少数例外,神经网络一直依赖反向传播和梯度下降作为推理机来学习模型参数,因为神经网络的封闭形式贝叶斯推理一直被认为是难以解决的。在本文中,我们展示了如何利用可解析处理的近似高斯推理(TAGI)的能力来推断隐藏状态,而不是仅仅用它来推断网络的参数。它带来的一个新能力是:通过施加为实现特定目标而设计的约束来推断隐藏状态,如三个示例所示:(1)生成对抗性攻击示例,(2)使用神经网络作为黑盒优化方法,(3)将该推理应用于连续动作强化学习。这些应用程序展示了以前专属于基于梯度的优化方法的任务,现在如何通过可解析处理的推理来完成。 摘要:With few exceptions, neural networks have been relying on backpropagation and gradient descent as the inference engine in order to learn the model parameters, because the closed-form Bayesian inference for neural networks has been considered to be intractable. In this paper, we show how we can leverage the tractable approximate Gaussian inference's (TAGI) capabilities to infer hidden states, rather than only using it for inferring the network's parameters. One novel aspect it allows is to infer hidden states through the imposition of constraints designed to achieve specific objectives, as illustrated through three examples: (1) the generation of adversarial-attack examples, (2) the usage of a neural network as a black-box optimization method, and (3) the application of inference on continuous-action reinforcement learning. These applications showcase how tasks that were previously reserved to gradient-based optimization approaches can now be approached with analytically tractable inference.
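"可解析处理的高斯推理"所依赖的基本构件是线性-高斯条件化。下面给出一个封闭形式的高斯隐状态推断小例子(并非TAGI本身,仅示意其底层的高斯条件更新):

```python
import numpy as np

def gaussian_condition(mu, Sigma, H, y, R):
    """Closed-form Gaussian inference: given prior x ~ N(mu, Sigma) and an
    observation y = H x + noise with noise ~ N(0, R), return the posterior
    mean and covariance of x (the standard linear-Gaussian update)."""
    S = H @ Sigma @ H.T + R
    K = Sigma @ H.T @ np.linalg.inv(S)   # Kalman-style gain
    mu_post = mu + K @ (y - H @ mu)
    Sigma_post = Sigma - K @ H @ Sigma
    return mu_post, Sigma_post

# Hidden state with prior N(0, I); we observe its first component plus noise.
mu = np.zeros(2)
Sigma = np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.25]])
mu_post, Sigma_post = gaussian_condition(mu, Sigma, H, np.array([1.0]), R)
print(mu_post, np.diag(Sigma_post))
```

被观测的分量向观测值收缩且方差减小,而未被观测的分量保持先验不变;TAGI正是在整个网络中逐层执行这类解析更新。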

【35】 Bag of Tricks for Neural Architecture Search 标题:用于神经结构搜索的一袋小把戏

作者:Thomas Elsken,Benedikt Staffler,Arber Zela,Jan Hendrik Metzen,Frank Hutter 机构:Bosch Center for Artificial Intelligence,University of Freiburg 链接:https://arxiv.org/abs/2107.03719 摘要:虽然神经结构搜索方法在前几年取得了成功,并在各种问题上取得了新的最新进展,但它们也因不稳定、对其超参数高度敏感以及通常表现不优于随机搜索而受到批评。为了阐明这个问题,我们讨论了一些有助于提高稳定性、效率和整体性能的实际考虑因素。 摘要:While neural architecture search methods have been successful in previous years and led to new state-of-the-art performance on various problems, they have also been criticized for being unstable, being highly sensitive with respect to their hyperparameters, and often not performing better than random search. To shed some light on this issue, we discuss some practical considerations that help improve the stability, efficiency and overall performance.

【36】 Generalization Error of GAN from the Discriminator's Perspective 标题:从鉴别器的角度看GaN的泛化误差

作者:Hongkang Yang,Weinan E 机构:Program in Applied and Computational Mathematics, Princeton University, Department of Mathematics, Princeton University, Beijing Institute of Big Data Research 链接:https://arxiv.org/abs/2107.03633 摘要:生成对抗网络(GAN)是一种学习高维分布的著名模型,但其泛化能力的机制尚不清楚。尤其是GAN易受记忆现象的影响,最终收敛到经验分布。我们考虑一个简化的GAN模型,用密度代替生成器,并分析鉴别器如何促进泛化。我们发现,在提前停止的情况下,用Wasserstein度量度量的泛化误差可以摆脱维数灾难,尽管从长期来看,记忆是不可避免的。此外,我们还给出了一个关于WGAN学习困难性的结果。 摘要:The generative adversarial network (GAN) is a well-known model for learning high-dimensional distributions, but the mechanism for its generalization ability is not understood. In particular, GAN is vulnerable to the memorization phenomenon, the eventual convergence to the empirical distribution. We consider a simplified GAN model with the generator replaced by a density, and analyze how the discriminator contributes to generalization. We show that with early stopping, the generalization error measured by Wasserstein metric escapes from the curse of dimensionality, despite that in the long term, memorization is inevitable. In addition, we present a hardness of learning result for WGAN.

【37】 Validation and Inference of Agent Based Models 标题:基于Agent的模型验证与推理

作者:D. Townsend 机构:A dissertation submitted in partial fulfilment of the requirements for the Degree of Bachelor of Computing and Mathematical Sciences with Honours at The University of Waikato, by Dale Townsend, 2021 链接:https://arxiv.org/abs/2107.03619 摘要:基于Agent的建模(ABM)是一种模拟自主Agent行为和交互作用的计算框架。由于基于Agent的模型通常代表复杂系统,因此获取模型参数的似然函数几乎总是很困难的。为了理解模型的输出,有必要在无似然的上下文中进行推理。近似贝叶斯计算(ABC)是一种适合这种推理的方法。它可以应用于基于Agent的模型,既可以验证仿真结果,又可以推断出一组描述模型的参数。ABC领域最近的研究已经产生了计算近似似然的越来越高效的算法。我们使用汉密尔顿CBD的行人模型对这些算法进行了研究和比较。 摘要:Agent Based Modelling (ABM) is a computational framework for simulating the behaviours and interactions of autonomous agents. As Agent Based Models are usually representative of complex systems, obtaining a likelihood function of the model parameters is nearly always intractable. There is a necessity to conduct inference in a likelihood free context in order to understand the model output. Approximate Bayesian Computation is a suitable approach for this inference. It can be applied to an Agent Based Model to both validate the simulation and infer a set of parameters to describe the model. Recent research in ABC has yielded increasingly efficient algorithms for calculating the approximate likelihood. These are investigated and compared using a pedestrian model in the Hamilton CBD.
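摘要中提到的近似贝叶斯计算(ABC)可以用最基本的拒绝采样版本来示意(以一个玩具模型代替真实的基于Agent的模型;先验、汇总统计量与容差均为示例假设):

```python
import numpy as np

def abc_rejection(observed, prior_sample, simulate, summary, tol, n_draws, seed=0):
    """Basic ABC rejection sampling: keep prior draws whose simulated
    summary statistic lies within `tol` of the observed summary."""
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        if abs(summary(simulate(theta, rng)) - summary(observed)) < tol:
            accepted.append(theta)
    return np.array(accepted)

# Toy model standing in for an ABM: data are N(theta, 1), summary = mean.
rng = np.random.default_rng(1)
observed = rng.normal(2.0, 1.0, size=100)
posterior = abc_rejection(
    observed,
    prior_sample=lambda r: r.uniform(-5, 5),
    simulate=lambda th, r: r.normal(th, 1.0, size=100),
    summary=np.mean,
    tol=0.2, n_draws=20_000,
)
print(len(posterior), round(posterior.mean(), 2))
```

接受的先验抽样构成近似后验,其均值应接近生成数据所用的真实参数2.0;更高效的ABC变体(如ABC-SMC)改进的正是这一朴素拒绝机制的采样效率。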

【38】 CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation 标题:CSDI:概率时间序列补偿的条件分数扩散模型

作者:Yusuke Tashiro,Jiaming Song,Yang Song,Stefano Ermon 机构:Department of Computer Science, Stanford University, Stanford, CA, USA, Mitsubishi UFJ Trust Investment Technology Institute, Tokyo, Japan, Japan Digital Design, Tokyo, Japan 链接:https://arxiv.org/abs/2107.03502 摘要:时间序列缺失值的插补在医疗卫生和金融领域有着广泛的应用。虽然自回归模型是时间序列插补的自然候选模型,但基于分数的扩散模型最近在许多任务(如图像生成和音频合成)中的性能超过了现有的同类模型,并有望用于时间序列插补。本文提出了一种新的时间序列插补方法&条件分数扩散插补模型(CSDI)。与现有的基于分数的方法不同,条件扩散模型被明确训练用于插补,并且可以利用观测值之间的相关性。在医疗保健和环境数据方面,CSDI比现有的流行绩效指标概率插补方法提高了40-70%。此外,与现有的确定性插补方法相比,CSDI的确定性插补方法可减少5-20%的误差。此外,CSDI还可以应用于时间序列插值和概率预测,与现有的基线相比具有一定的竞争力。 摘要:The imputation of missing values in time series has many applications in healthcare and finance. While autoregressive models are natural candidates for time series imputation, score-based diffusion models have recently outperformed existing counterparts including autoregressive models in many tasks such as image generation and audio synthesis, and would be promising for time series imputation. In this paper, we propose Conditional Score-based Diffusion models for Imputation (CSDI), a novel time series imputation method that utilizes score-based diffusion models conditioned on observed data. Unlike existing score-based approaches, the conditional diffusion model is explicitly trained for imputation and can exploit correlations between observed values. On healthcare and environmental data, CSDI improves by 40-70% over existing probabilistic imputation methods on popular performance metrics. In addition, deterministic imputation by CSDI reduces the error by 5-20% compared to the state-of-the-art deterministic imputation methods. Furthermore, CSDI can also be applied to time series interpolation and probabilistic forecasting, and is competitive with existing baselines.

【39】 Impossibility results for fair representations 标题:公平陈述的不可能结果

作者:Tosca Lechner,Shai Ben-David,Sushant Agarwal,Nivasini Ananthakrishnan 机构:School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada 链接:https://arxiv.org/abs/2107.03483 摘要:随着人们对机器学习公平性的日益关注,以及对数据表示在数据处理任务中核心作用的认识,公平数据表示的概念引起了广泛兴趣。这种表示的目标是:保证在该表示下训练的模型(例如分类器)遵守某些公平性约束。当这些表示可以固定下来、用于各种不同任务的模型训练时,以及当它们充当原始数据(表示设计者已知)与潜在恶意代理(利用表示下的数据学习预测模型并做出决策)之间的数据过滤器时,这些表示是有用的。近期的一长串研究论文都致力于为实现这些目标提供工具。然而,我们证明这基本上是徒劳的。粗略地说,没有任何表示能够保证用它训练的分类器在不同任务上的公平性;即使是实现与标签无关的人口均等(Demographic Parity)公平这一基本目标,一旦边际数据分布发生偏移也会失败。更精细的公平性概念,如几率均等(Odds Equality),无法由不考虑具体任务标注规则(公平性正是相对于这些规则来评估的)的表示来保证,即使边际数据分布是先验已知的。此外,除了平凡情形外,没有任何表示能够在允许对两个不同任务都做出准确标签预测的同时,保证这两个任务上的几率均等公平。虽然我们的一些结论符合直觉,但我们对这类不可能性给出了(并证明了)清晰的陈述,这往往与许多近期公平表示工作所传达的印象形成对比。 摘要:With the growing awareness to fairness in machine learning and the realization of the central role that data representation has in data processing tasks, there is an obvious interest in notions of fair data representations. The goal of such representations is that a model trained on data under the representation (e.g., a classifier) will be guaranteed to respect some fairness constraints. Such representations are useful when they can be fixed for training models on various different tasks and also when they serve as data filtering between the raw data (known to the representation designer) and potentially malicious agents that use the data under the representation to learn predictive models and make decisions. A long list of recent research papers strive to provide tools for achieving these goals. However, we prove that this is basically a futile effort. Roughly stated, we prove that no representation can guarantee the fairness of classifiers for different tasks trained using it; even the basic goal of achieving label-independent Demographic Parity fairness fails once the marginal data distribution shifts. More refined notions of fairness, like Odds Equality, cannot be guaranteed by a representation that does not take into account the task specific labeling rule with respect to which such fairness will be evaluated (even if the marginal data distribution is known a priori). Furthermore, except for trivial cases, no representation can guarantee Odds Equality fairness for any two different tasks, while allowing accurate label predictions for both. While some of our conclusions are intuitive, we formulate (and prove) crisp statements of such impossibilities, often contrasting impressions conveyed by many recent works on fair representations.
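摘要中"人口均等在边际分布偏移后失效"的论断,可以用一个小数值实验直观感受:同一个固定的表示/分类器,在一个分布上满足人口均等,在边际分布偏移后的另一个分布上则不再满足。分布形状与阈值均为示例。

```python
import numpy as np

def demographic_parity_gap(pred, group):
    """两组之间正例预测率之差(人口均等的违反程度)。"""
    return abs(pred[group == 0].mean() - pred[group == 1].mean())

rng = np.random.default_rng(0)

# 固定的"表示/分类器":对特征 x 按阈值 0 判正
clf = lambda x: (x > 0).astype(float)

# 分布 P:两组的特征分布相同 -> 人口均等近似成立
n = 100_000
g = rng.integers(0, 2, n)
x_p = rng.normal(0.0, 1.0, n)            # 特征与组别无关
gap_p = demographic_parity_gap(clf(x_p), g)

# 分布 Q:边际分布偏移,组 1 的特征整体右移 -> 同一分类器不再均等
x_q = np.where(g == 1, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))
gap_q = demographic_parity_gap(clf(x_q), g)

print(round(gap_p, 3), round(gap_q, 3))  # gap_p 接近 0,gap_q 明显大于 0
```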

【40】 In-Network Learning: Distributed Training and Inference in Networks 标题:网络内学习:网络中的分布式训练和推理

作者:Matei Moldoveanu,Abdellatif Zaidi 机构:† Université Paris-Est, Champs-sur-Marne, France, ∤ Mathematical and Algorithmic Sciences Lab., Paris Research Center, Huawei France 备注:Submitted to the IEEE Journal on Selected Areas in Communications (JSAC) Series on Machine Learning for Communications and Networks. arXiv admin note: substantial text overlap with arXiv:2104.14929 链接:https://arxiv.org/abs/2107.03433 摘要:人们普遍认为,将现代机器学习技术的成功应用于移动设备和无线网络,有可能实现重要的新服务。然而,这带来了巨大的挑战,主要是由于数据和处理能力在无线网络中高度分布。在本文中,我们开发了一个学习算法和一个体系结构,它不仅在训练阶段,而且在推理阶段使用多个数据流和处理单元。特别地,分析揭示了推理是如何在网络中传播和融合的。研究了该方法的设计准则和带宽要求。此外,我们还讨论了神经网络在典型无线接入系统中的应用;并提供实验,说明相对于最先进的技术的好处。 摘要:It is widely perceived that leveraging the success of modern machine learning techniques to mobile devices and wireless networks has the potential of enabling important new services. This, however, poses significant challenges, essentially due to that both data and processing power are highly distributed in a wireless network. In this paper, we develop a learning algorithm and an architecture that make use of multiple data streams and processing units, not only during the training phase but also during the inference phase. In particular, the analysis reveals how inference propagates and fuses across a network. We study the design criterion of our proposed method and its bandwidth requirements. Also, we discuss implementation aspects using neural networks in typical wireless radio access; and provide experiments that illustrate benefits over state-of-the-art techniques.
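摘要所述"推理在网络中传播与融合"的一个最简示意:各节点基于本地数据流输出对数域消息,沿网络逐跳累加后归一化,相当于独立观测下的似然乘积。节点数、数值与融合规则均为假设,并非论文的具体架构。

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # 数值稳定
    e = np.exp(z)
    return e / e.sum()

# 三个节点各自观测同一事件的不同数据流,输出各类别的对数似然(示意数值)
node_logits = [
    np.array([2.0, 0.5, 0.1]),   # 节点 1:倾向类别 0
    np.array([1.5, 1.0, 0.2]),   # 节点 2:倾向类别 0
    np.array([0.5, 3.0, 0.3]),   # 节点 3:强烈倾向类别 1
]

# 网络内融合:在对数域逐跳累加各节点的消息(独立观测下等价于似然相乘),
# 最终节点做一次归一化得到融合后验
fused = softmax(np.sum(node_logits, axis=0))
local = [softmax(l) for l in node_logits]

# 融合决策可以不同于多数节点的本地决策:置信度高的节点权重自然更大
print(int(fused.argmax()), [int(p.argmax()) for p in local])
```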

【41】 High dimensional precision matrix estimation under weak sparsity 标题:弱稀疏性下的高维精度矩阵估计

作者:Zeyu Wu,Cheng Wang,Weidong Liu 备注:24 pages, 5 figures 链接:https://arxiv.org/abs/2107.02999 摘要:本文在弱稀疏条件下估计高维精度矩阵,此时矩阵的许多项接近于零而非恰好为零。我们研究了一种Lasso型的高维精度矩阵估计方法,并在弱稀疏条件下给出了一般误差界。常见的不可表示条件(irrepresentable condition)被放宽,所得结果适用于弱稀疏矩阵。作为应用,我们研究了Lasso型方法对重尾数据、非参数正态(nonparanormal)数据以及矩阵数据的精度矩阵估计。 摘要:In this paper, we estimate the high dimensional precision matrix under the weak sparsity condition where many entries are nearly zero. We study a Lasso-type method for high dimensional precision matrix estimation and derive general error bounds under the weak sparsity condition. The common irrepresentable condition is relaxed and the results are applicable to the weak sparse matrix. As applications, we study the precision matrix estimation for the heavy-tailed data, the non-paranormal data, and the matrix data with the Lasso-type method.
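下面用Lasso型方法的基本构件——软阈值算子——示意稀疏精度矩阵估计的思路:对正则化样本协方差求逆,再对非对角项软阈值,把接近零的项压为零。该估计量只是说明性的简化,正则化参数为示例,并非论文中的方法或误差界。

```python
import numpy as np

def soft_threshold(a, lam):
    """软阈值算子:Lasso 型估计中诱导稀疏的基本构件。"""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

rng = np.random.default_rng(0)

# 真实精度矩阵:三对角(非对角项大多恰好为零,近似弱稀疏场景)
p, n = 10, 2000
omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
sigma = np.linalg.inv(omega)
X = rng.multivariate_normal(np.zeros(p), sigma, size=n)

# 示意性的 Lasso 型估计:对加了小幅对角正则的样本协方差求逆,
# 再对非对角项做软阈值,压掉接近零的项
S = np.cov(X, rowvar=False) + 0.05 * np.eye(p)
omega_hat = np.linalg.inv(S)
off = ~np.eye(p, dtype=bool)
omega_hat[off] = soft_threshold(omega_hat[off], 0.05)

err = np.abs(omega_hat - omega).max()
sparsity = (omega_hat[off] == 0).mean()
print(round(float(err), 2), round(float(sparsity), 2))
```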

本文参与腾讯云自媒体同步曝光计划,分享自微信公众号「arXiv每日学术速递」。原始发表:2021-07-09。
