Q: Fitting regressions many times and collecting summary statistics

Stack Overflow user

Asked 2019-06-06 08:02:39

Answers 2 · Views 1.2K · Followers 0 · Votes 1

I have a dataframe that looks like this:

W01           0.750000     0.916667     0.642857      1.000000      0.619565   
W02           0.880000     0.944444     0.500000      0.991228      0.675439   
W03           0.729167     0.900000     0.444444      1.000000      0.611111   
W04           0.809524     0.869565     0.500000      1.000000      0.709091   
W05           0.625000     0.925926     0.653846      1.000000      0.589286   

Variation  1_941119_A/G  1_942335_C/G  1_942451_T/C  1_942934_G/C  \
W01            0.967391      0.965909             1      0.130435   
W02            0.929825      0.937500             1      0.184211   
W03            0.925926      0.880000             1      0.138889   
W04            0.918182      0.907407             1      0.200000   
W05            0.901786      0.858491             1      0.178571   

Variation  1_944296_G/A    ...     X_155545046_C/T  X_155774775_G/T  \
W01            0.978261    ...            0.652174         0.641304   
W02            0.938596    ...            0.728070         0.736842   
W03            0.944444    ...            0.675926         0.685185   
W04            0.927273    ...            0.800000         0.690909   
W05            0.901786    ...            0.794643         0.705357   

Variation  Y_5100327_G/T  Y_5100614_T/G  Y_12786160_G/A  Y_12914512_C/A  \
W01             0.807692       0.800000        0.730769        0.807692   
W02             0.655172       0.653846        0.551724        0.666667   
W03             0.880000       0.909091        0.833333        0.916667   
W04             0.666667       0.642857        0.580645        0.678571   
W05             0.730769       0.720000        0.692308        0.720000   

Variation  Y_13470103_G/A  Y_19705901_A/G  Y_20587967_A/C  mean_age  
W01              0.807692        0.666667        0.333333      56.3  
W02              0.678571        0.520000        0.250000      66.3  
W03              0.916667        0.764706        0.291667      69.7  
W04              0.666667        0.560000        0.322581      71.6  
W05              0.703704        0.600000        0.346154      72.5  

[5 rows x 67000 columns]

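I would like to fit a simple least-squares linear regression and a Theil-Sen linear regression for each column, with the column as the independent variable and mean_age as the response variable, and collect summary statistics for each fit, including slope, intercept, r value, p value and std err. Preferably the output would be collected as a dataframe!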

So far I have been slicing my 'df' and running the regression analysis for each column separately:

from scipy import stats
import matplotlib.pyplot as plt  # needed for the plt.* calls below
import time

# Start timer
start_time = time.time()

# Select only 'Variation of interest' and 'mean_age' columns
r1 = tdf [['1_944296_G/A', 'mean_age']]

# Use scipy linregress function to perform linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(tdf['mean_age'], \
    tdf['1_69270_A/G'])

print('The p-value between the 2 variables is measured as ' + str(p_value) + '\n')
print('Least squares linear model coefficients, intercept = ' + str(intercept) + \
  '. Slope = ' + str(slope)+'\n')

# Create regression line
regressLine = intercept + tdf['mean_age']*slope

# Regression using Theil-Sen with 95% confidence intervals 
res = stats.theilslopes(tdf['1_69270_A/G'], tdf['mean_age'], 0.95)

print('Theil-Sen linear model coefficients, intercept = ' + str(res[1]) + '. Slope = ' + \
  str(res[0]) +'\n')

# Scatter plot of allele frequency against mean age
plt.clf()
plt.scatter(tdf['mean_age'], tdf['1_69270_A/G'], s = 3, label = 'Allele frequency')

# Add least squares regression line
plt.plot(tdf['mean_age'], regressLine, label = 'Least squares regression line'); 

# Add Theil-Sen regression line
plt.plot(tdf['mean_age'], res[1] + res[0] * tdf['mean_age'], 'r-', label = 'Theil-Sen regression line')

# Add Theil-Sen confidence intervals
plt.plot(tdf['mean_age'], res[1] + res[2] * tdf['mean_age'], 'r--', label = 'Theil-Sen 95% confidence interval')
plt.plot(tdf['mean_age'], res[1] + res[3] * tdf['mean_age'], 'r--')

# Add legend, axis limits and save to png
plt.legend(loc = 'upper left')
#plt.ylim(7,14); plt.xlim(1755, 2016)
plt.xlabel('Mean age'); plt.ylabel('Allele frequency')
plt.savefig('pythonRegress.png')

# End timer
end_time = time.time()
print('Elapsed time = ' + str(end_time - start_time) + ' seconds')

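I would like to know how to run this analysis in a loop over every column and collect the final results in one comprehensive dataframe.

I have seen this question (Looping regression and obtaining summary statistics in matrix form), but it does not give quite the output I am after. Solutions in either Python or R are welcome!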

python

pandas

linear-regression

2 Answers

Stack Overflow user

Accepted answer

Posted 2019-06-06 09:54:26

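I think you will find this guide useful: Running a model on separate groups.

Let's generate some example data similar to yours, with values for two variables and for mean age. We also need a few packages: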

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1001)
data1 <- data.frame(mean_age = sample(40:80, 50, replace = TRUE), 
                    snp01 = rnorm(50), 
                    snp02 = rnorm(50))

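The first step is to use gather to convert from "wide" to "long" format, so that the variable names are in one column and the values in another. Then we can nest by variable name.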

data1 %>% 
  gather(snp, value, -mean_age) %>% 
  nest(-snp)

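This creates a tibble (a special kind of data frame) whose second column, data, is a "list column": each row holds the mean ages and values for that row's variable: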

# A tibble: 2 x 2
  snp   data             
  <chr> <list>           
1 snp01 <tibble [50 x 2]>
2 snp02 <tibble [50 x 2]>
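Since the question is tagged python and pandas, here is a rough pandas counterpart of the gather/nest step. It is only a sketch: the small frame is built from two of the columns printed in the question, and using melt plus groupby as stand-ins for gather and nest is my own mapping, not something from the guide.

```python
import pandas as pd

# Small stand-in for the asker's frame, using two columns from the printout.
wide = pd.DataFrame(
    {
        "1_941119_A/G": [0.967391, 0.929825, 0.925926, 0.918182, 0.901786],
        "1_942335_C/G": [0.965909, 0.937500, 0.880000, 0.907407, 0.858491],
        "mean_age": [56.3, 66.3, 69.7, 71.6, 72.5],
    },
    index=["W01", "W02", "W03", "W04", "W05"],
)

# Wide -> long: one row per (sample, variant) pair, similar to tidyr::gather.
long = wide.melt(id_vars="mean_age", var_name="snp", value_name="value")

# Grouping by variant plays the role of nest(): each entry is a small
# two-column frame holding mean_age and value for one variant.
groups = {snp: g[["mean_age", "value"]] for snp, g in long.groupby("snp")}

print(long.head())
print(list(groups))
```

With 67000 columns the long frame gets large, so in pandas it may be simpler to loop over the columns of the wide frame directly.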

Now we use purrr::map to create a third column holding a linear model for each row:

data1 %>% 
  gather(snp, value, -mean_age) %>% 
  nest(-snp) %>% 
  mutate(model = map(data, ~lm(mean_age ~ value, data = .)))

Result:

# A tibble: 2 x 3
  snp   data              model 
  <chr> <list>            <list>
1 snp01 <tibble [50 x 2]> <lm>  
2 snp02 <tibble [50 x 2]> <lm>

The last step is to summarize the models as required and then unnest the data structure. I am using broom::glance() here. The complete procedure:

data1 %>% 
  gather(snp, value, -mean_age) %>% 
  nest(-snp) %>% 
  mutate(model = map(data, ~lm(mean_age ~ value, data = .)), 
         summary = map(model, glance)) %>% 
  select(-data, -model) %>% 
  unnest(summary)

Result:

# A tibble: 2 x 12
  snp   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
  <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
1 snp01   0.00732      -0.0134   12.0     0.354   0.555     2  -194.  394.  400.    6901.          48
2 snp02   0.0108       -0.00981  12.0     0.524   0.473     2  -194.  394.  400.    6877.          48
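For a Python version of the same idea, something along these lines might work. It is a sketch under a few assumptions: the helper name fit_all_columns and the output column names are made up for illustration, each variant column is treated as the predictor with mean_age as the response (as the question describes), and constant columns such as 1_942451_T/C are skipped because no slope can be estimated for them.

```python
import pandas as pd
from scipy import stats

COLUMNS = ["variation", "ols_slope", "ols_intercept", "r_value", "p_value",
           "std_err", "ts_slope", "ts_intercept", "ts_slope_lo", "ts_slope_hi"]


def fit_all_columns(df, response="mean_age"):
    """Fit OLS and Theil-Sen for each column against `response` (hypothetical helper)."""
    y = df[response].to_numpy(dtype=float)
    rows = []
    for col in df.columns.drop(response):
        x = df[col].to_numpy(dtype=float)
        if x.max() == x.min():
            continue  # constant predictor: slope is undefined, skip it
        ols = stats.linregress(x, y)          # x = variant, y = mean_age
        ts = stats.theilslopes(y, x, 0.95)    # note the (y, x) argument order
        rows.append([col, ols.slope, ols.intercept, ols.rvalue, ols.pvalue,
                     ols.stderr, ts[0], ts[1], ts[2], ts[3]])
    return pd.DataFrame(rows, columns=COLUMNS).set_index("variation")


# Three columns taken from the printout in the question.
tdf = pd.DataFrame({
    "1_941119_A/G": [0.967391, 0.929825, 0.925926, 0.918182, 0.901786],
    "1_942451_T/C": [1.0, 1.0, 1.0, 1.0, 1.0],
    "1_942934_G/C": [0.130435, 0.184211, 0.138889, 0.200000, 0.178571],
    "mean_age": [56.3, 66.3, 69.7, 71.6, 72.5],
}, index=["W01", "W02", "W03", "W04", "W05"])

summary = fit_all_columns(tdf)
print(summary)
```

Each row of summary is one variant, so the result can be sorted or filtered by p_value like any other dataframe. With only five samples per fit, the p-values and Theil-Sen intervals will be very rough.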

Votes 2

Stack Overflow user

Posted 2019-06-06 09:59:29

I don't know the exact details and complexity of your data and analysis, but this is the approach I would take.

data <- data.frame(mean_age=rnorm(5),
                   Column_1=rnorm(5),
                   Column_2=rnorm(5),
                   Column_3=rnorm(5),
                   Column_4=rnorm(5),
                   Column_5=rnorm(5)
                   )
data


looped <- list()

for(each_col in names(data)[-1]){
    looped[[each_col]] <- lm(get(each_col) ~ mean_age, data)

}

looped
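The loop above keeps the fitted model objects rather than a table of statistics. If only the least-squares numbers are needed, then with 67000 columns a vectorized NumPy computation may be faster than fitting one model at a time. The sketch below is my own, not part of this answer: vectorized_ols is a made-up name, it regresses mean_age on each column (the direction described in the question), assumes there are no missing values, and drops constant columns.

```python
import numpy as np
import pandas as pd
from scipy import stats


def vectorized_ols(df, response="mean_age"):
    """Simple OLS of `response` on every other column at once (hypothetical helper)."""
    y = df[response].to_numpy(dtype=float)
    X = df.drop(columns=response)
    x = X.to_numpy(dtype=float)
    keep = x.max(axis=0) > x.min(axis=0)   # constant columns have no slope
    x = x[:, keep]

    xc = x - x.mean(axis=0)                # centered predictors
    yc = y - y.mean()                      # centered response
    sxx = (xc ** 2).sum(axis=0)
    sxy = xc.T @ yc
    syy = (yc ** 2).sum()

    slope = sxy / sxx
    intercept = y.mean() - slope * x.mean(axis=0)
    r = sxy / np.sqrt(sxx * syy)
    dof = len(y) - 2
    resid_ss = np.clip(syy - slope * sxy, 0.0, None)  # guard against rounding below 0
    std_err = np.sqrt(resid_ss / dof / sxx)
    with np.errstate(divide="ignore"):
        t = slope / std_err
    p = 2 * stats.t.sf(np.abs(t), dof)     # two-sided p-value for slope = 0

    return pd.DataFrame(
        {"slope": slope, "intercept": intercept, "r_value": r,
         "p_value": p, "std_err": std_err},
        index=X.columns[keep],
    )


# Synthetic demo data, only to show the call.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((30, 3)), columns=["snp01", "snp02", "snp03"])
demo["mean_age"] = rng.normal(65, 6, 30)
print(vectorized_ols(demo))
```

The formulas are the textbook ones for simple linear regression, so the output should agree with scipy.stats.linregress run column by column. Theil-Sen has no comparably simple closed form, so it would still need a loop.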

Votes 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/56469472
