我有一个数据帧,看起来像这样:
W01 0.750000 0.916667 0.642857 1.000000 0.619565
W02 0.880000 0.944444 0.500000 0.991228 0.675439
W03 0.729167 0.900000 0.444444 1.000000 0.611111
W04 0.809524 0.869565 0.500000 1.000000 0.709091
W05 0.625000 0.925926 0.653846 1.000000 0.589286
Variation 1_941119_A/G 1_942335_C/G 1_942451_T/C 1_942934_G/C \
W01 0.967391 0.965909 1 0.130435
W02 0.929825 0.937500 1 0.184211
W03 0.925926 0.880000 1 0.138889
W04 0.918182 0.907407 1 0.200000
W05 0.901786 0.858491 1 0.178571
Variation 1_944296_G/A ... X_155545046_C/T X_155774775_G/T \
W01 0.978261 ... 0.652174 0.641304
W02 0.938596 ... 0.728070 0.736842
W03 0.944444 ... 0.675926 0.685185
W04 0.927273 ... 0.800000 0.690909
W05 0.901786 ... 0.794643 0.705357
Variation Y_5100327_G/T Y_5100614_T/G Y_12786160_G/A Y_12914512_C/A \
W01 0.807692 0.800000 0.730769 0.807692
W02 0.655172 0.653846 0.551724 0.666667
W03 0.880000 0.909091 0.833333 0.916667
W04 0.666667 0.642857 0.580645 0.678571
W05 0.730769 0.720000 0.692308 0.720000
Variation Y_13470103_G/A Y_19705901_A/G Y_20587967_A/C mean_age
W01 0.807692 0.666667 0.333333 56.3
W02 0.678571 0.520000 0.250000 66.3
W03 0.916667 0.764706 0.291667 69.7
W04 0.666667 0.560000 0.322581 71.6
W05 0.703704 0.600000 0.346154 72.5
[5 rows x 67000 columns]
我想为每一列拟合一个简单的最小二乘线性回归和泰尔-森线性回归作为独立变量和均值作为响应变量,并收集每个拟合的汇总统计信息,包括slope
,intercept
,r value
,p value
和std err
,最好收集输出作为数据!
到目前为止,我一直在对我的'df‘进行切片,并分别对每一列进行回归分析:
from scipy import stats
import time
# Start timer
start_time = time.time()
# Select only 'Variation of interest' and 'mean_age' columns
r1 = tdf [['1_944296_G/A', 'mean_age']]
# Use scipy lingress function to perform linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(tdf['mean_age'], \
tdf['1_69270_A/G'])
print('The p-value between the 2 variables is measured as ' + str(p_value) + '\n')
print('Least squares linear model coefficients, intercept = ' + str(intercept) + \
'. Slope = ' + str(slope)+'\n')
# Create regression line
regressLine = intercept + tdf['mean_age']*slope
# Regression using Theil-Sen with 95% confidence intervals
res = stats.theilslopes(tdf['1_69270_A/G'], tdf['mean_age'], 0.95)
print('Thiel-Sen linear model coefficients, intercept = ' + str(res[1]) + '. Slope = ' + \
str(res[0]) +'\n')
# Scatter plot the temperature
plt.clf()
plt.scatter(tdf['mean_age'], tdf['1_69270_A/G'], s = 3, label = 'Allele frequency')
# Add least squares regression line
plt.plot(tdf['mean_age'], regressLine, label = 'Least squares regression line');
# Add Theil-Sen regression line
plt.plot(tdf['mean_age'], res[1] + res[0] * tdf['mean_age'], 'r-', label = 'Theil-Sen regression line')
# Add Theil-Sen confidence intervals
plt.plot(tdf['mean_age'], res[1] + res[2] * tdf['mean_age'], 'r--', label = 'Theil-Sen 95% confidence interval')
plt.plot(tdf['mean_age'], res[1] + res[3] * tdf['mean_age'], 'r--')
# Add legend, axis limits and save to png
plt.legend(loc = 'upper left')
#plt.ylim(7,14); plt.xlim(1755, 2016)
plt.xlabel('Year'); plt.ylabel('Temperature (C)')
plt.savefig('pythonRegress.png')
# End timer
end_time = time.time()
print('Elapsed time = ' + str(end_time - start_time) + ' seconds')
我想知道如何在每个列的迭代循环中执行此分析,并将最终结果收集到一个全面的数据帧中。
我见过这个(Looping regression and obtaining summary statistics in matrix form“循环回归并以矩阵形式获取汇总统计信息")!但并不完全是我期望的输出。欢迎使用Python或R语言编写的任何解决方案!
发布于 2019-06-06 09:54:26
我想你会发现这个指南很有用:Running a model on separate groups。
让我们生成一些与您的数据类似的示例数据,其中包含两个变量和平均年龄的值。我们还需要几个包:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
set.seed(1001)
data1 <- data.frame(mean_age = sample(40:80, 50, replace = TRUE),
snp01 = rnorm(50),
snp02 = rnorm(50))
第一步是使用gather
将“宽”格式转换为“长”格式,因为变量名在一列中,值在另一列中。然后,我们可以通过变量名进行nest
。
data1 %>%
gather(snp, value, -mean_age) %>%
nest(-snp)
这将创建一个tibble (一个特殊的数据框),其中第二列data
是一个“列表列”-它包含了该行中的平均年龄和变量的值:
# A tibble: 2 x 2
snp data
<chr> <list>
1 snp01 <tibble [50 x 2]>
2 snp02 <tibble [50 x 2]>
现在,我们使用purrr::map
为每一行创建第三列和线性模型:
data1 %>%
gather(snp, value, -mean_age) %>%
nest(-snp) %>%
mutate(model = map(data, ~lm(mean_age ~ value, data = .)))
结果:
# A tibble: 2 x 3
snp data model
<chr> <list> <list>
1 snp01 <tibble [50 x 2]> <lm>
2 snp02 <tibble [50 x 2]> <lm>
最后一步是根据需要总结模型,然后对数据结构进行unnest
。我使用的是broom::glance()
。完整的过程:
data1 %>%
gather(snp, value, -mean_age) %>%
nest(-snp) %>%
mutate(model = map(data, ~lm(mean_age ~ value, data = .)),
summary = map(model, glance)) %>%
select(-data, -model) %>%
unnest(summary)
结果:
# A tibble: 2 x 12
snp r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 snp01 0.00732 -0.0134 12.0 0.354 0.555 2 -194. 394. 400. 6901. 48
2 snp02 0.0108 -0.00981 12.0 0.524 0.473 2 -194. 394. 400. 6877. 48
发布于 2019-06-06 09:59:29
我不知道您的数据和分析的确切细节和复杂性,但这是我会采取的方法。
data <- data.frame(mean_age=rnorm(5),
Column_1=rnorm(5),
Column_2=rnorm(5),
Column_3=rnorm(5),
Column_4=rnorm(5),
Column_5=rnorm(5)
)
data
looped <- list()
for(each_col in names(data)[-1]){
looped[[each_col]] <- lm(get(each_col) ~ mean_age, data)
}
looped
https://stackoverflow.com/questions/56469472
复制相似问题