Q: Running an OLS regression with a Pandas DataFrame

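Stack Overflow user

Asked 2013-11-15 08:47:00

3 answers · 230.1K views · 0 followers · 124 votes

I have a pandas DataFrame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example: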

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50], 
                   "B": [20, 30, 10, 40, 50], 
                   "C": [32, 234, 23, 23, 42523]})

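Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries such as scikit-learn, they appear to feed the data to the model as a list of rows rather than columns. That would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most idiomatic way to run an OLS regression (or, more generally, any machine learning algorithm) on data in a pandas DataFrame?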

pandas

scikit-learn

regression

statsmodels

python

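3 Answers

Stack Overflow user

Answered 2013-11-15 09:05:30

I think you can do almost exactly what you thought would be ideal by using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).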

>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================

Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
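
Since the question is about predicting A from B and C, here is a minimal sketch of how the fitted formula model might be used for prediction. The rows in new_df are made-up values for illustration only, and the alias smf is an assumption (the session above imports the same module as sm).

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Fit the same model as in the session above.
result = smf.ols(formula="A ~ B + C", data=df).fit()

# Hypothetical new observations (illustrative values, not from the question).
new_df = pd.DataFrame({"B": [25, 45], "C": [100, 1000]})

# predict() applies the formula to the new frame, so only columns B and C are needed.
predicted = result.predict(new_df)
print(predicted)

# The same prediction written out from the fitted parameters.
manual = (result.params["Intercept"]
          + result.params["B"] * new_df["B"]
          + result.params["C"] * new_df["C"])
print(manual)
```

Because the model was specified with a formula, the new DataFrame only has to carry columns with the same names; the intercept is added automatically.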

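Votes: 173

Stack Overflow user

Answered 2017-01-07 10:51:15

I don't know whether this is new in sklearn or in pandas, but I am able to pass the DataFrame directly to sklearn without converting it to a numpy array or any other data type.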

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])

>>> reg.coef_
array([  4.01182386e-01,   3.51587361e-04])
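
As a small follow-up sketch, the fitted estimator also exposes the intercept, and it can predict directly from another DataFrame with the same columns. The rows in new_df are made-up values for illustration, and wrapping coef_ in a pandas Series is just one convenient way to label the coefficients.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

features = ["B", "C"]
reg = LinearRegression()
reg.fit(df[features], df["A"])

# coef_ is a bare array; label it with the column names for readability.
coefs = pd.Series(reg.coef_, index=features)
print("intercept:", reg.intercept_)
print(coefs)

# Hypothetical new observations (illustrative values, not from the question).
# The columns must match the ones used for fitting, in the same order.
new_df = pd.DataFrame({"B": [25, 45], "C": [100, 1000]})
pred = reg.predict(new_df[features])
print(pred)
```

Note that, unlike statsmodels, sklearn does not report standard errors or p-values, so it is the better fit when only predictions are needed.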

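Votes: 34

Stack Overflow user

Answered 2021-02-13 02:31:07

B is not statistically significant. The data do not support drawing inferences from it. C does influence the B probabilities.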

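 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt
 import statsmodels.formula.api as smf

 df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})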

 avg_c=df['C'].mean()
 sumC=df['C'].apply(lambda x: x if x<avg_c else 0).sum()
 countC=df['C'].apply(lambda x: 1 if x<avg_c else None).count()
 avg_c2=sumC/countC
 df['C']=df['C'].apply(lambda x: avg_c2 if x >avg_c else x)
 
 print(df)

 model_ols = smf.ols("A ~ B+C",data=df).fit()

 print(model_ols.summary())

 df[['B','C']].plot()
 plt.show()


 df2=pd.DataFrame()
 df2['B']=np.linspace(10,50,10)
 df2['C']=30

 df3=pd.DataFrame()
 df3['B']=np.linspace(10,50,10)
 df3['C']=100

 predB=model_ols.predict(df2)
 predC=model_ols.predict(df3)
 plt.plot(df2['B'],predB,label='predict B C=30')
 plt.plot(df3['B'],predC,label='predict B C=100')
 plt.legend()
 plt.show()

 print("A change in the probability of C affects the probability of B")

 intercept=model_ols.params.loc['Intercept']
 B_slope=model_ols.params.loc['B']
 C_slope=model_ols.params.loc['C']
 #Intercept    11.874252
 #B             0.760859
 #C            -0.060257

 print("Intercept {}\n B slope{}\n C    slope{}\n".format(intercept,B_slope,C_slope))


 #lower_conf,upper_conf=np.exp(model_ols.conf_int())
 #print(lower_conf,upper_conf)
 #print((1-(lower_conf/upper_conf))*100)

 model_cov=model_ols.cov_params()
 std_errorB = np.sqrt(model_cov.loc['B', 'B'])
 std_errorC = np.sqrt(model_cov.loc['C', 'C'])
 print('Standard Error: ', round(std_errorB, 4),round(std_errorC, 4))
 #check for statistically significant
 print("B z value {} C z value {}".format((B_slope/std_errorB),(C_slope/std_errorC)))
 print("B feature is more statistically significant than C")


 Output:

 A change in the probability of C affects the probability of B
 Intercept 11.874251554067563
 B slope0.7608594144571961
 C slope-0.060256845997223814

 Standard Error:  0.4519 0.0793
 B z value 1.683510336937001 C z value -0.7601036314930376
 B feature is more statistically significant than C

 z>2 is statistically significant
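
The standard errors and ratios computed by hand above can be cross-checked against attributes that statsmodels already exposes on the fitted results (bse, tvalues, pvalues). The sketch below repeats the outlier replacement from this answer in a slightly different form, which is my own rewrite rather than the author's code. For OLS, statsmodels labels the coefficient-to-standard-error ratio a t-value rather than a z-value.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# Replace values of C that are not below the overall mean with the mean of
# the remaining values (for this data it matches the answer's step).
df["C"] = df["C"].astype(float)
below = df["C"] < df["C"].mean()
df["C"] = df["C"].where(below, df.loc[below, "C"].mean())

model = smf.ols("A ~ B + C", data=df).fit()

# Manual computation, as in the answer: square roots of the covariance diagonal.
manual_se = np.sqrt(np.diag(model.cov_params()))
manual_t = model.params / manual_se

# Built-in equivalents.
print(model.bse)      # standard errors
print(model.tvalues)  # coefficient / standard error
print(model.pvalues)  # two-sided p-values from the t distribution
```

With only two residual degrees of freedom here, the p-values are a more reliable guide than a fixed threshold on the ratio.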

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/19991445
