I have a pandas DataFrame and would like to predict the values of column A from the values in columns B and C. Here is a toy example:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50],
"B": [20, 30, 10, 40, 50],
"C": [32, 234, 23, 23, 42523]})
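Ideally I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries such as scikit-learn, they appear to feed the data to the model as a list of rows rather than columns. That would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most idiomatic way to run an OLS regression (or, more generally, any machine learning algorithm) on data in a pandas DataFrame?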
Posted on 2013-11-15 09:05:30
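I think you can do almost exactly what you described as ideal by using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).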
>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================

Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
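Since the question is about predicting A, here is a minimal sketch of how the fitted formula model could be applied to new rows. The `new_rows` frame and its values are made up for illustration; only `ols`, `fit`, `params` and `predict` come from statsmodels.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

result = smf.ols(formula="A ~ B + C", data=df).fit()

# Hypothetical new observations (illustrative values, not from the question).
# Only the predictor columns named in the formula are required.
new_rows = pd.DataFrame({"B": [25, 45], "C": [100, 1000]})

# predict() applies the same formula to the new frame and returns a Series.
predicted_A = result.predict(new_rows)
print(predicted_A)
```

Because the model was built from a formula, the new data can stay in a DataFrame; the column names are what tie it back to the fitted coefficients.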
Posted on 2017-01-07 10:51:15
I don't know whether this is new in sklearn or in pandas, but I am able to pass the DataFrame directly to sklearn without converting it to a numpy array or any other data type.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])
>>> reg.coef_
array([ 4.01182386e-01, 3.51587361e-04])
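As a rough sketch of how this can be taken a step further, the coefficients can be labelled with the DataFrame's column names and the fitted model can predict from another DataFrame. The `new_rows` values are invented for illustration, and this assumes a scikit-learn version that accepts DataFrames as input, as described above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

X = df[["B", "C"]]   # predictors stay a DataFrame
y = df["A"]          # target stays a Series

reg = LinearRegression().fit(X, y)

# coef_ is a bare array, so attach the column names to keep track of
# which coefficient belongs to which predictor.
coefs = pd.Series(reg.coef_, index=X.columns)
print(coefs)
print("intercept:", reg.intercept_)

# Hypothetical new observations (illustrative values only); the columns
# must match the ones used for fitting.
new_rows = pd.DataFrame({"B": [25, 45], "C": [100, 1000]})
predicted_A = reg.predict(new_rows)
print(predicted_A)
```

Note that `predict` returns a numpy array rather than a Series, so any index on `new_rows` has to be reattached by hand if it matters.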
Posted on 2021-02-13 02:31:07
B is not statistically significant, so the data do not support drawing inferences from it. C does influence the probability of B.
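import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})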
avg_c=df['C'].mean()
sumC=df['C'].apply(lambda x: x if x<avg_c else 0).sum()
countC=df['C'].apply(lambda x: 1 if x<avg_c else None).count()
avg_c2=sumC/countC
df['C']=df['C'].apply(lambda x: avg_c2 if x >avg_c else x)
print(df)
model_ols = smf.ols("A ~ B+C",data=df).fit()
print(model_ols.summary())
df[['B','C']].plot()
plt.show()
df2=pd.DataFrame()
df2['B']=np.linspace(10,50,10)
df2['C']=30
df3=pd.DataFrame()
df3['B']=np.linspace(10,50,10)
df3['C']=100
predB=model_ols.predict(df2)
predC=model_ols.predict(df3)
plt.plot(df2['B'],predB,label='predict B C=30')
plt.plot(df3['B'],predC,label='predict B C=100')
plt.legend()
plt.show()
print("A change in the probability of C affects the probability of B")
intercept=model_ols.params.loc['Intercept']
B_slope=model_ols.params.loc['B']
C_slope=model_ols.params.loc['C']
#Intercept 11.874252
#B 0.760859
#C -0.060257
print("Intercept {}\n B slope{}\n C slope{}\n".format(intercept,B_slope,C_slope))
#lower_conf,upper_conf=np.exp(model_ols.conf_int())
#print(lower_conf,upper_conf)
#print((1-(lower_conf/upper_conf))*100)
model_cov=model_ols.cov_params()
std_errorB = np.sqrt(model_cov.loc['B', 'B'])
std_errorC = np.sqrt(model_cov.loc['C', 'C'])
print('Standard Error: ', round(std_errorB, 4),round(std_errorC, 4))
#check for statistically significant
print("B z value {} C z value {}".format((B_slope/std_errorB),(C_slope/std_errorC)))
print("B feature is more statistically significant than C")
Output:
A change in the probability of C affects the probability of B
Intercept 11.874251554067563
B slope0.7608594144571961
C slope-0.060256845997223814
Standard Error: 0.4519 0.0793
B z value 1.683510336937001 C z value -0.7601036314930376
B feature is more statistically significant than C
z>2 is statistically significant
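The standard errors and coefficient/SE ratios computed by hand above can be cross-checked against attributes that the fitted statsmodels result already exposes. The sketch below uses the original, unmodified toy data (not the version with the adjusted C column), so its numbers will differ from the output above. With only five observations and two residual degrees of freedom, the ratio follows a t distribution, so a |z| > 2 rule of thumb is optimistic here; the p-values reported by statsmodels account for that.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

result = smf.ols("A ~ B + C", data=df).fit()

# Manual standard errors: square roots of the covariance matrix diagonal.
manual_se = np.sqrt(np.diag(result.cov_params().to_numpy()))

# The same quantities as exposed by the fitted result, collected in one table.
stats = pd.DataFrame({"coef": result.params,
                      "se": result.bse,        # standard errors
                      "t": result.tvalues,     # coef / se
                      "p": result.pvalues})    # two-sided p-values
print(stats)

# 95% confidence intervals for each coefficient.
print(result.conf_int())
```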
https://stackoverflow.com/questions/19991445