Note:
IO path parameters can be viewed under Advanced Settings.

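XGBoost Regression

Principle

Algorithm Description

XGBoost (eXtreme Gradient Boosting) is an open-source framework that implements an optimized gradient boosting algorithm. It can be used for both regression and classification, and it is one of the most popular toolkits in data science competitions.

Parameter Configuration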

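Algorithm IO parameters
*Input file type: one of the following two formats:
csv: a CSV file
*Input data contains header: defaults to "Yes".
*Input data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Output file type: one of the following two formats:
csv: a CSV file
*Output data contains header: defaults to "Yes".
*Output data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Feature columns: the columns used as training features, numbered from 0. For a database table, tick the columns directly. For a plain file path, enter ranges of the form a-b, single columns c, or a mix of both, separated by ASCII commas (for example, 0-10,15,17-19 selects columns 0 to 10, column 15, and columns 17 to 19, for a total of 15 feature columns).
*Label column: the column used as the label, given as a column index or a column name.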
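The feature column syntax described above can be illustrated with a small parser. This is only a sketch of how such a specification might be expanded; the function name `parse_feature_columns` is hypothetical and is not part of the platform.

```python
def parse_feature_columns(spec):
    """Expand a spec such as "0-10,15,17-19" into a sorted list of column indices."""
    columns = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range a-b is inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError("invalid range: " + part)
            columns.update(range(start, end + 1))
        else:
            columns.add(int(part))
    return sorted(columns)


features = parse_feature_columns("0-10,15,17-19")
print(len(features))  # 15
```

The same function applied to `0-12`, the value used in the demos on this page, yields the 13 columns 0 through 12.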
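Algorithm parameters
*early_stopping_rounds: integer, default 100; the number of early-stopping rounds. If the evaluation metric on the validation set does not improve for this many consecutive iterations (100 by default), training stops.
*learning_rate: decimal, default 0.1; the learning rate, that is, the step size of each weight update. Typical values are 0.01 to 0.2.
*n_estimators: integer, default 1000; the maximum number of iterations.
*n_jobs: integer, default 1; the number of threads to use.
*gamma: float, default 0.0; the minimum loss reduction required to split a node. A node is split only if the split reduces the loss function by more than gamma, so gamma sets the lower threshold for that reduction. With gamma set to 0, a node is split whenever the split reduces the loss at all.
*min_child_weight: float, default 1.0; the minimum sum of sample weights required in a child node. A split that would leave a child below this sum is not made, so larger values make the algorithm more conservative.
*subsample: float, default 1.0; the fraction of the full sample set used to train each tree. Setting it to 0.5 means XGBoost builds each tree on a randomly drawn 50% of the samples, which helps prevent overfitting.
*reg_alpha: float, default 0.0; the coefficient of the L1 regularization term on the weights.
*reg_lambda: float, default 0.0; the coefficient of the L2 regularization term on the weights.
Training node output
A model in pkl format, saved under a path generated by the backend.
*Feature importance: data in .csv format with two comma-separated columns; the first column is the feature name and the second is the importance value. The importance is computed by summing, over all trees, the number of times the feature is used to split the data.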
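The early_stopping_rounds rule described above can be sketched in a few lines of plain Python. The sketch assumes a metric where lower is better (such as RMSE) and a precomputed list of validation scores; the function name and the sample scores are illustrative, not taken from the platform or from XGBoost itself.

```python
def early_stopping_iteration(valid_scores, early_stopping_rounds):
    """Return (stop_iteration, best_iteration) for a lower-is-better metric."""
    best_score = float("inf")
    best_iteration = -1
    for iteration, score in enumerate(valid_scores):
        if score < best_score:
            # The metric improved: remember this iteration as the best so far.
            best_score = score
            best_iteration = iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            # No improvement for early_stopping_rounds iterations: stop here.
            return iteration, best_iteration
    # The metric kept improving often enough; all iterations were used.
    return len(valid_scores) - 1, best_iteration


scores = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.05]
print(early_stopping_iteration(scores, 3))    # (5, 2)
print(early_stopping_iteration(scores, 100))  # (6, 2)
```

With a patience of 3, iteration stops at index 5 because the best score was last seen at index 2. With the default of 100, a run this short never triggers early stopping.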

Demo

Input Data Example

The first row of the example contains the column names, and the last column is the label.
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
0.21124,12.5,7.87,0,0.524,5.631,100,6.0821,5,311,15.2,386.63,29.93,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
0.09378,12.5,7.87,0,0.524,5.889,39,5.4509,5,311,15.2,390.5,15.71,21.7
0.62976,0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21,396.9,8.26,20.4
0.63796,0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21,380.02,10.26,18.2
0.62739,0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21,395.62,8.47,19.9
1.05393,0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21,386.85,6.58,23.1
0.7842,0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21,386.75,14.67,17.5
0.80271,0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21,288.99,11.69,20.2

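Parameter Configuration (Training Node)

Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Feature columns: 0-12
*Label column: 13
Algorithm parameters
*early_stopping_rounds: 100
*learning_rate: 0.1
*n_estimators: 1000
*n_jobs: -1, which uses the maximum number of threads available on the current system.
*gamma: 0.0
*min_child_weight: 1.0
*subsample: 1.0
*reg_alpha: 0.0
*reg_lambda: 0.0

Parameter Configuration (Prediction Node)

After the training operator has trained successfully, connect its model path port to the model path port of the prediction operator. For the detailed configuration, see "Algorithm IO parameters".

Drag the test data onto the input data port of the prediction operator, then run the prediction operator to predict on the test data.
Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Output file type: csv
*Output data contains header: Yes
*Output data delimiter: comma
*Feature columns: 0-12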

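Prediction Node Output Data Example

The output is a CSV file whose first row contains the column names; the last column, y_pred, holds the model's predictions.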
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL,y_pred
0.00632,18.0,2.31,0,0.5379999999999999,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,24.000237
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,21.60027
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,34.699688
0.032369999999999996,0.0,2.18,0,0.45799999999999996,6.997999999999999,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,33.399853
0.06905,0.0,2.18,0,0.45799999999999996,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,36.199017
0.02985,0.0,2.18,0,0.45799999999999996,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,28.699585
0.08829,12.5,7.87,0,0.524,6.0120000000000005,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,22.899467
0.14455,12.5,7.87,0,0.524,6.172000000000001,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,27.099155
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,16.500153
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,18.900005
0.22489,12.5,7.87,0,0.524,6.377000000000001,94.3,6.3467,5,311,15.2,392.52,20.45,15.0,15.001032
0.11747,12.5,7.87,0,0.524,6.0089999999999995,82.9,6.2267,5,311,15.2,396.9,13.27,18.9,18.900488
0.09378,12.5,7.87,0,0.524,5.888999999999999,39.0,5.4509,5,311,15.2,390.5,15.71,21.7,21.700161
0.62976,0.0,8.14,0,0.5379999999999999,5.949,61.8,4.7075,4,307,21.0,396.9,8.26,20.4,20.399849
0.6379600000000001,0.0,8.14,0,0.5379999999999999,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2,18.200222
0.62739,0.0,8.14,0,0.5379999999999999,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9,19.900532
1.05393,0.0,8.14,0,0.5379999999999999,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1,23.09971
0.7842,0.0,8.14,0,0.5379999999999999,5.99,81.7,4.2579,4,307,21.0,386.75,14.67,17.5,17.500738
0.80271,0.0,8.14,0,0.5379999999999999,5.456,36.6,3.7965,4,307,21.0,288.99,11.69,20.2,20.199768

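Decision Tree Regression (DecisionTreeRegressor)

Principle

Algorithm Description

DecisionTreeRegressor (decision tree) belongs to a family of classification and regression algorithms widely used in machine learning. Decision trees are easy to interpret, can handle categorical features, need no feature scaling, and can represent nonlinear models. The operator supports up to millions of samples.

Parameter Configuration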

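Algorithm IO parameters
*Input file type: one of the following two formats:
csv: a CSV file
*Input data contains header: defaults to "Yes".
*Input data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Output file type: one of the following two formats:
csv: a CSV file
*Output data contains header: defaults to "Yes".
*Output data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Feature columns: the columns used as training features, numbered from 0. For a database table, tick the columns directly. For a plain file path, enter ranges of the form a-b, single columns c, or a mix of both, separated by ASCII commas (for example, 0-10,15,17-19 selects columns 0 to 10, column 15, and columns 17 to 19, for a total of 15 feature columns).
*Label column: the column used as the label; the feature and label values must be of type double.
Algorithm parameters
*Model save format: the model is saved in ML or PMML format under a path generated by the backend.
*maxBins: the maximum number of bins used to discretize continuous features when computing split points. The minimum is 2. This is a discrete positive integer; a reasonable approach is to tune it around the default value of 32.
*maxDepth: the maximum depth of the decision tree. This is a discrete positive integer; a reasonable approach is to tune it around the default value of 5.
*minInfoGain: the minimum information gain required for the decision tree to split a node.
*minInstancesPerNode: the minimum number of samples each node of the decision tree must contain, which controls whether a node continues to split. This is a discrete positive integer with a default of 1; adjust it according to the number of samples.
*checkpointInterval: sets a checkpoint every this many iterations. When the number of iterations is very large, checkpoints reduce the risk of cascading recomputation caused by failed compute nodes.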
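To make minInfoGain and minInstancesPerNode more concrete, the following sketch checks whether one candidate split of a regression tree node would be accepted. It assumes that information gain for regression is measured as variance reduction and that a gain equal to the threshold is still accepted; both points, along with all function names, are assumptions of this illustration rather than a description of the operator's internals.

```python
def variance(values):
    """Population variance of a non-empty list of labels."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_gain(left, right):
    """Variance reduction obtained by splitting the parent labels into left/right."""
    parent = left + right
    weighted_children = (
        len(left) * variance(left) + len(right) * variance(right)
    ) / len(parent)
    return variance(parent) - weighted_children


def split_allowed(left, right, min_info_gain=0.0, min_instances_per_node=1):
    """Apply the two thresholds to one candidate split."""
    # Each child must hold at least min_instances_per_node samples (and never be empty).
    if min(len(left), len(right)) < max(1, min_instances_per_node):
        return False
    return split_gain(left, right) >= min_info_gain


left_labels, right_labels = [1.0, 1.0], [3.0, 3.0]
print(split_gain(left_labels, right_labels))                               # 1.0
print(split_allowed(left_labels, right_labels, min_info_gain=0.5))         # True
print(split_allowed(left_labels, right_labels, min_instances_per_node=3))  # False
```

Raising either threshold rejects more splits, which produces smaller trees.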

Demo

Input Data Example

The first row of the example contains the column names, and the last column is the label.
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
0.21124,12.5,7.87,0,0.524,5.631,100,6.0821,5,311,15.2,386.63,29.93,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
0.09378,12.5,7.87,0,0.524,5.889,39,5.4509,5,311,15.2,390.5,15.71,21.7
0.62976,0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21,396.9,8.26,20.4
0.63796,0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21,380.02,10.26,18.2
0.62739,0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21,395.62,8.47,19.9
1.05393,0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21,386.85,6.58,23.1
0.7842,0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21,386.75,14.67,17.5
0.80271,0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21,288.99,11.69,20.2

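Parameter Configuration

Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Feature columns: 0-12
*Label column: 13
Algorithm parameters
*Model save format: PMML
*maxBins: 32
*maxDepth: 5
*minInfoGain: 0.0
*minInstancesPerNode: 1
*checkpointInterval: -1

Tree Model Visualization

After a tree-based algorithm finishes running, the resulting model can be viewed as a visualization. In the tree diagram, blue nodes are the decision nodes that test a feature; a solid line marks the path taken when the test evaluates to "yes", and a dashed line marks the path taken when it evaluates to "no". Green nodes are the classification results.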



Note:
Visualization is available only when the PMML format is selected at training time (the ML format is not supported).

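Prediction Node Parameter Configuration

After the training operator has trained successfully, connect its model path port to the model path port of the prediction operator. For the detailed configuration, see "Algorithm IO parameters".

Drag the test data onto the input data port of the prediction operator, then run the prediction operator to predict on the test data.
Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Output file type: csv
*Output data contains header: Yes
*Output data delimiter: comma
*Feature columns: 0-12
Model parameters
*algorithm: DecisionTreeRegressor
*Model import format: PMML

Output Data Example

With PMML selected as the model save format, the output is a CSV file whose first row contains the column names; the last column, prediction, holds the model's predictions.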
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL,prediction
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,22.55
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,22.55
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,34.7
0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,33.39999999999999
0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,36.2
0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,22.55
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,27.10000000000001
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,18.9
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0,15.0
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9,18.89999999999999
0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.5,15.71,21.7,22.55
0.62976,0.0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21.0,396.9,8.26,20.4,20.3
0.63796,0.0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2,17.85
0.62739,0.0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9,19.9
1.05393,0.0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1,23.1
0.7842,0.0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21.0,386.75,14.67,17.5,17.85
0.80271,0.0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21.0,288.99,11.69,20.2,20.3

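Polynomial Regression

Principle

Algorithm Description

Polynomial regression applies a polynomial transformation to the data before training a linear regression model, which expands the dimensionality of the feature set.

Parameter Configuration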

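Algorithm IO parameters
*Input file type: one of the following two formats:
csv: a CSV file
*Input data contains header: defaults to "Yes".
*Input data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Output file type: one of the following two formats:
csv: a CSV file
*Output data contains header: defaults to "Yes".
*Output data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Feature columns: the columns used as training features, numbered from 0. For a database table, tick the columns directly. For a plain file path, enter ranges of the form a-b, single columns c, or a mix of both, separated by ASCII commas (for example, 0-10,15,17-19 selects columns 0 to 10, column 15, and columns 17 to 19, for a total of 15 feature columns).
*Label column: the column used as the label, given as a column index or a column name.
Algorithm parameters
*degree: the degree of the polynomial, an integer; default degree=2. Suppose the input data has only two columns, x1 and x2. With degree=2, the transformation produces the degree-0 term 1, the degree-1 terms x1 and x2, and the degree-2 terms x1^2, x2^2, and x1x2. The original 2 feature columns are thus expanded to 6, which are then fed to a linear regression model for training.
Output
A model in pkl format, saved under a path generated by the backend.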
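The expansion described for degree can be reproduced with a short sketch. It enumerates every product of input columns up to the given degree; the function name `polynomial_features` and the ordering of the output terms are choices made for this illustration, and the operator's internal ordering may differ.

```python
from itertools import combinations_with_replacement


def polynomial_features(row, degree=2):
    """Return all monomials of the values in `row` with total degree 0..degree."""
    features = []
    for d in range(degree + 1):
        # Each combination picks d column indices (with repetition) to multiply.
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for index in combo:
                term *= row[index]
            features.append(term)
    return features


# Two columns x1=2, x2=3 become 1, x1, x2, x1^2, x1*x2, x2^2.
print(polynomial_features([2.0, 3.0]))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```

For the 13 feature columns used in the demo on this page, the same enumeration with degree=2 yields 105 columns, so the feature count grows quickly with both the number of inputs and the degree.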

Demo

Input Data Example

The first row of the example contains the column names, and the last column is the label.
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
0.21124,12.5,7.87,0,0.524,5.631,100,6.0821,5,311,15.2,386.63,29.93,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
0.09378,12.5,7.87,0,0.524,5.889,39,5.4509,5,311,15.2,390.5,15.71,21.7
0.62976,0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21,396.9,8.26,20.4
0.63796,0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21,380.02,10.26,18.2
0.62739,0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21,395.62,8.47,19.9
1.05393,0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21,386.85,6.58,23.1
0.7842,0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21,386.75,14.67,17.5
0.80271,0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21,288.99,11.69,20.2

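Parameter Configuration (Training Node)

Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Feature columns: 0-12
*Label column: 13
Algorithm parameters
*degree: 2

Parameter Configuration (Prediction Node)

After the training operator has trained successfully, connect its model path port to the model path port of the prediction operator. For the detailed configuration, see "Algorithm IO parameters".

Drag the test data onto the input data port of the prediction operator, then run the prediction operator to predict on the test data.
Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Output file type: csv
*Output data contains header: Yes
*Output data delimiter: comma
*Feature columns: 0-12

Prediction Node Output Data Example

The output is a CSV file whose first row contains the column names; the last column, y_pred, holds the model's predictions.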
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL,y_pred
0.00632,18.0,2.31,0,0.5379999999999999,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,23.99999999999943
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,21.59999999999917
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,34.70000000000016
0.032369999999999996,0.0,2.18,0,0.45799999999999996,6.997999999999999,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,33.4000000000006
0.06905,0.0,2.18,0,0.45799999999999996,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,36.2000000000009
0.02985,0.0,2.18,0,0.45799999999999996,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,28.70000000000067
0.08829,12.5,7.87,0,0.524,6.0120000000000005,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,22.900000000000034
0.14455,12.5,7.87,0,0.524,6.172000000000001,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,27.100000000000136
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,18.900000000000205
0.22489,12.5,7.87,0,0.524,6.377000000000001,94.3,6.3467,5,311,15.2,392.52,20.45,15.0,15.000000000000625
0.11747,12.5,7.87,0,0.524,6.0089999999999995,82.9,6.2267,5,311,15.2,396.9,13.27,18.9,18.900000000000603
0.09378,12.5,7.87,0,0.524,5.888999999999999,39.0,5.4509,5,311,15.2,390.5,15.71,21.7,21.699999999999534
0.62976,0.0,8.14,0,0.5379999999999999,5.949,61.8,4.7075,4,307,21.0,396.9,8.26,20.4,20.39999999999992
0.6379600000000001,0.0,8.14,0,0.5379999999999999,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2,18.199999999999875
0.62739,0.0,8.14,0,0.5379999999999999,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9,19.899999999999523
1.05393,0.0,8.14,0,0.5379999999999999,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1,23.099999999999795
0.7842,0.0,8.14,0,0.5379999999999999,5.99,81.7,4.2579,4,307,21.0,386.75,14.67,17.5,17.50000000000017
0.80271,0.0,8.14,0,0.5379999999999999,5.456,36.6,3.7965,4,307,21.0,288.99,11.69,20.2,20.200000000001694

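Ridge Regression

Principle

Algorithm Description

Ridge regression adds an L2 regularization term to linear regression. The penalty shrinks the regression coefficients w toward 0, which helps prevent overfitting.

Parameter Configuration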

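Algorithm IO parameters
*Input file type: one of the following two formats:
csv: a CSV file
*Input data contains header: defaults to "Yes".
*Input data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Output file type: one of the following two formats:
csv: a CSV file
*Output data contains header: defaults to "Yes".
*Output data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Feature columns: the columns used as training features, numbered from 0. For a database table, tick the columns directly. For a plain file path, enter ranges of the form a-b, single columns c, or a mix of both, separated by ASCII commas (for example, 0-10,15,17-19 selects columns 0 to 10, column 15, and columns 17 to 19, for a total of 15 feature columns).
*Label column: the column used as the label, given as a column index or a column name.
Algorithm parameters
*alpha: the regularization coefficient, a float; default alpha=1.0. A larger alpha penalizes the model more heavily, so the regression coefficients w become smaller and move closer to 0. This usually lowers performance on the training set but improves the model's ability to generalize. When alpha=0, ridge regression is equivalent to ordinary linear regression.
Output
A model in pkl format, saved under a path generated by the backend.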
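The effect of alpha can be seen in the simplest possible case: one feature and no intercept. Minimizing sum((y - w*x)^2) + alpha*w^2 gives the closed form w = sum(x*y) / (sum(x^2) + alpha). The sketch below evaluates that formula for a few alpha values; it is a one-dimensional illustration with made-up data, not the operator's implementation.

```python
def ridge_coefficient(xs, ys, alpha=1.0):
    """Closed-form ridge solution for a single feature without intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Data generated exactly by y = 2x, so the unpenalized coefficient is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for alpha in (0.0, 1.0, 10.0):
    print(alpha, ridge_coefficient(xs, ys, alpha))
```

With alpha=0 the result is the ordinary least-squares coefficient 2.0. As alpha grows the coefficient shrinks toward 0 (28/15 for alpha=1 and 28/24 for alpha=10).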

Demo

Input Data Example

The first row of the example contains the column names, and the last column is the label.
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
0.21124,12.5,7.87,0,0.524,5.631,100,6.0821,5,311,15.2,386.63,29.93,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
0.09378,12.5,7.87,0,0.524,5.889,39,5.4509,5,311,15.2,390.5,15.71,21.7
0.62976,0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21,396.9,8.26,20.4
0.63796,0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21,380.02,10.26,18.2
0.62739,0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21,395.62,8.47,19.9
1.05393,0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21,386.85,6.58,23.1
0.7842,0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21,386.75,14.67,17.5
0.80271,0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21,288.99,11.69,20.2

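Parameter Configuration (Training Node)

Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Feature columns: 0-12
*Label column: 13
Algorithm parameters
*alpha: 1.0

Parameter Configuration (Prediction Node)

After the training operator has trained successfully, connect its model path port to the model path port of the prediction operator. For the detailed configuration, see "Algorithm IO parameters".

Drag the test data onto the input data port of the prediction operator, then run the prediction operator to predict on the test data.
Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Output file type: csv
*Output data contains header: Yes
*Output data delimiter: comma
*Feature columns: 0-12

Prediction Node Output Data Example

The output is a CSV file whose first row contains the column names; the last column, y_pred, holds the model's predictions.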
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL,y_pred
0.00632,18.0,2.31,0,0.5379999999999999,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,24.239967863975338
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,25.91000552246885
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,30.698638520104033
0.032369999999999996,0.0,2.18,0,0.45799999999999996,6.997999999999999,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,33.905907404118786
0.06905,0.0,2.18,0,0.45799999999999996,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,33.46900730882457
0.02985,0.0,2.18,0,0.45799999999999996,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,30.640032618343703
0.08829,12.5,7.87,0,0.524,6.0120000000000005,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,22.485069589701993
0.14455,12.5,7.87,0,0.524,6.172000000000001,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,19.586108814182868
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,15.335991891293457
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,19.414747661875182
0.22489,12.5,7.87,0,0.524,6.377000000000001,94.3,6.3467,5,311,15.2,392.52,20.45,15.0,19.834233253605724
0.11747,12.5,7.87,0,0.524,6.0089999999999995,82.9,6.2267,5,311,15.2,396.9,13.27,18.9,20.659603753088412
0.09378,12.5,7.87,0,0.524,5.888999999999999,39.0,5.4509,5,311,15.2,390.5,15.71,21.7,23.33091092087145
0.62976,0.0,8.14,0,0.5379999999999999,5.949,61.8,4.7075,4,307,21.0,396.9,8.26,20.4,19.94772377520159
0.6379600000000001,0.0,8.14,0,0.5379999999999999,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2,18.861127486767927
0.62739,0.0,8.14,0,0.5379999999999999,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9,20.033473608423982
1.05393,0.0,8.14,0,0.5379999999999999,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1,22.51351600617096
0.7842,0.0,8.14,0,0.5379999999999999,5.99,81.7,4.2579,4,307,21.0,386.75,14.67,17.5,18.073751219167278
0.80271,0.0,8.14,0,0.5379999999999999,5.456,36.6,3.7965,4,307,21.0,288.99,11.69,20.2,19.960182781813913

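Gradient-Boosted Tree Regression (GBTRegressor)

Principle

Algorithm Description

GBTRegressor (gradient-boosted trees) is a commonly used classification and regression algorithm that combines many decision trees into a more powerful model. Although the name contains "regression", the model can be used for both regression and classification. Unlike random forests, gradient boosting builds trees sequentially, and each tree tries to correct the errors of the previous one. Gradient-boosted trees usually use very shallow trees (depth 1 to 5), so the model uses less memory and predicts faster.

The main idea behind gradient boosting is to combine many simple models (called weak learners in this context), such as shallow trees. Each tree can make good predictions only on part of the data, so adding more and more trees iteratively improves performance.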
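The idea of trees correcting one another can be sketched with depth-1 trees (stumps) on a single feature and squared loss. Each round fits a stump to the current residuals and adds a fraction of its output, controlled by a step size, to the running prediction. All names here are hypothetical, the data is made up, and the sketch is far simpler than the operator's actual implementation.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 tree on one feature; needs at least two distinct x values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, max_iter=20, step_size=0.1):
    """Return a predictor built from max_iter stumps fitted to successive residuals."""
    base = sum(ys) / len(ys)  # start from the mean label
    trees = []
    predictions = [base] * len(xs)
    for _ in range(max_iter):
        # Each new tree is trained on what the current ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + step_size * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: base + step_size * sum(tree(x) for tree in trees)


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.2, 3.0, 3.3]
for rounds in (1, 10, 50):
    model = fit_boosted_stumps(xs, ys, max_iter=rounds)
    print(rounds, mean_squared_error(model, xs, ys))
```

The training error falls as more stumps are added, which mirrors the statement that performance improves iteratively as trees are added.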

The implementation of this operator follows the paper: J.H. Friedman. "Stochastic Gradient Boosting." 1999.

Parameter Configuration

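Algorithm IO parameters
*Input file type: one of the following two formats:
csv: a CSV file
*Input data contains header: defaults to "Yes".
*Input data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Output file type: one of the following two formats:
csv: a CSV file
*Output data contains header: defaults to "Yes".
*Output data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Feature columns: the columns used as training features, numbered from 0. For a database table, tick the columns directly. For a plain file path, enter ranges of the form a-b, single columns c, or a mix of both, separated by ASCII commas (for example, 0-10,15,17-19 selects columns 0 to 10, column 15, and columns 17 to 19, for a total of 15 feature columns).
*Label column: the column used as the label; the feature and label values must be of type double.
Algorithm parameters
*Model save format: the model is saved in ML or PMML format under a path generated by the backend.
*checkpointInterval: sets a checkpoint every this many iterations. When the number of iterations is very large, checkpoints reduce the risk of cascading recomputation caused by failed compute nodes.
*featureSubsetStrategy: the strategy for sampling features. Supported values are auto, all, onethird, sqrt, and log2, meaning automatic, all features, one third of the features, the square root of the number of features, and the base-2 logarithm of the number of features, respectively. Under the automatic strategy, the parameter is all when maxIter is 1 and sqrt when maxIter is greater than 1.
*lossType: the loss function type; squared loss (L2) and absolute loss (L1) are supported.
*maxBins: the maximum number of bins used to discretize continuous features when computing split points. The minimum is 2.
*maxDepth: the maximum depth of each decision tree. This is a discrete positive integer; a reasonable approach is to tune it around the default value of 5.
*maxIter: the maximum number of iterations.
*minInfoGain: the minimum information gain required for a decision tree to split a node.
*minInstancesPerNode: the minimum number of samples each node of a decision tree must contain, which controls whether a node continues to split. This is a discrete positive integer with a default of 1; adjust it according to the number of samples.
*stepSize: the step size, in the range (0, 1]. It controls how strongly each tree corrects the errors of the previous tree.
*subsamplingRate: the fraction of samples drawn for training.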

Demo

Input Data Example

The first row of the example contains the column names, and the last column is the label.
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
0.21124,12.5,7.87,0,0.524,5.631,100,6.0821,5,311,15.2,386.63,29.93,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
0.09378,12.5,7.87,0,0.524,5.889,39,5.4509,5,311,15.2,390.5,15.71,21.7
0.62976,0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21,396.9,8.26,20.4
0.63796,0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21,380.02,10.26,18.2
0.62739,0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21,395.62,8.47,19.9
1.05393,0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21,386.85,6.58,23.1
0.7842,0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21,386.75,14.67,17.5
0.80271,0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21,288.99,11.69,20.2

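Parameter Configuration

Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Feature columns: 0-12
*Label column: 13
Algorithm parameters
*Model save format: PMML
*checkpointInterval: -1
*featureSubsetStrategy: auto
*lossType: L2
*maxBins: 32
*maxDepth: 5
*maxIter: 20
*minInfoGain: 0.0
*minInstancesPerNode: 1
*stepSize: 0.1
*subsamplingRate: 1

Prediction Node Parameter Configuration

After the training operator has trained successfully, connect its model path port to the model path port of the prediction operator. For the detailed configuration, see "Algorithm IO parameters".

Drag the test data onto the input data port of the prediction operator, then run the prediction operator to predict on the test data.
Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Output file type: csv
*Output data contains header: Yes
*Output data delimiter: comma
*Feature columns: 0-12
Model parameters
*algorithm: GBTRegressor
*Model import format: PMML

Output Data Example

With PMML selected as the model save format, the output is a CSV file whose first row contains the column names; the last column, prediction, holds the model's predictions.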
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL,prediction
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,23.979103297729004
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,21.61369094286721
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,34.7
0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,33.40013510798882
0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,36.20013510798882
0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,28.700135107988817
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,22.891274319702433
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,27.10013510798883
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,16.49988178050978
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,18.90013510798882
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0,15.00013510798882
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9,18.900135107988817
0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.5,15.71,21.7,21.712249790986448
0.62976,0.0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21.0,396.9,8.26,20.4,20.397485106161618
0.63796,0.0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2,18.19324464453941
0.62739,0.0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9,19.90411602389743
1.05393,0.0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1,23.10071321988321
0.7842,0.0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21.0,386.75,14.67,17.5,17.50597771089627
0.80271,0.0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21.0,288.99,11.69,20.2,20.201317406905435

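Bayesian Ridge Regression

Principle

Algorithm Description

Bayesian regression is a linear regression algorithm that mitigates the overfitting to which maximum likelihood estimation is prone. It can be viewed as a process in which sample points are added to the learner one at a time: the posterior obtained from one sample point serves as the prior for the next estimate. For the underlying theory, see the documentation. The algorithm implemented here is Bayesian ridge regression.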
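The "posterior becomes the next prior" process can be sketched for the simplest case: a single weight w, no intercept, a Gaussian prior on w, and a known noise precision. These simplifications are assumptions of this illustration; Bayesian ridge regression additionally estimates the precisions from the data, which the sketch omits. The function name and data are hypothetical.

```python
def update_posterior(prior_mean, prior_precision, x, y, noise_precision=1.0):
    """Update a Gaussian belief about w after observing one point of y = w*x + noise."""
    post_precision = prior_precision + noise_precision * x * x
    post_mean = (
        prior_precision * prior_mean + noise_precision * x * y
    ) / post_precision
    return post_mean, post_precision


# Start from the prior w ~ N(0, 1) and feed the samples one at a time.
mean, precision = 0.0, 1.0
for x, y in [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]:
    # The posterior from the previous sample is used as the prior here.
    mean, precision = update_posterior(mean, precision, x, y)
    print(round(mean, 4), precision)
```

After all three points the posterior mean is about 1.9 with precision 15, the same result as processing the three points in one batch with this prior. The unit-precision prior pulls the estimate slightly toward 0, which is the same shrinkage effect as the L2 penalty in ridge regression.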

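Parameter Configuration

Algorithm IO parameters
*Input file type: one of the following two formats:
csv: a CSV file
*Input data contains header: defaults to "Yes".
*Input data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Output file type: one of the following two formats:
csv: a CSV file
*Output data contains header: defaults to "Yes".
*Output data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Feature columns: the columns used as training features, numbered from 0. For a database table, tick the columns directly. For a plain file path, enter ranges of the form a-b, single columns c, or a mix of both, separated by ASCII commas (for example, 0-10,15,17-19 selects columns 0 to 10, column 15, and columns 17 to 19, for a total of 15 feature columns).
*Label column: the column used as the label, given as a column index or a column name.
Algorithm parameters
*alpha_1: decimal, default 0.000001; regularization coefficient 1 of the prior probability distribution.
*alpha_2: decimal, default 0.000001; regularization coefficient 2 of the prior probability distribution.
*lambda_1: decimal, default 0.000001; regularization coefficient 1 of the posterior probability distribution.
*lambda_2: decimal, default 0.000001; regularization coefficient 2 of the posterior probability distribution.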

Demo

Input Data Example

The first row of the example contains the column names, and the last column is the label.
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
0.21124,12.5,7.87,0,0.524,5.631,100,6.0821,5,311,15.2,386.63,29.93,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
0.09378,12.5,7.87,0,0.524,5.889,39,5.4509,5,311,15.2,390.5,15.71,21.7
0.62976,0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21,396.9,8.26,20.4
0.63796,0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21,380.02,10.26,18.2
0.62739,0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21,395.62,8.47,19.9
1.05393,0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21,386.85,6.58,23.1
0.7842,0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21,386.75,14.67,17.5
0.80271,0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21,288.99,11.69,20.2

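Parameter Configuration (Training Node)

Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Feature columns: 0-12
*Label column: 13
Algorithm parameters
*alpha_1: 0.000001
*alpha_2: 0.000001
*lambda_1: 0.000001
*lambda_2: 0.000001

Parameter Configuration (Prediction Node)

After the training operator has trained successfully, connect its model path port to the model path port of the prediction operator. For the detailed configuration, see "Algorithm IO parameters".

Drag the test data onto the input data port of the prediction operator, then run the prediction operator to predict on the test data.
Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Output file type: csv
*Output data contains header: Yes
*Feature columns: 0-12

Prediction Node Output Data Example

The output is a CSV file whose first row contains the column names; the last column, y_pred, holds the model's predictions.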
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL,y_pred
0.00632,18.0,2.31,0,0.5379999999999999,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,23.157761841265643
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,27.76086105312428
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,28.931050421317426
0.032369999999999996,0.0,2.18,0,0.45799999999999996,6.997999999999999,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,32.30627346481941
0.06905,0.0,2.18,0,0.45799999999999996,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,31.7666227004525
0.02985,0.0,2.18,0,0.45799999999999996,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,31.409801694787095
0.08829,12.5,7.87,0,0.524,6.0120000000000005,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,20.886114671849498
0.14455,12.5,7.87,0,0.524,6.172000000000001,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,18.86492204327373
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,18.031263358925848
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,19.30273606529038
0.22489,12.5,7.87,0,0.524,6.377000000000001,94.3,6.3467,5,311,15.2,392.52,20.45,15.0,18.827921516941007
0.11747,12.5,7.87,0,0.524,6.0089999999999995,82.9,6.2267,5,311,15.2,396.9,13.27,18.9,19.86971829646153
0.09378,12.5,7.87,0,0.524,5.888999999999999,39.0,5.4509,5,311,15.2,390.5,15.71,21.7,22.395483063351453
0.62976,0.0,8.14,0,0.5379999999999999,5.949,61.8,4.7075,4,307,21.0,396.9,8.26,20.4,21.373515011156837
0.6379600000000001,0.0,8.14,0,0.5379999999999999,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2,19.44257566101224
0.62739,0.0,8.14,0,0.5379999999999999,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9,21.66675550393098
1.05393,0.0,8.14,0,0.5379999999999999,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1,23.207481931645017
0.7842,0.0,8.14,0,0.5379999999999999,5.99,81.7,4.2579,4,307,21.0,386.75,14.67,17.5,19.66398384672165
0.80271,0.0,8.14,0,0.5379999999999999,5.456,36.6,3.7965,4,307,21.0,288.99,11.69,20.2,20.03515785367339

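Random Forest Regression (RandomForestRegressor)

Principle

Algorithm Description

RandomForestRegressor (random forest regression) supports both discrete and continuous features.

Parameter Configuration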

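Algorithm IO parameters
*Input file type: one of the following two formats:
csv: a CSV file
*Input data contains header: defaults to "Yes".
*Input data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Output file type: one of the following two formats:
csv: a CSV file
*Output data contains header: defaults to "Yes".
*Output data delimiter: the field delimiter; defaults to a comma, and other delimiters can be chosen from the drop-down list.
parquet: the Parquet columnar storage format
*Feature columns: the columns used as training features, numbered from 0. For a database table, tick the columns directly. For a plain file path, enter ranges of the form a-b, single columns c, or a mix of both, separated by ASCII commas (for example, 0-10,15,17-19 selects columns 0 to 10, column 15, and columns 17 to 19, for a total of 15 feature columns).
*Label column: the column used as the label; the feature and label values must be of type double.
Algorithm parameters
*Model save format: the model is saved in ML or PMML format under a path generated by the backend.
*checkpointInterval: sets a checkpoint every this many iterations. When the number of iterations is very large, checkpoints reduce the risk of cascading recomputation caused by failed compute nodes.
*featureSubsetStrategy: the strategy for sampling features. Supported values are auto, all, onethird, sqrt, and log2, meaning automatic, all features, one third of the features, the square root of the number of features, and the base-2 logarithm of the number of features, respectively. Under the automatic strategy, the parameter is all when numTrees is 1 and sqrt when numTrees is greater than 1.
*maxBins: the maximum number of bins used to discretize continuous features when computing split points. The minimum is 2.
*maxDepth: the maximum depth of each decision tree. This is a discrete positive integer; a reasonable approach is to tune it around the default value of 5.
*minInfoGain: the minimum information gain required for a decision tree to split a node.
*minInstancesPerNode: the minimum number of samples each node of a decision tree must contain, which controls whether a node continues to split. This is a discrete positive integer with a default of 1; adjust it according to the number of samples.
*subsamplingRate: the fraction of samples drawn for training.
*numTrees: the number of trees.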
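The featureSubsetStrategy options can be read as a rule that maps the total number of features to the number of features considered at each split. The sketch below follows the rule as described above, including the auto behavior stated in this document. Rounding fractional results up is an assumption of this illustration, and the function name is hypothetical; the underlying engine may resolve these values differently.

```python
import math


def features_per_split(strategy, num_features, num_trees):
    """Number of candidate features per split under the described strategies."""
    if strategy == "auto":
        # As described above: all for a single tree, sqrt for a forest.
        strategy = "all" if num_trees == 1 else "sqrt"
    if strategy == "all":
        return num_features
    if strategy == "onethird":
        return math.ceil(num_features / 3)  # rounding up is an assumption
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    raise ValueError("unknown strategy: " + strategy)


# The demo on this page uses 13 feature columns and numTrees=20.
for name in ("auto", "all", "onethird", "sqrt", "log2"):
    print(name, features_per_split(name, 13, 20))
```

Under these assumptions, 13 features give 4 candidates per split for auto, sqrt, and log2, 5 for onethird, and 13 for all.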

Demo

Input Data Example

The first row of the example contains the column names, and the last column is the label.
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
0.21124,12.5,7.87,0,0.524,5.631,100,6.0821,5,311,15.2,386.63,29.93,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
0.09378,12.5,7.87,0,0.524,5.889,39,5.4509,5,311,15.2,390.5,15.71,21.7
0.62976,0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21,396.9,8.26,20.4
0.63796,0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21,380.02,10.26,18.2
0.62739,0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21,395.62,8.47,19.9
1.05393,0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21,386.85,6.58,23.1
0.7842,0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21,386.75,14.67,17.5
0.80271,0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21,288.99,11.69,20.2

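Parameter Configuration

Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Feature columns: 0-12
*Label column: 13
Algorithm parameters
*Model save format: PMML
*checkpointInterval: -1
*featureSubsetStrategy: auto
*maxBins: 32
*maxDepth: 5
*minInfoGain: 0.0
*minInstancesPerNode: 1
*subsamplingRate: 1.0
*numTrees: 20

Prediction Node Parameter Configuration

After the training operator has trained successfully, connect its model path port to the model path port of the prediction operator. For the detailed configuration, see "Algorithm IO parameters".

Drag the test data onto the input data port of the prediction operator, then run the prediction operator to predict on the test data.
Algorithm IO parameters
*Input file type: csv
*Input data contains header: Yes
*Input data delimiter: comma
*Output file type: csv
*Output data contains header: Yes
*Output data delimiter: comma
*Feature columns: 0-12
Model parameters
*algorithm: RandomForestRegressor
*Model import format: PMML

Output Data Example

With PMML selected as the model save format, the output is a CSV file whose first row contains the column names; the last column, prediction, holds the model's predictions.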
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,LABEL,prediction
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,25.65075
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,22.211833333333335
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,32.130250000000004
0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,33.26499999999999
0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,33.1
0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,29.94785714285714
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,22.33244047619047
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,23.972857142857144
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,18.07785714285714
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,19.020857142857135
0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0,19.022857142857145
0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9,19.94869047619047
0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.5,15.71,21.7,21.594083333333327
0.62976,0.0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21.0,396.9,8.26,20.4,20.628940476190472
0.63796,0.0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2,18.628357142857137
0.62739,0.0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9,20.89958333333333
1.05393,0.0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1,21.84275
0.7842,0.0,8.14,0,0.538,5.99,81.7,4.2579,4,307,21.0,386.75,14.67,17.5,18.45585714285714
0.80271,0.0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21.0,288.99,11.69,20.2,19.974999999999998