pyspark-ml study notes: parameter reference for logistic regression, GBDT, and XGBoost

Author: MachineLP · Published 2019-08-29 11:40:33 · From the column 小鹏的专栏

For logistic regression and GBDT, refer to the pyspark API docs: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

For XGBoost, see: https://xgboost.ai

Only the classification parameters are listed below; for regression, consult the documentation yourself.

Logistic regression:

featuresCol = 'features'
labelCol = 'label'
predictionCol = 'prediction'
# max number of iterations (>= 0)
maxIter = 100
# regularization parameter (>= 0)
regParam = 0.0
# the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty; for alpha = 1, it is an L1 penalty.
elasticNetParam = 0.0
# the convergence tolerance for iterative algorithms (>= 0)
tol = 1e-06
# whether to fit an intercept term
fitIntercept = True
# threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match: if threshold is p, then thresholds must equal [1-p, p].
threshold = 0.5
# thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.
thresholds = None
probabilityCol = 'probability'
rawPredictionCol = 'rawPrediction'
# whether to standardize the training features before fitting the model
standardization = True
# weight column name. If this is not set or empty, all instance weights are treated as 1.0.
weightCol = None
# suggested depth for treeAggregate (>= 2)
aggregationDepth = 2
# the name of the family describing the label distribution to be used in the model. Supported options: auto, binomial, multinomial
family = 'auto'
# lower/upper bounds on coefficients if fitting under bound-constrained optimization. The bound matrix must be compatible with shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression.
lowerBoundsOnCoefficients = None
upperBoundsOnCoefficients = None
# lower/upper bounds on intercepts if fitting under bound-constrained optimization. The bound vector size must equal 1 for binomial regression, or the number of classes for multinomial regression.
lowerBoundsOnIntercepts = None
upperBoundsOnIntercepts = None

GBDT:

featuresCol = 'features'
labelCol = 'label'
predictionCol = 'prediction'
# maximum depth of the tree (>= 0). E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
maxDepth = 5
# max number of bins for discretizing continuous features. Must be >= 2 and >= number of categories for any categorical feature.
maxBins = 32
# minimum number of instances each child must have after a split. If a split causes the left or right child to have fewer than minInstancesPerNode instances, the split is discarded as invalid. Should be >= 1.
minInstancesPerNode = 1
# minimum information gain for a split to be considered at a tree node
minInfoGain = 0.0
# maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size.
maxMemoryInMB = 256
# if false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often the cache should be checkpointed, or disable it, via checkpointInterval.
cacheNodeIds = False
# set checkpoint interval (>= 1) or disable checkpointing (-1). E.g. 10 means the cache gets checkpointed every 10 iterations. Note: this setting is ignored if the checkpoint directory is not set in the SparkContext.
checkpointInterval = 10
# loss function which GBT tries to minimize (case-insensitive). Supported options: logistic
lossType = 'logistic'
# max number of iterations (>= 0)
maxIter = 20
# step size used for each iteration of optimization (>= 0)
stepSize = 0.1
# random seed
seed = None
# fraction of the training data used for learning each decision tree, in range (0, 1]
subsamplingRate = 1.0
# the number of features to consider for splits at each tree node. Supported options: 'auto' (chosen automatically for the task: if numTrees == 1, 'all'; if numTrees > 1 (forest), 'sqrt' for classification and 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features; when n is in the range (1, number of features), use n features). Default: 'auto'.
featureSubsetStrategy = 'all'

XGBoost:

featuresCol = "features"
labelCol = "label"
predictionCol = "prediction"
weightCol = "weight"
checkpointInterval = -1
missing = None
# number of threads XGBoost uses at runtime. Defaults to the maximum number of threads available on the current system.
nthread = 1
nworkers = 1
# 0 prints runtime messages; 1 runs silently without printing them. Default: 0.
silent = 0
use_external_memory = False
base_score = 0.5
# two boosters are available: gbtree and gblinear. gbtree boosts with tree-based models, gblinear with linear models. Default: gbtree.
booster = "gbtree"
'''
Evaluation metric(s) for validation data; a default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking).
Users can add multiple evaluation metrics. Python users should pass the parameter pairs as a list, not a map, so that later entries do not overwrite 'eval_metric'.
The choices are listed below:
"rmse": root mean square error
"logloss": negative log-likelihood
"error": binary classification error rate, calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation regards instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
"merror": multiclass classification error rate, calculated as #(wrong cases)/#(all cases).
"mlogloss": multiclass logloss
"auc": area under the curve for ranking evaluation.
"ndcg": Normalized Discounted Cumulative Gain
"map": mean average precision
"ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
"ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. Adding "-" to the metric name makes XGBoost evaluate these scores as 0, for consistency under some conditions.
"gamma-deviance": residual deviance for gamma regression
'''
eval_metric = "error"
num_class = 2
num_round = 2
'''
"reg:linear" - linear regression.
"reg:logistic" - logistic regression.
"binary:logistic" - logistic regression for binary classification; outputs probabilities.
"binary:logitraw" - logistic regression for binary classification; outputs the raw score w^T x before the logistic transformation.
"count:poisson" - Poisson regression for count data; outputs the mean of a Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
"multi:softmax" - multiclass classification with the softmax objective; also requires setting num_class (the number of classes).
"multi:softprob" - same as softmax, but outputs a vector of ndata * nclass values, which can be reshaped into an ndata x nclass matrix. Each row gives the predicted probability of the sample belonging to each class.
"rank:pairwise" - set XGBoost to do a ranking task by minimizing the pairwise loss.
'''
objective = "binary:logistic"
seed = None
alpha = 0.0
# subsample ratio of features (columns) when constructing each tree. Default: 1. Range: (0, 1].
colsample_bytree = 1.0
colsample_bylevel = 1.0
# shrinkage step size used in updates to prevent overfitting. After each boosting step the weights of new features can be obtained directly; eta shrinks the feature weights to make the boosting process more conservative. Default: 0.3. Range: [0, 1].
eta = 0.3
gamma = 0.0
grow_policy = 'depthwise'
max_bin = 256
max_delta_step = 0.0
# maximum depth of a tree. Default: 6. Range: [1, ∞].
max_depth = 6
# minimum sum of instance weights needed in a child node. If the tree-partition step produces a leaf node whose instance-weight sum is below min_child_weight, the partitioning stops. In linear regression mode this simply corresponds to the minimum number of instances needed per node. The larger the value, the more conservative the algorithm. Range: [0, ∞].
min_child_weight = 1.0
reg_lambda = 0.0
scale_pos_weight = 1.0
sketch_eps = 0.03
# subsample ratio of the training instances. Setting it to 0.5 means XGBoost randomly samples 50% of the training data to grow each tree, which helps prevent overfitting. Range: (0, 1].
subsample = 1.0
tree_method = "auto"
Originally published 2019-08-13 on the author's personal site/blog, shared via the Tencent Cloud self-media sync program.
