def balanceDataset(dataset: DataFrame, label: String = "label"): DataFrame = {
  // Re-balancing (weighting) of records to be used in the logistic loss objective function
  val (datasetSize, positives) = dataset.select(count("*"), sum(dataset(label))).as[(Long, Double)].collect.head
  val balancingRatio = positives / datasetSize

  val weightedDataset = {
    dataset.withColumn("classWeightCol", when(dataset(label) === 0.0, balancingRatio).otherwise(1.0 - balancingRatio))
  }
  weightedDataset
}

我们创建分类器，就像他说的那样：

new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")

票数 4

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/33372838

复制

相似问题

问Spark MLlib中不平衡数据集的处理
EN

回答 1

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Spark MLlib中不平衡数据集的处理EN

回答 1

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Spark MLlib中不平衡数据集的处理
EN