Rescale each feature individually and linearly to a common range [min, max] using column summary statistics; this is also known as min-max normalization or rescaling. The rescaled value for a feature E is calculated as:
Rescaled(e_i) = (e_i - E_{min}) / (E_{max} - E_{min}) * (max - min) + min
For the case E_{max} == E_{min}, Rescaled(e_i) = 0.5 * (max + min).
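The formula can be checked with a small standalone sketch (plain Scala, no Spark; `rescale`, `targetMin`, and `targetMax` are illustrative names, not Spark API):

```scala
// Min-max rescaling of one feature column, including the degenerate
// case where all values are equal (E_max == E_min).
def rescale(values: Seq[Double], targetMin: Double, targetMax: Double): Seq[Double] = {
  val eMin = values.min
  val eMax = values.max
  if (eMax == eMin) {
    // Constant column: every value maps to the midpoint of the target range.
    values.map(_ => 0.5 * (targetMax + targetMin))
  } else {
    values.map(e => (e - eMin) / (eMax - eMin) * (targetMax - targetMin) + targetMin)
  }
}

println(rescale(Seq(0.0, 5.0, 10.0), 0.0, 1.0))  // List(0.0, 0.5, 1.0)
println(rescale(Seq(3.0, 3.0), 0.0, 1.0))        // List(0.5, 0.5)
```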
Note: since zero values will probably be transformed to non-zero values, the output of the transformer will be a DenseVector even for sparse input.
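A minimal end-to-end usage sketch (assumes a local SparkSession; the data and column names are illustrative):

```scala
// Fit a MinMaxScaler on a small DataFrame and transform it.
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("minmax").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, Vectors.sparse(3, Array(1), Array(1.0))),  // sparse input row
  (1, Vectors.dense(2.0, 1.0, 1.0)),
  (2, Vectors.dense(4.0, 10.0, 2.0))
).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")  // default target range is [0.0, 1.0]

val model = scaler.fit(df)         // computes the per-column min/max vectors
model.transform(df).show(false)    // scaled rows come back as DenseVector, even for sparse input
```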
Core code: the fit method mainly computes the per-column max and min values:
override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
  transformSchema(dataset.schema, logging = true)
  val Row(max: Vector, min: Vector) = dataset
    .select(Summarizer.metrics("max", "min").summary(col($(inputCol))).as("summary"))
    .select("summary.max", "summary.min")
    .first()
  copyValues(new MinMaxScalerModel(uid, min.compressed, max.compressed).setParent(this))
}
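The Summarizer call used inside fit can also be tried on its own (a sketch; assumes a SparkSession named `spark` with `spark.implicits._` imported):

```scala
// Compute per-column max/min of a Vector column with ml.stat.Summarizer,
// the same way MinMaxScaler.fit does internally.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

val data = Seq(
  Tuple1(Vectors.dense(1.0, 2.0)),
  Tuple1(Vectors.dense(3.0, -1.0))
).toDF("features")

val Row(max: Vector, min: Vector) = data
  .select(Summarizer.metrics("max", "min").summary(col("features")).as("summary"))
  .select("summary.max", "summary.min")
  .first()
// max is the element-wise maximum [3.0, 2.0]; min is [1.0, -1.0]
```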
Note: the computation above only supports input in Vector form. How, then, do we handle a column that holds a single scalar value per row?
val df_num = spark.createDataFrame(Seq(
  (0, 0.5, -1.0),
  (1, 1.0, 1.0),
  (2, 10.0, 2.0),
  (3, 10.0, 0.0)
)).toDF("id", "features", "result")
df_num.show()

// Compute a scalar statistic (here the mean) and wrap it in a Vector by hand:
val temp_mean = df_num.select(functions.mean(df_num.col("features"))).collect()(0)
println(temp_mean.getDouble(0))
val Row(mean2: Vector) = Row(Vectors.dense(temp_mean.getDouble(0)))
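Alternatively, for a plain Double column the rescaling formula can be applied directly with SQL functions, without wrapping anything in a Vector. A sketch under the same `df_num` as above (this is not Spark's built-in API; `features_scaled`, `fMin`, and `fMax` are illustrative names):

```scala
// Min-max scale the scalar "features" column to [0, 1] using column aggregates.
import org.apache.spark.sql.functions

val Array(fMin, fMax) = df_num
  .select(functions.min("features"), functions.max("features"))
  .first()
  .toSeq.map(_.asInstanceOf[Double]).toArray

val scaled = if (fMax == fMin) {
  // Constant column: midpoint of the default [0, 1] target range.
  df_num.withColumn("features_scaled", functions.lit(0.5))
} else {
  df_num.withColumn(
    "features_scaled",
    (functions.col("features") - fMin) / (fMax - fMin))
}
scaled.show()
```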