I have a large DataFrame, 7 million rows long, and I need to add a column that counts how many times a person (identified by an Integer) has appeared before, for example:
| Reg | randomdata   |
| --- | ------------ |
| 123 | yadayadayada |
| 246 | yedayedayeda |
| 123 | yadeyadeyade |
| 369 | adayeadayead |
| 123 | yadyadyadyad |
goes to ->
| Reg | randomdata   | count |
| --- | ------------ | ----- |
| 123 | yadayadayada | 1     |
| 246 | yedayedayeda | 1     |
| 123 | yadeyadeyade | 2     |
| 369 | adayeadayead | 1     |
| 123 | yadyadyadyad | 3     |
I have already done a groupBy to find out how many times each one repeats, but I need this count in a machine-learning exercise, so that I can get the probability of a repetition based on the number of previous occurrences.
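For reference, a minimal sketch of the running count being asked for, written with the Spark DataFrame API; df is a placeholder for the questioner's frame, and since the question names no ordering column, monotonically_increasing_id() is assumed here just to freeze the current row order (a real timestamp column would be safer):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Assumption: df has columns Reg and randomdata; "ord" freezes the current row order
val counted = df
  .withColumn("ord", monotonically_increasing_id())
  .withColumn("count", row_number().over(Window.partitionBy("Reg").orderBy("ord")))
  .drop("ord")
counted.show()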
Posted on 2018-09-05 02:10:00
Below we assume that randomness can mean the same random value occurring more than once, and use Spark SQL with a temp view, but it could also be done with a DataFrame select:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window._
import spark.implicits._ // needed for .toDS() outside the spark shell

case class xyz(k: Int, v: String)

val ds = Seq(
  xyz(1, "917799423934"),
  xyz(2, "019331224595"),
  xyz(3, "8981251522"),
  xyz(3, "8981251522"),
  xyz(4, "8981251522"),
  xyz(1, "8981251522"),
  xyz(1, "uuu4553")).toDS()

ds.createOrReplaceTempView("XYZ")

// The inner query fixes an arbitrary global row order; the outer dense_rank
// then numbers each occurrence within its own k partition.
spark.sql("""select z.k, z.v, dense_rank() over (partition by z.k order by z.seq) as seq
             from (select k, v, row_number() over (order by k) as seq from XYZ) z""").show
which returns:
+---+------------+---+
| k| v|seq|
+---+------------+---+
| 1|917799423934| 1|
| 1| 8981251522| 2|
| 1| uuu4553| 3|
| 2|019331224595| 1|
| 3| 8981251522| 1|
| 3| 8981251522| 2|
| 4| 8981251522| 1|
+---+------------+---+
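The remark above notes that a DataFrame select works too; as a sketch, the same two-step ranking expressed with the DataFrame API (same ds as above):

import org.apache.spark.sql.functions.{dense_rank, row_number}
import org.apache.spark.sql.expressions.Window

// Same logic as the SQL: a global row_number first, then dense_rank per k
ds.withColumn("seq", row_number().over(Window.orderBy("k")))
  .withColumn("seq", dense_rank().over(Window.partitionBy("k").orderBy("seq")))
  .show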
Posted on 2018-09-04 23:56:22
You can do it like this:
import org.apache.spark.sql.functions._

// UDF that returns the length of the collected list
def countrds = udf((rds: Seq[String]) => rds.length)

val df2 = df1.groupBy(col("Reg")).agg(collect_list(col("randomdata")).alias("rds"))
  .withColumn("count", countrds(col("rds")))
// note: after the groupBy, df2 has columns Reg, rds and count (randomdata was collected into rds)
df2.select(col("Reg"), col("rds"), col("count")).show()
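Note that this "count" is the total number of occurrences per Reg, repeated for every collected value, not the running count in the question's example. As a hypothetical follow-up (not part of the answer as posted), the list can be spread back out with per-occurrence indices via posexplode, bearing in mind that collect_list gives no ordering guarantee:

import org.apache.spark.sql.functions.{col, posexplode}

// idx is the 0-based position inside the collected list, so idx + 1 plays
// the role of the running count, assuming list order reflects arrival order
val df3 = df2.select(col("Reg"), posexplode(col("rds")).as(Seq("idx", "randomdata")))
  .withColumn("count", col("idx") + 1)
  .drop("idx")
df3.show()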
https://stackoverflow.com/questions/52174847