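I need to show the top 5 states and cities by total number of reviews (review_count in the schema shown below). The description of my DataFrame (loaded from a JSON file) is given below.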
+-------------+--------------------+-------+
| col_name| data_type|comment|
+-------------+--------------------+-------+
| attributes|struct<Accepts Cr...| null|
| city| string| null|
|neighborhoods| array<string>| null|
| open| boolean| null|
| review_count| bigint| null|
| stars| double| null|
| state| string| null|
| type| string| null|
+-------------+--------------------+-------+
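I tried something like ORDER BY, but it did not work. I eventually came across window functions.
In the code I wrote, the review_count values are not the same as the values in the JSON file.
The code I tried is: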
val topcity=spark.sql("select city,state,review_count,RANK() OVER (ORDER BY review_count desc ) AS RANKING from yelp").show(5)
This is the output I get:
+-------------+-----+------------+-------+
| city|state|review_count|RANKING|
+-------------+-----+------------+-------+
| Pittsburgh| PA| 3| 1|
| Carnegie| PA| 3| 2|
| Carnegie| PA| 3| 3|
| Carnegie| PA| 3| 4|
| Pittsburgh| PA| 3| 5|
+-------------+-----+------------+-------+
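So my review count is just the constant value 3. My question is therefore:
Posted on 2019-03-26 03:12:50
Here is an implementation, assuming you are looking for the total number of reviews for each state/city combination (hopefully I understood correctly):
First, we generate some dummy data as follows: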
cities_data = [
["Alameda", "California", 1],
["Alameda", "California", 3],
["Berkeley", "California", 2],
["Beverly Hills", "California", 2],
["Beverly Hills", "California", 3],
["Hollywood", "California", 4],
["Miami", "Florida", 3],
["Miami", "Florida", 2],
["Orlando", "Florida", 1],
["Cocoa Beach", "Florida", 1]]
cols = ["city", "state", "review_count"]
df = spark.createDataFrame(cities_data, cols)
df.show(10, False)
This prints the following:
+-------------+----------+------------+
|city |state |review_count|
+-------------+----------+------------+
|Alameda |California|1 |
|Alameda |California|3 |
|Berkeley |California|2 |
|Beverly Hills|California|2 |
|Beverly Hills|California|3 |
|Hollywood |California|4 |
|Miami |Florida |3 |
|Miami |Florida |2 |
|Orlando |Florida |1 |
|Cocoa Beach |Florida |1 |
+-------------+----------+------------+
The data is then grouped by state and city to get the sum of review_count for each group. This is written in PySpark, but it should be easy to port to Scala:
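from pyspark.sql import functions as F

# Sum review_count per (state, city), then keep the 5 largest totals
df = df.groupBy("state", "city") \
    .agg(F.sum("review_count").alias("reviews_count")) \
    .orderBy(F.desc("reviews_count")) \
    .limit(5)
df.show(5, False)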
This should be the output for the scenario above:
+----------+-------------+-------------+
|state |city |reviews_count|
+----------+-------------+-------------+
|California|Beverly Hills|5 |
|Florida |Miami |5 |
|California|Alameda |4 |
|California|Hollywood |4 |
|California|Berkeley |2 |
+----------+-------------+-------------+
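Since the question uses spark.sql, the same group-then-sort logic can also be written as a single SQL query. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in so the query can be checked on the dummy data without a Spark session. The table name yelp comes from the question; the name TOP5_QUERY and the extra state, city tie-breakers in ORDER BY are my own additions for illustration.

```python
import sqlite3

# Same dummy rows as above: (city, state, review_count).
cities_data = [
    ("Alameda", "California", 1),
    ("Alameda", "California", 3),
    ("Berkeley", "California", 2),
    ("Beverly Hills", "California", 2),
    ("Beverly Hills", "California", 3),
    ("Hollywood", "California", 4),
    ("Miami", "Florida", 3),
    ("Miami", "Florida", 2),
    ("Orlando", "Florida", 1),
    ("Cocoa Beach", "Florida", 1),
]

# Aggregate per (state, city) first, then sort the totals.
# state and city are added to ORDER BY only so that ties come out
# in a deterministic order (an assumption, not part of the question).
TOP5_QUERY = """
SELECT state, city, SUM(review_count) AS reviews_count
FROM yelp
GROUP BY state, city
ORDER BY reviews_count DESC, state, city
LIMIT 5
"""

# In-memory SQLite table standing in for the Spark temp view "yelp".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yelp (city TEXT, state TEXT, review_count INTEGER)")
conn.executemany("INSERT INTO yelp VALUES (?, ?, ?)", cities_data)
top5 = conn.execute(TOP5_QUERY).fetchall()
conn.close()

for row in top5:
    print(row)
```

With Spark, assuming the DataFrame has been registered with df.createOrReplaceTempView("yelp"), passing the same query string to spark.sql(TOP5_QUERY).show() should give the equivalent result. The key difference from the query in the question is the GROUP BY: RANK() OVER (ORDER BY review_count DESC) ranks individual business rows, so it never adds up the reviews per city.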
https://stackoverflow.com/questions/55338143