First, I would turn your RDD into a Dataset:

<pre><code>val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._

val testDs = test.toDS()
</code></pre>

Here you get your column names :) Use them wisely!

<pre><code>testDs.schema.fields.foreach(x =&gt; println(x))
</code></pre>

In the end you only need a groupBy, using the column names you just printed in place of the placeholders below:

<pre><code>testDs.groupBy("City?", "Name?")
</code></pre>
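
The three steps above can be put together roughly as follows. This is only a sketch, assuming the input is the `Seq` of (city, name) tuples from the question and a local `SparkSession`. A Dataset of tuples gets the default column names `_1` and `_2`, so the sketch renames them with `toDF` before grouping. The object name `GroupTuples`, the helper `namesByCity`, and the column names `city`, `name` and `names` are my own choices for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.collect_list

object GroupTuples {
  // Groups (city, name) pairs by city and collects the names for each city.
  def namesByCity(spark: SparkSession, test: Seq[(String, String)]): DataFrame = {
    import spark.implicits._
    val testDs = test.toDS()               // tuple columns default to _1 and _2
    testDs.schema.fields.foreach(println)  // prints one StructField per column
    testDs.toDF("city", "name")            // assumed, more readable column names
      .groupBy("city")
      .agg(collect_list("name").as("names"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("group-tuples").getOrCreate()
    val test = Seq(("New York", "Jack"), ("Chicago", "David"), ("Chicago", "Andrew"))
    namesByCity(spark, test).show(false)
    spark.stop()
  }
}
```

The order of the rows, and of the names inside each collected list, is not guaranteed.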

I don't think RDDs are really the idiomatic way to do this in Spark 2.0.
If you have any questions, please just ask.

I suggest you start by creating a <code>case class</code>:

<pre><code>case class Monkey(city: String, firstName: String)
</code></pre>

This <code>case class</code> should be defined outside the main class. Then you can call <code>toDS</code>, followed by <code>groupBy</code> and the aggregation function <code>collect_list</code>, as below:

<pre><code>import sqlContext.implicits._
import org.apache.spark.sql.functions._

val test = Seq(("New York", "Jack"),
 ("Los Angeles", "Tom"),
 ("Chicago", "David"),
 ("Houston", "John"),
 ("Detroit", "Michael"),
 ("Chicago", "Andrew"),
 ("Detroit", "Peter"),
 ("Detroit", "George")
)
sc.parallelize(test)
 .map(row =&gt; Monkey(row._1, row._2))
 .toDS()
 .groupBy("city")
 .agg(collect_list("firstName") as "list")
 .show(false)
</code></pre>

You will get output like this:

<pre><code>+-----------+------------------------+
|city       |list                    |
+-----------+------------------------+
|Los Angeles|[Tom]                   |
|Detroit    |[Michael, Peter, George]|
|Chicago    |[David, Andrew]         |
|Houston    |[John]                  |
|New York   |[Jack]                  |
+-----------+------------------------+
</code></pre>

You can always convert back to an <code>RDD</code> by calling the <code>.rdd</code> function.
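
As a rough illustration of that last step, here is a small helper that turns the grouped result back into an RDD of pairs shaped like the output in the question. It assumes the grouped `DataFrame` has a string column first and the collected list second, as in the output table above. The names `BackToRdd` and `toPairs` are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object BackToRdd {
  // Converts a grouped DataFrame (column 0: city, column 1: list of names)
  // into an RDD of (city, names) pairs.
  def toPairs(grouped: DataFrame): RDD[(String, List[String])] =
    grouped.rdd.map { row: Row =>
      (row.getString(0), row.getSeq[String](1).toList)
    }
}
```

Calling `BackToRdd.toPairs(grouped).foreach(println)` on the result of the `groupBy(...).agg(...)` chain (without the final `show`) prints one tuple per city.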

To create a Dataset, first define a case class outside your class:

<pre><code>case class Employee(city: String, name: String)
</code></pre>

Then you can convert the list to a Dataset:

<pre><code>import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

// Build a DataFrame whose column names match the fields of Employee
val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
).toDF("city", "name")

// Convert the untyped DataFrame into a typed Dataset[Employee]
val data = test.as[Employee]
</code></pre>

Or map the tuples to the case class directly:

<pre><code>import spark.implicits._

val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
)

val data = test.map(r =&gt; Employee(r._1, r._2)).toDS()
</code></pre>

Now you can <code>groupBy</code> and perform any aggregation:

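<pre><code>import org.apache.spark.sql.functions._ // provides collect_list

// Number of rows per city
data.groupBy("city").count().show()

// All names per city, collected into a list
data.groupBy("city").agg(collect_list("name")).show()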
</code></pre>
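
If you would rather stay in the typed API and get back a `Dataset` of pairs, close to what `groupByKey().mapValues(_.toList)` gives on an RDD, one option is `groupByKey` followed by `mapGroups`. This is only a sketch under my own assumptions: the object name `TypedGroupBy` and the helper `namesByCity` are made up for illustration, and the `Employee` case class is repeated so the snippet compiles on its own.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Repeated from above so that this sketch is self-contained.
case class Employee(city: String, name: String)

object TypedGroupBy {
  // Groups employees by city and returns (city, names) pairs as a typed Dataset.
  def namesByCity(spark: SparkSession, data: Dataset[Employee]): Dataset[(String, Seq[String])] = {
    import spark.implicits._
    data.groupByKey(_.city).mapGroups { (city, rows) =>
      // rows is an Iterator over one group; materialize it before returning
      val names: Seq[String] = rows.map(_.name).toList
      (city, names)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-group-by").getOrCreate()
    import spark.implicits._
    val data = Seq(Employee("Chicago", "David"), Employee("Chicago", "Andrew"),
      Employee("Houston", "John")).toDS()
    namesByCity(spark, data).show(false)
    spark.stop()
  }
}
```

Note that `mapGroups` does not do partial aggregation, so all rows are shuffled. For plain aggregations such as counts or `collect_list`, the `groupBy(...).agg(...)` form shown above is usually the better fit.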

Hope this helps!

I have a requirement that I currently solve with an RDD:
<pre><code>val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
)
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)
</code></pre>
The result is:
<blockquote>
(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))
</blockquote>
How can I do this with a Dataset in Spark 2.0?
I have a way that uses a custom function, but it feels too complicated. Is there a simpler method?

How to use groupBy with a Dataset
