<pre><code>&gt;df1.show()
+-----+--------------------+--------+----------+-----------+
|floor|           timestamp|     uid|         x|          y|
+-----+--------------------+--------+----------+-----------+
|    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
|    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
|    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
|    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|

&gt;row1 = df1.agg({"x": "max"}).collect()[0]
&gt;print(row1)
Row(max(x)=110.33613)
&gt;print(row1["max(x)"])
110.33613
</code></pre>

This is almost the same as Method 3, but it seems the <code>asDict()</code> call in Method 3 can be removed.

In case someone wonders how to do this in Scala (with Spark 2.0+), here you go:

<pre><code>scala&gt; df.createOrReplaceTempView("TEMP_DF")
scala&gt; val myMax = spark.sql("SELECT MAX(x) as maxval FROM TEMP_DF").
 collect()(0).getInt(0)
scala&gt; print(myMax)
117
</code></pre>

The max value of a particular column of a DataFrame can be obtained with:

<code>your_max_value = df.agg({"your-column": "max"}).collect()[0][0]
</code>

Remark: Spark is intended for big data and distributed computing. The example DataFrame is very small, so the ranking of the methods on real-life data may differ from the ranking on this small example.
Slowest: Method_1, because <code>.describe(&quot;A&quot;)</code> calculates min, max, mean, stddev, and count (5 calculations over the whole column).
Medium: Method_4, because <code>.rdd</code> (the DataFrame-to-RDD conversion) slows down the process.
Faster: Method_3 ~ Method_2 ~ Method_5, because their logic is very similar, so Spark's Catalyst optimizer produces very similar plans with a minimal number of operations (get the max of one column, collect a single-value DataFrame). <code>.asDict()</code> adds a little extra time in Methods 2 and 3 compared with Method 5.
<pre><code>import numpy as np  # only needed for the commented-out larger DataFrame
import pandas as pd
import time

time_dict = {}

dfff = self.spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], [&quot;A&quot;, &quot;B&quot;])
#-- For bigger/realistic dataframe just uncomment the following 3 lines
#lst = list(np.random.normal(0.0, 100.0, 100000))
#pdf = pd.DataFrame({'A': lst, 'B': lst, 'C': lst, 'D': lst})
#dfff = self.sqlContext.createDataFrame(pdf)

tic1 = int(round(time.time() * 1000))
# Method 1: Use describe()
max_val = float(dfff.describe(&quot;A&quot;).filter(&quot;summary = 'max'&quot;).select(&quot;A&quot;).collect()[0].asDict()['A'])
tac1 = int(round(time.time() * 1000))
time_dict['m1']= tac1 - tic1
print (max_val)

tic2 = int(round(time.time() * 1000))
# Method 2: Use SQL
dfff.registerTempTable(&quot;df_table&quot;)
max_val = self.sqlContext.sql(&quot;SELECT MAX(A) as maxval FROM df_table&quot;).collect()[0].asDict()['maxval']
tac2 = int(round(time.time() * 1000))
time_dict['m2']= tac2 - tic2
print (max_val)

tic3 = int(round(time.time() * 1000))
# Method 3: Use groupby()
max_val = dfff.groupby().max('A').collect()[0].asDict()['max(A)']
tac3 = int(round(time.time() * 1000))
time_dict['m3']= tac3 - tic3
print (max_val)

tic4 = int(round(time.time() * 1000))
# Method 4: Convert to RDD
max_val = dfff.select(&quot;A&quot;).rdd.max()[0]
tac4 = int(round(time.time() * 1000))
time_dict['m4']= tac4 - tic4
print (max_val)

tic5 = int(round(time.time() * 1000))
# Method 5: Use agg()
max_val = dfff.agg({&quot;A&quot;: &quot;max&quot;}).collect()[0][0]
tac5 = int(round(time.time() * 1000))
time_dict['m5']= tac5 - tic5
print (max_val)

print(time_dict)
</code></pre>
Result on an edge-node of a cluster in milliseconds (ms):
small DF (ms): <code>{'m1': 7096, 'm2': 205, 'm3': 165, 'm4': 211, 'm5': 180}</code>
bigger DF (ms): <code>{'m1': 10260, 'm2': 452, 'm3': 465, 'm4': 916, 'm5': 373}</code>
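
Single-shot timings like these can be noisy and may include one-off warm-up costs. As a rough sketch (not part of the original benchmark), a small helper could run each method several times and report the median. The names <code>time_call</code> and <code>compare</code> are hypothetical, and the workloads below are plain-Python stand-ins so the snippet runs without Spark; with Spark you would pass lambdas wrapping the five methods instead.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Return the median wall-clock time of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare(methods, repeats=5):
    """Time every callable in `methods` and return {name: median_ms}."""
    return {name: time_call(fn, repeats) for name, fn in methods.items()}


if __name__ == "__main__":
    # Stand-in workloads (assumption). With Spark these would be lambdas such as
    # lambda: dfff.agg({"A": "max"}).collect()[0][0]
    data = list(range(100_000))
    results = compare({
        "builtin_max": lambda: max(data),
        "sorted_last": lambda: sorted(data)[-1],
    })
    # Print the methods from fastest to slowest.
    for name, ms in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name}: {ms:.3f} ms")
```

Using the median rather than a single run makes the comparison less sensitive to an unusually slow first call.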

I believe the best solution is to use <code>head()</code>.

Considering your example:

<pre><code>+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+
</code></pre>

Using <code>agg</code> together with the <code>max</code> function from <code>pyspark.sql.functions</code>, we can get the value as follows:
<pre><code>from pyspark.sql.functions import max
df.agg(max(df.A)).head()[0]
</code></pre>

This will return:
<code>3.0</code>

Make sure you have the correct import:
<code>from pyspark.sql.functions import max</code>
The <code>max</code> function used here is the PySpark SQL function, not Python's built-in <code>max</code>.
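
The import matters because <code>from pyspark.sql.functions import max</code> rebinds the name <code>max</code> in the current module and hides Python's built-in. Here is a minimal sketch of that shadowing effect. It uses a hypothetical stand-in function instead of the real PySpark one, purely so it runs without Spark:

```python
import builtins


def max(column_name):
    """Hypothetical stand-in for pyspark.sql.functions.max: only builds a label."""
    return f"max({column_name})"


# The module-level name `max` now refers to the stand-in, not the built-in.
print(max("A"))

# The built-in is still reachable through the builtins module.
print(builtins.max([1.0, 2.0, 3.0]))
```

A common way to avoid the clash altogether is to import the module under an alias, for example <code>from pyspark.sql import functions as F</code>, and then call <code>F.max</code>.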

Another way of doing it:

<pre><code>from pyspark.sql import functions as f

df.select(f.max(f.col("A")).alias("MAX")).limit(1).collect()[0].MAX
</code></pre>

On my data, I got these benchmarks:

<pre><code>df.select(f.max(f.col("A")).alias("MAX")).limit(1).collect()[0].MAX
CPU times: user 2.31 ms, sys: 3.31 ms, total: 5.62 ms
Wall time: 3.7 s

df.select("A").rdd.max()[0]
CPU times: user 23.2 ms, sys: 13.9 ms, total: 37.1 ms
Wall time: 10.3 s

df.agg({"A": "max"}).collect()[0][0]
CPU times: user 0 ns, sys: 4.77 ms, total: 4.77 ms
Wall time: 3.75 s
</code></pre>

All of them give the same answer.

Here is a lazy way of doing this, by simply computing table statistics:

<pre><code>df.write.mode("overwrite").saveAsTable("sampleStats")
Query = "ANALYZE TABLE sampleStats COMPUTE STATISTICS FOR COLUMNS " + ','.join(df.columns)
spark.sql(Query)

df.describe('ColName')
</code></pre>

or

<pre><code>spark.sql("Select * from sampleStats").describe('ColName')
</code></pre>

or you can open a Hive shell and run:

<pre><code>describe formatted table sampleStats;
</code></pre>

You will see the statistics (min, max, distinct, nulls, etc.) in the properties.
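
The query-building step above can be wrapped in a small helper. This is only a sketch: <code>build_analyze_query</code> is a hypothetical name, and it just assembles the statement string that would then be passed to <code>spark.sql(...)</code>.

```python
def build_analyze_query(table, columns):
    """Build an ANALYZE TABLE ... FOR COLUMNS statement for the given columns."""
    if not columns:
        raise ValueError("at least one column is required")
    return "ANALYZE TABLE {} COMPUTE STATISTICS FOR COLUMNS {}".format(
        table, ", ".join(columns)
    )


# Example with the two column names used elsewhere on this page.
query = build_analyze_query("sampleStats", ["A", "B"])
print(query)
```

Rejecting an empty column list up front avoids sending a malformed statement to Spark.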

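<pre><code>import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// Needed outside spark-shell for toDF and .as[...]; assumes a SparkSession named `spark`
import spark.implicits._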

val testDataFrame = Seq(
 (1.0, 4.0), (2.0, 5.0), (3.0, 6.0)
).toDF("A", "B")

val (maxA, maxB) = testDataFrame.select(max("A"), max("B"))
 .as[(Double, Double)]
 .first()
println(maxA, maxB)
</code></pre>

The result is (3.0,6.0), the same values that <code>testDataFrame.agg(max($"A"), max($"B")).collect()(0)</code> returns. However, that expression returns a <code>Row</code>, which prints as [3.0,6.0], rather than a tuple.

In PySpark you can do this:

<pre><code>max(df.select('ColumnName').rdd.flatMap(lambda x: x).collect())
</code></pre>

The example below shows how to get the max value of a Spark DataFrame column.

<pre><code>from pyspark.sql.functions import max

df = sql_context.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()
+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+

# Keep the DataFrame itself; show() returns None
result = df.select([max("A")])
result.show()
+------+
|max(A)|
+------+
|   3.0|
+------+

print(result.collect()[0]['max(A)'])
3.0
</code></pre>

Similarly min, mean, etc. can be calculated as shown below:

<pre><code>from pyspark.sql.functions import mean, min, max

result = df.select([mean("A"), min("A"), max("A")])
result.show()
+------+------+------+
|avg(A)|min(A)|max(A)|
+------+------+------+
|   2.0|   1.0|   3.0|
+------+------+------+
</code></pre>

First add the import line:

<code>from pyspark.sql.functions import min, max</code>

To find the min value of age in the DataFrame:

<pre><code>df.agg(min("age")).show()

+--------+
|min(age)|
+--------+
|      29|
+--------+
</code></pre>

To find the max value of age in the DataFrame:

<pre><code>df.agg(max("age")).show()

+--------+
|max(age)|
+--------+
|      77|
+--------+
</code></pre>

I'm trying to figure out the best way to get the largest value in a Spark dataframe column.

Consider the following example:

<pre><code>df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()
</code></pre>

Which creates:

<pre><code>+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+
</code></pre>

My goal is to find the largest value in column A (by inspection, this is 3.0). Using PySpark, here are four approaches I can think of:

<pre><code># Method 1: Use describe()
float(df.describe("A").filter("summary = 'max'").select("A").first().asDict()['A'])

# Method 2: Use SQL
df.registerTempTable("df_table")
spark.sql("SELECT MAX(A) as maxval FROM df_table").first().asDict()['maxval']

# Method 3: Use groupby()
df.groupby().max('A').first().asDict()['max(A)']

# Method 4: Convert to RDD
df.select("A").rdd.max()[0]
</code></pre>

Each of the above gives the right answer, but in the absence of a Spark profiling tool I can't tell which is best. 

Any ideas from either intuition or empiricism on which of the above methods is most efficient in terms of Spark runtime or resource usage, or whether there is a more direct method than the ones above?

Best way to get the max value in a Spark dataframe column
