
The following solutions apply to Spark 1.5 and later.

For less than:

<pre class="lang-scala prettyprint-override"><code>import org.apache.spark.sql.functions.lit
// keep rows where the date is earlier than 2015-03-14
data.filter(data("date").lt(lit("2015-03-14")))
</code></pre>

For greater than:

<pre class="lang-scala prettyprint-override"><code>// filter data where the date is greater than 2015-03-14
data.filter(data("date").gt(lit("2015-03-14"))) 
</code></pre>

For equality, you can use either <code>equalTo</code> or <code>===</code>:

<pre class="lang-scala prettyprint-override"><code>data.filter(data("date") === lit("2015-03-14"))
</code></pre>

If the date column of your <code>DataFrame</code> is of type <code>StringType</code>, you can convert it using the <code>to_date</code> function:

<pre class="lang-scala prettyprint-override"><code>// filter data where the date is greater than 2015-03-14
data.filter(to_date(data("date")).gt(lit("2015-03-14"))) 
</code></pre>

You can also filter by year using the <code>year</code> function:

<pre class="lang-scala prettyprint-override"><code>// filter data where the year is greater than or equal to 2016
// (the $"date" syntax needs the implicits import, e.g. spark.implicits._)
data.filter(year($"date").geq(lit(2016)))
</code></pre>


In PySpark (Python), one option is to convert the column to a Unix timestamp. We can convert the string with <code>unix_timestamp</code>, specifying its format, as shown below.
Note that we need to import the <code>unix_timestamp</code>, <code>to_date</code>, and <code>lit</code> functions.

<pre><code>from pyspark.sql.functions import unix_timestamp, to_date, lit

# parse the "MM/dd/yyyy" strings and keep the result as a new date column
df_cast = df.withColumn("tx_date", to_date(unix_timestamp(df["date"], "MM/dd/yyyy").cast("timestamp")))
</code></pre>

Now we can apply the filters:

<pre><code>df_cast.filter(df_cast["tx_date"] &gt;= lit('2017-01-01')) \
 .filter(df_cast["tx_date"] &lt;= lit('2017-01-31')).show()
</code></pre>
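
To illustrate why the <code>MM/dd/yyyy</code> strings are parsed before filtering, here is a minimal sketch in plain Python rather than Spark. The helper name <code>parse_us_date</code> and the sample values are made up for illustration; the point is only that raw strings in this format do not sort chronologically, while parsed dates do.

```python
from datetime import date, datetime


def parse_us_date(text: str) -> date:
    """Parse a 'MM/dd/yyyy' string (Python format '%m/%d/%Y') into a date."""
    return datetime.strptime(text, "%m/%d/%Y").date()


# Hypothetical sample values in the same format as the "date" column.
older, newer = "12/31/2016", "01/15/2017"

# Compared as raw strings, the 2016 date sorts after the 2017 one.
print(older > newer)  # True

# Compared as parsed dates, the order is chronological.
print(parse_us_date(older) < parse_us_date(newer))  # True

# Range check equivalent to the January 2017 filter above.
in_january = date(2017, 1, 1) <= parse_us_date(newer) <= date(2017, 1, 31)
print(in_january)  # True
```

The same reasoning is why the filter is applied to the converted <code>tx_date</code> column and not to the original string column.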


Don't use this, as suggested in other answers:
<pre><code>.filter(f.col(&quot;dateColumn&quot;) &lt; f.lit('2017-11-01'))
</code></pre>
Use this instead:
<pre><code>.filter(f.col(&quot;dateColumn&quot;) &lt; f.unix_timestamp(f.lit('2017-11-01 00:00:00')).cast('timestamp'))
</code></pre>
This uses <code>TimestampType</code> instead of <code>StringType</code>, which is more performant in some cases. For example, Parquet predicate pushdown only works with the former.
Edit: Both snippets assume this import:
<pre><code>from pyspark.sql import functions as f
</code></pre>


<pre><code>df = df.filter(df["columnname"] &gt;= '2020-01-13')
</code></pre>


I find that the most readable way to express this is with a SQL expression:

<pre><code>df.filter("my_date &lt; date'2015-01-01'")
</code></pre>

We can verify that this works correctly by looking at the physical plan from <code>.explain()</code>, where the date literal appears as an integer day count rather than a string:

<pre><code>+- *(1) Filter (isnotnull(my_date#22) &amp;&amp; (my_date#22 &lt; 16436))
</code></pre>
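
The integer <code>16436</code> in the plan is consistent with dates being stored as a count of days since 1970-01-01. A minimal sketch in plain Python (not Spark; the helper name <code>days_since_epoch</code> is made up for illustration) reproduces the number:

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def days_since_epoch(d: date) -> int:
    """Return the number of days between 1970-01-01 and d."""
    return (d - EPOCH).days


# The literal date'2015-01-01' from the filter expression.
print(days_since_epoch(date(2015, 1, 1)))  # 16436
```

Seeing a day count in the plan suggests the comparison is done on dates and not on strings.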


We can also use a SQL-style expression inside <code>filter</code>:
<hr />
<blockquote>
Note: here I am showing two conditions and a date range, for future
reference:
</blockquote>
<hr />
<pre><code>ordersDf.filter(&quot;order_status = 'PENDING_PAYMENT' AND order_date BETWEEN '2013-07-01' AND '2013-07-31' &quot;)
</code></pre>
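
As a rough illustration of what that filter string selects, here is a plain-Python stand-in (not Spark) with hypothetical sample rows. It assumes <code>order_date</code> holds plain dates; the two conditions are combined with <code>and</code>, and <code>BETWEEN</code> is modelled as a range check that includes both endpoints.

```python
from datetime import date

# Hypothetical sample rows: (order_status, order_date)
orders = [
    ("PENDING_PAYMENT", date(2013, 7, 1)),
    ("PENDING_PAYMENT", date(2013, 7, 31)),
    ("PENDING_PAYMENT", date(2013, 8, 1)),
    ("COMPLETE", date(2013, 7, 15)),
]

start, end = date(2013, 7, 1), date(2013, 7, 31)

# Same two conditions as the filter string; BETWEEN includes both endpoints.
matches = [
    (status, day)
    for status, day in orders
    if status == "PENDING_PAYMENT" and start <= day <= end
]
print(len(matches))  # 2
```

Only the two pending orders dated within July 2013 remain, including the ones that fall exactly on the first and last day of the range.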


IMHO it should be like this:
<pre><code>import java.util.Calendar

// current time as a java.util.Date, converted to the java.sql types
val jDate = Calendar.getInstance().getTime()
val sqlDateTime = new java.sql.Timestamp(jDate.getTime())
val sqlDate = new java.sql.Date(jDate.getTime())

// filter using a date literal
data.filter(data(&quot;date&quot;).gt(sqlDate))
// filter using a timestamp literal
data.filter(data(&quot;date&quot;).gt(sqlDateTime))
</code></pre>

I have a DataFrame with the following column types:

<pre><code>date, string, string
</code></pre>

I want to select dates before a certain date. I have tried the following with no luck:

<pre><code> data.filter(data("date") &lt; new java.sql.Date(format.parse("2015-03-14").getTime))
</code></pre>

I'm getting the following error:

<pre><code>org.apache.spark.sql.AnalysisException: resolved attribute(s) date#75 missing from date#72,uid#73,iid#74 in operator !Filter (date#75 &lt; 16508);
</code></pre>

As far as I can guess, the query is incorrect. Can anyone show me how the query should be formatted?

I checked that all entries in the DataFrame have values, and they do.

Filtering a spark dataframe based on date
