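I am trying to split the UTC offset out of the timestamp_value column into a new column named utc. I tried to do this in Python but could not get it to work. Thanks for your answers!
My DataFrame looks like this: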
+--------+----------------------------+
|machine |timestamp_value             |
+--------+----------------------------+
|1       |2022-01-06T07:47:37.319+0000|
|2       |2022-01-06T07:47:37.319+0000|
|3       |2022-01-06T07:47:37.319+0000|
+--------+----------------------------+
It should look like this:
+--------+----------------------------+-----+
|machine |timestamp_value             |utc  |
+--------+----------------------------+-----+
|1       |2022-01-06T07:47:37.319     |+0000|
|2       |2022-01-06T07:47:37.319     |+0000|
|3       |2022-01-06T07:47:37.319     |+0000|
+--------+----------------------------+-----+
Posted on 2022-11-17 13:16:46
You can do this with regexp_extract (to pull the offset into the new utc column) and regexp_replace (to strip the offset from timestamp_value).
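import pyspark.sql.functions as F

(df
 # Capture the '+' and everything after it (the UTC offset) into a new column
 .withColumn('utc', F.regexp_extract('timestamp_value', r'.*(\+.*)', 1))
 # Remove the '+' and everything after it from the original column
 .withColumn('timestamp_value', F.regexp_replace('timestamp_value', r'\+(.*)', ''))
).show(truncate=False)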
+-------+-----------------------+-----+
|machine|timestamp_value        |utc  |
+-------+-----------------------+-----+
|1      |2022-01-06T07:47:37.319|+0000|
|2      |2022-01-06T07:47:37.319|+0000|
|3      |2022-01-06T07:47:37.319|+0000|
+-------+-----------------------+-----+
To better understand what the regular expressions mean, take a look at this tool.
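If you want to see what the two patterns do without a Spark session, here is a minimal sketch using Python's built-in re module. The helper name split_offset is hypothetical and only for illustration. Spark uses Java regular expressions, which should behave the same way as Python's for patterns this simple.

```python
import re

# The same two patterns used in the answer
EXTRACT_PATTERN = r'.*(\+.*)'  # group 1: the last '+' and everything after it
REMOVE_PATTERN = r'\+(.*)'     # the first '+' and everything after it


def split_offset(value):
    """Return (timestamp_without_offset, offset) for one string (hypothetical helper)."""
    match = re.search(EXTRACT_PATTERN, value)
    # regexp_extract returns an empty string when nothing matches; mirror that here
    offset = match.group(1) if match else ''
    timestamp = re.sub(REMOVE_PATTERN, '', value)
    return timestamp, offset


print(split_offset('2022-01-06T07:47:37.319+0000'))
# ('2022-01-06T07:47:37.319', '+0000')

# A negative offset contains no '+', so neither pattern matches
print(split_offset('2022-01-06T07:47:37.319-0500'))
# ('2022-01-06T07:47:37.319-0500', '')
```

Note that both patterns rely on a literal '+', so a negative offset such as -0500 would be left in place. If your data can contain negative offsets, the patterns would need to be adjusted.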
https://stackoverflow.com/questions/74476037