I am trying to extract the last group of digits from expressions like the following:
urn:fb:xyx:266730227 -> 266730227
urn:fb:pqr:(urn:fb:abc:6217401,10444030746) -> 10444030746

This code works in Hive:
SELECT
CAST(regexp_extract(value.project, '.*,(\d+)', 1) AS BIGINT) AS p_id,
CAST(regexp_extract(value.context, '(\d+)', 0) AS BIGINT) AS co_id,
CAST(regexp_extract(key.identity, '(\d+)', 0) AS BIGINT) AS can_id
FROM some.dataset LIMIT 10

However, in Spark the same query throws the following error:
FAILED, exitCode: 15, (reason: User class threw exception: scala.StringContext$InvalidEscapeException: invalid escape '\d' not one of [\b, \t, \n, \f, \r, \\, \", \'] at index 94 in "
|SELECT
| *,
| CAST(regexp_extract(value.project, '.*,(\d+)', 1) AS BIGINT) AS p_id,
| CAST(regexp_extract(value.context, '(\d+)', 0) AS BIGINT) AS co_id,
| CAST(regexp_extract(key.identity, '(\d+)', 0) AS BIGINT) AS ca_id
|FROM ". Use \\ for literal \.当我尝试做错误建议时,
SELECT
CAST(regexp_extract(value.project, '.*,(\\d+)', 1) AS BIGINT) AS p_id,
CAST(regexp_extract(value.context, '(\\d+)', 0) AS BIGINT) AS co_id,
CAST(regexp_extract(key.identity, '(\\d+)', 0) AS BIGINT) AS can_id
FROM some.dataset LIMIT 10

the resulting columns are all null.
Posted on 2020-12-21 18:49:16
I have been bitten by this many times too -- in the end I gave up on using any backslashes in my regexes:
SELECT
CAST(regexp_extract(value.project, '.*,([0-9]+)', 1) AS BIGINT) AS p_id,
CAST(regexp_extract(value.context, '([0-9]+)', 0) AS BIGINT) AS co_id,
CAST(regexp_extract(key.identity, '([0-9]+)', 0) AS BIGINT) AS can_id
FROM some.dataset LIMIT 10
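To illustrate why this sidesteps the escaping problem entirely, here is a minimal sketch (assuming the query is submitted from Scala via spark.sql; some.dataset and the column names are just the ones from the question): the character class [0-9] contains no backslash, so there is nothing for either the Scala compiler or the Spark SQL parser to mangle.

// sketch: the same query embedded in a Scala multi-line string;
// with no backslash at any layer, no escaping rule applies
val ids = spark.sql("""
  SELECT
    CAST(regexp_extract(value.project, '.*,([0-9]+)', 1) AS BIGINT) AS p_id,
    CAST(regexp_extract(value.context, '([0-9]+)', 0) AS BIGINT) AS co_id,
    CAST(regexp_extract(key.identity, '([0-9]+)', 0) AS BIGINT) AS can_id
  FROM some.dataset LIMIT 10""")
ids.show(false)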
Posted on 2020-12-22 06:10:30

It all comes down to how characters are escaped in string literals. In a regular Scala string literal the backslash is an escape character, so a literal backslash (such as the one in a regex pattern) must itself be escaped with another backslash. For example:
val df = Seq("urn:fb:xyx:266730227", "urn:fb:pqr:(urn:fb:abc:6217401,10444030746)").toDF("value")
df.createOrReplaceTempView("tbl")
// pattern as a regular string literal
df.withColumn("p_id", regexp_extract($"value", "(\\d+)", 1)).show(false)
+-------------------------------------------+---------+
|value |p_id |
+-------------------------------------------+---------+
|urn:fb:xyx:266730227 |266730227|
|urn:fb:pqr:(urn:fb:abc:6217401,10444030746)|6217401 |
+-------------------------------------------+---------+

You can avoid this by writing the pattern as a raw string or as a multi-line string (as @mazaneicha mentioned):
// pattern as a raw string, keep backslash as-is
df.withColumn("p_id", regexp_extract($"value", raw"(\d+)", 1)).show(false)
+-------------------------------------------+---------+
|value |p_id |
+-------------------------------------------+---------+
|urn:fb:xyx:266730227 |266730227|
|urn:fb:pqr:(urn:fb:abc:6217401,10444030746)|6217401 |
+-------------------------------------------+---------+
// pattern as a multi-line string where backslash is not escaped
df.withColumn("p_id", regexp_extract($"value", """(\d+)""", 1)).show(false)
+-------------------------------------------+---------+
|value |p_id |
+-------------------------------------------+---------+
|urn:fb:xyx:266730227 |266730227|
|urn:fb:pqr:(urn:fb:abc:6217401,10444030746)|6217401 |
+-------------------------------------------+---------+

When backslashes and regexp_extract (or regexp_replace, split, str_to_map, etc.) appear inside a SQL expression, which is basically a string, e.g. in expr(), df.selectExpr(), spark.sql(), df.filter(), df.where(), etc., you will have to double-escape the backslashes, for example:
// regular string literals
spark.sql("select *, regexp_extract(value, '(\\\\d+)', 1) as p_id from tbl").show
// raw string to SQL expression
spark.sql(raw"select *, regexp_extract(value, '(\\d+)', 1) as p_id from tbl").show
// multi-line string to SQL expression
spark.sql("""select *, regexp_extract(value, '(\\d+)', 1) as p_id from tbl""").show
df.withColumn("p_id", expr("regexp_extract(value, '(\\\\d+)', 1)")).show(false)
df.withColumn("p_id", expr(raw"regexp_extract(value, '(\\d+)', 1)")).show(false)
df.withColumn("p_id", expr("""regexp_extract(value, '(\\d+)', 1)""")).show(false)
df.filter("value rlike '\\\\d'").show
df.filter(raw"value rlike '\\d'").show
df.filter("""value rlike '\\d'""").show注意:表达式中没有处理Scala原始字符串或多行字符串的语法。要做同样的事情,可以设置spark.sql.parser.escapedStringLiterals=true (默认值是false,参考文献),下面是链接
buildConf("spark.sql.parser.escapedStringLiterals") ESCAPED_STRING_LITERALS =..doc(“当为真时,字符串文本(包括regex模式)仍然在SQL”+“解析器中转义。自Spark2.0以来默认为false。将其设置为true可以恢复Spark2.0之前的行为”+“) .version("2.2.1") .booleanConf .createWithDefault(false)
Example:
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
df.withColumn("p_id", expr("regexp_extract(value, '(\\d+)', 1)")).show(false)
df.withColumn("p_id", expr(raw"regexp_extract(value, '(\d+)', 1)")).show(false)
spark.sql("select *, regexp_extract(value, '(\\d+)', 1) as p_id from tbl").show
spark.sql("""select *, regexp_extract(value, '(\d+)', 1) as p_id from tbl""").show
df.filter(raw"value rlike '\d'").show侧注:上面讨论的是如何在字符串中转义文字反斜杠,如果您想要实际转义字符(即换行符\n、TAB \t、NUL char \0或\u0000等),则不需要额外的反斜杠,例如:
// merge multiple lines into one line
spark.sql("select *, regexp_replace(x,'\n+',',') as y from values ('1,2\n3\n\n4') as (x)").show
// split string into an array using NUL char and/or TAB
spark.sql("select *, split(x,'[\t\u0000]+') as y from values ('s\u0000x\ty\tz') as (x)").showhttps://stackoverflow.com/questions/65398231
https://stackoverflow.com/questions/65398231