语法总览

最近更新时间:2026-05-20 14:11:22

我的收藏

描述

Spark 支持 SELECT 语句并符合 ANSI SQL 标准。查询用于从一个或多个表中检索结果集。

语法

[ WITH with_query [ , ... ] ]
select_statement [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_statement, ... ]
[ ORDER BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] } ]
[ SORT BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] } ]
[ CLUSTER BY { expression [ , ... ] } ]
[ DISTRIBUTE BY { expression [, ... ] } ]
[ WINDOW { named_window [ , WINDOW named_window, ... ] } ]
[ LIMIT { ALL | expression } ]
其中 select_statement 定义为:
SELECT [ hints , ... ] [ ALL | DISTINCT ] { named_expression | regex_column_names | TRANSFORM (...) }
FROM { from_item [ , ... ] }
[ PIVOT clause ]
[ UNPIVOT clause ]
[ LATERAL VIEW clause ] [ ... ]
[ WHERE boolean_expression ]
[ GROUP BY expression [ , ... ] ]
[ HAVING boolean_expression ]

参数

子句/关键字
说明
with_query
在主查询块之前指定公共表表达式(CTE)
hints
提示帮助 Spark 优化器做出更好的规划决策
ALL
选择所有匹配的行(默认)
DISTINCT
去除结果中的重复行
named_expression
带有赋值名称的表达式。语法: expression [[AS] alias]
from_item
查询的输入源:表、JOIN、子查询、内联表等。
WHERE
根据谓词过滤 FROM 子句的结果
GROUP BY
指定用于分组行的表达式
HAVING
指定用于过滤 GROUP BY 产生的行的谓词
ORDER BY
对完整结果集排序
SORT BY
在每个分区内排序
CLUSTER BY
重新分区并排序
DISTRIBUTE BY
重新分区但不排序
LIMIT
限制返回的最大行数

示例

-- 创建测试表
CREATE EXTERNAL TABLE sel_t1 (id INT, name STRING, age INT)
USING PARQUET LOCATION 'cosn://<your_cos_bucket>/test_select/sel_t1';
INSERT INTO sel_t1 VALUES (1,'Alice',25),(2,'Bob',30),(3,'Charlie',25);

SELECT * FROM sel_t1;
SELECT DISTINCT age FROM sel_t1;
SELECT id, name FROM sel_t1 WHERE age > 25;
SELECT age, COUNT(*) AS cnt FROM sel_t1 GROUP BY age;
SELECT age, COUNT(*) AS cnt FROM sel_t1 GROUP BY age HAVING cnt > 1;
SELECT * FROM sel_t1 ORDER BY age DESC;
SELECT * FROM sel_t1 LIMIT 2;
SELECT id AS student_id, name AS student_name FROM sel_t1;
SELECT 1 AS one, 'hello' AS greeting;