描述
Spark 支持
SELECT 语句并符合 ANSI SQL 标准。查询用于从一个或多个表中检索结果集。语法
[ WITH with_query [ , ... ] ]select_statement [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_statement, ... ][ ORDER BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] } ][ SORT BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] } ][ CLUSTER BY { expression [ , ... ] } ][ DISTRIBUTE BY { expression [, ... ] } ][ WINDOW { named_window [ , WINDOW named_window, ... ] } ][ LIMIT { ALL | expression } ]
其中
select_statement 定义为:SELECT [ hints , ... ] [ ALL | DISTINCT ] { named_expression | regex_column_names | TRANSFORM (...) }FROM { from_item [ , ... ] }[ PIVOT clause ][ UNPIVOT clause ][ LATERAL VIEW clause ] [ ... ][ WHERE boolean_expression ][ GROUP BY expression [ , ... ] ][ HAVING boolean_expression ]
参数
子句/关键字 | 说明 |
with_query | 在主查询块之前指定公共表表达式(CTE) |
hints | 提示帮助 Spark 优化器做出更好的规划决策 |
ALL | 选择所有匹配的行(默认) |
DISTINCT | 去除结果中的重复行 |
named_expression | 带有赋值名称的表达式。语法: expression [[AS] alias] |
from_item | 查询的输入源:表、JOIN、子查询、内联表等。 |
WHERE | 根据谓词过滤 FROM 子句的结果 |
GROUP BY | 指定用于分组行的表达式 |
HAVING | 指定用于过滤 GROUP BY 产生的行的谓词 |
ORDER BY | 对完整结果集排序 |
SORT BY | 在每个分区内排序 |
CLUSTER BY | 重新分区并排序 |
DISTRIBUTE BY | 重新分区但不排序 |
LIMIT | 限制返回的最大行数 |
示例
-- 创建测试表CREATE EXTERNAL TABLE sel_t1 (id INT, name STRING, age INT)USING PARQUET LOCATION 'cosn://<your_cos_bucket>/test_select/sel_t1';INSERT INTO sel_t1 VALUES (1,'Alice',25),(2,'Bob',30),(3,'Charlie',25);SELECT * FROM sel_t1;SELECT DISTINCT age FROM sel_t1;SELECT id, name FROM sel_t1 WHERE age > 25;SELECT age, COUNT(*) AS cnt FROM sel_t1 GROUP BY age;SELECT age, COUNT(*) AS cnt FROM sel_t1 GROUP BY age HAVING cnt > 1;SELECT * FROM sel_t1 ORDER BY age DESC;SELECT * FROM sel_t1 LIMIT 2;SELECT id AS student_id, name AS student_name FROM sel_t1;SELECT 1 AS one, 'hello' AS greeting;