SELECT Statement: Retrieves data rows from zero or more tables.Syntax
[ WITH with_query [, ...] ]SELECT [ ALL | DISTINCT ] select_expression [, ...][ FROM from_item [, ...] ][ WHERE condition ][ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ][HAVING condition ][ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ][ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ][ LIMIT [ count | ALL ] ]
Category
[ WITH with_query [, ....] ]
The WITH clause can be used to flatten nested queries or simplify subqueries. The syntax for
with_query is as follows:subquery_table_name [ ( column_name [, ...] ) ] AS (subquery)
The subquery_table_name is the unique name of the temporary table that defines the result of the WITH clause subquery. Each subquery must have a table name that can be referenced in the FROM clause.
The column_name [, ...] is an optional list of output column names. The number of column names must be equal to or less than the number of columns defined by the subquery.
A subquery is any query statement.
[ ALL | DISTINCT ] select_expr
The ALL and DISTINCT options specify whether duplicate rows should be returned. If these options are not provided, the default is ALL (return all matching rows). DISTINCT specifies the removal of duplicate rows from the result set.
FROM from_item [, ...]
The from_item can be a view, table, or subquery. If multiple tables are joined, the supported join types are as follows:
[ INNER ] JOINLEFT [ OUTER ] JOINRIGHT [ OUTER ] JOINFULL [ OUTER ] JOINCROSS JOINON join_condition, if join_condition is used, you can specify column names for join keys in multiple tables; if join_column is used, it requires that join_column exists in both tables.[ WHERE condition ]
The results are filtered based on the condition you specify, returning a result set that meets the criteria.
[ GROUP BY [ ALL | DISTINCT ] grouping_expressions [, ...] ]
The GROUP BY expression allows for the grouping of output according to specified column names.
[HAVING condition ]
Used in conjunction with aggregate functions and the GROUP BY clause. It controls which groups are selected, thereby eliminating groups that do not meet the condition. This filtering occurs after the calculation of groups and aggregation.
[{UNION | INTERSECT | EXCEPT} [ALL | DISTINCT] union_query]
UNION, INTERSECT, and EXCEPT combine multiple results together, with UNION combining the rows generated by the first query with those generated by the second. To eliminate duplicates, UNION constructs a hash table, which consumes memory. For better performance, it is recommended to use UNION ALL.INTERSECT only returns the rows that exist in the results of both the first and second queries.EXCEPT returns the rows in the first query result, excluding the rows found in the second query.[ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
The result set is sorted by one or more output expressions. When the clause contains multiple expressions, the result set is sorted according to the first
expression. Subsequently, the second expression is applied to rows with matching values in the first expression, and so on.Sample
WITH Clause
The WITH clause defines named relations used in the query. It allows for the flattening of nested queries or the simplification of subqueries. For instance, the following queries are equivalent:
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)SELECT a, b FROM x;
It can also follow multiple subqueries:
WITHt1 AS (SELECT a, MAX(b) AS b FROM x GROUP BY a),t2 AS (SELECT a, AVG(d) AS d FROM y GROUP BY a)SELECT t1.*, t2.*FROM t1JOIN t2 ON t1.a = t2.a;
GROUP BY Clause
The GROUPBY clause divides the output of the SELECT statement into groups of rows with matching values. A simple GROUPBY clause can contain any expression composed of input columns, or it can be an ordinal that selects output columns by position:
SELECT count(*), nationkey FROM customer GROUP BY 2;
SELECT count(*), nationkey FROM customer GROUP BY nationkey;
SELECT count(*) FROM customer GROUP BY mktsegment;
GROUPING SETS
Grouping sets allow users to specify multiple lists of columns to be grouped. Columns not belonging to the given sub-list of grouping columns are set to null.
SELECT origin_state, origin_zip, destination_state, sum(package_weight)FROM shippingGROUP BY GROUPING SETS ((origin_state),(origin_state, origin_zip),(destination_state));
SELECT origin_state, NULL, NULL, sum(package_weight)FROM shipping GROUP BY origin_stateUNION ALLSELECT origin_state, origin_zip, NULL, sum(package_weight)FROM shipping GROUP BY origin_state, origin_zipUNION ALLSELECT NULL, NULL, destination_state, sum(package_weight)FROM shipping GROUP BY destination_state;
HAVING Clause
The HAVING clause is used in conjunction with aggregate functions and the GROUP BY clause to control which groups are selected. The HAVING clause eliminates groups that do not meet the given conditions.
SELECT count(*), mktsegment, nationkey,CAST(sum(acctbal) AS bigint) AS totalbalFROM customerGROUP BY mktsegment, nationkeyHAVING sum(acctbal) > 5700000ORDER BY totalbal DESC;
IN
The IN operator allows you to specify multiple values in the WHERE clause.
SELECT nameFROM nationWHERE regionkey IN (SELECT regionkey FROM region)
EXISTS
The EXISTS operator is used to determine whether there are records in the query clause. If one or more records exist, it returns True; otherwise, it returns False.
SELECT column_name(s)FROM table_nameWHERE EXISTS(SELECT column_name FROM table_name WHERE condition)
USING
Simplified using the 'using' keyword.
The query must be an equijoin.
In an equijoin, the columns must have identical names and data types.
SELECT *FROM table_1JOIN table_2USING (key_A, key_B)
SELECT *FROM (VALUES(1, 3, 10),(2, 4, 20)) AS table_1 (key_A, key_B, y1)LEFT JOIN (VALUES(1, 3, 100),(2, 4, 200)) AS table_2 (key_A, key_B, y2)USING (key_A, key_B);
CROSS JOIN
A cross join returns the Cartesian product of two relations (all combinations).
SELECT *FROM nationCROSS JOIN region
LIMIT Clause
The LIMIT clause restricts the number of rows in the result set.
SELECT orderdate FROM orders LIMIT 5
ORDER BY Clause
The ORDER BY clause is used to sort the result set based on one or more output expressions.
Syntax: ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...]
SELECT name, age FROM person ORDER BY age
SELECT * FROM studentORDER BY student_id
SELECT * FROM studentORDER BY student_id,student_name
EXCEPT
The EXCEPT clause/operator is used to combine two SELECT statements and returns rows from the first SELECT statement that are not returned by the second SELECT statement. This means that EXCEPT only returns rows that are not available in the second SELECT statement.
The same rules apply when using the EXCEPT operator, as with the UNION operation.
SELECT * FROM (VALUES 13, 42)EXCEPTSELECT 13
INTERSECT
Returns distinct rows from two or more result sets from the SELECT statement.
SELECT * FROM (VALUES 13, 42)INTERSECTSELECT 1
UNION
Combine the result sets of two or more SELECT statements into one result set. To retain duplicate rows in the result set, use the
UNION ALL operator.SELECT 13UNIONSELECT 42
SELECT id FROM aUNION ALLSELECT id FROM b;
TABLESAMPLE
BERNOULLI: Selects each row as a table sample with a probability equal to the sample percentage. When sampling a table using the Bernoulli method, all physical blocks of the table are scanned and certain rows are skipped (based on a comparison between the sampling percentage and a randomly calculated value at runtime). The probability of a row being included in the results is independent of any other row. This does not reduce the time required to read the sample table from the disk. Further processing of the sample output may affect the total query time.SYSTEM: This sampling method divides the table into logical data segments and samples the table at this granularity. This sampling method either selects all rows from a specific data segment or skips it (based on the comparison between the sampling percentage and the randomly calculated value at runtime). The rows selected in system sampling will depend on the connector used. For example, when used with Hive, it depends on how the data is laid out on HDFS. This method cannot guarantee independent sampling probabilities.SELECT *FROM users TABLESAMPLE BERNOULLI (50);
SELECT *FROM users TABLESAMPLE SYSTEM (75);
PIVOT Clause
Returns aggregated values based on specific columns.
SELECT * FROM personPIVOT (SUM(age) AS a, AVG(class) AS cFOR name IN ('John' AS john, 'Mike' AS mike));
Lateral View Clause
LATERAL VIEW [ OUTER ] generator_function ( expression [ , ... ] ) [ table_alias ] AS column_alias [ , ... ]
Escaping
To escape a single quote, precede it with another single quote, as demonstrated in the following example:
Select 'dlc''test'
Specify an escape character when creating a table, for instance, by using the following method:
CREATE EXTERNAL TABLE IF NOT EXISTScsv_test_2222(_c0STRING,_c1INTEGER,_c2INTEGER,_c3INTEGER)ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ('separatorChar' = '''','quoteChar' = '''')STORED AStextfileLOCATION 'cosn://dlc-nj-1258469122/csv/100M/