Help & Documentation>Data Lake Compute

SELECT STATEMENT

Last updated: 2024-01-12 10:28:33

SELECT Statement: Retrieves data rows from zero or more tables.

Syntax

[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[HAVING condition ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ LIMIT [ count | ALL ] ]

Category

[ WITH with_query [, ....] ]

The WITH clause can be used to flatten nested queries or simplify subqueries. The syntax for with_query is as follows:
subquery_table_name [ ( column_name [, ...] ) ] AS (subquery)
The subquery_table_name is the unique name of the temporary table that defines the result of the WITH clause subquery. Each subquery must have a table name that can be referenced in the FROM clause.
The column_name [, ...] is an optional list of output column names. The number of column names must be equal to or less than the number of columns defined by the subquery.
A subquery is any query statement.

[ ALL | DISTINCT ] select_expr

The ALL and DISTINCT options specify whether duplicate rows should be returned. If these options are not provided, the default is ALL (return all matching rows). DISTINCT specifies the removal of duplicate rows from the result set.

FROM from_item [, ...]

The from_item can be a view, table, or subquery. If multiple tables are joined, the supported join types are as follows:
[ INNER ] JOIN
LEFT [ OUTER ] JOIN
RIGHT [ OUTER ] JOIN
FULL [ OUTER ] JOIN
CROSS JOIN
ON join_condition, if join_condition is used, you can specify column names for join keys in multiple tables; if join_column is used, it requires that join_column exists in both tables.

[ WHERE condition ]

The results are filtered based on the condition you specify, returning a result set that meets the criteria.

[ GROUP BY [ ALL | DISTINCT ] grouping_expressions [, ...] ]

The GROUP BY expression allows for the grouping of output according to specified column names.

[HAVING condition ]

Used in conjunction with aggregate functions and the GROUP BY clause. It controls which groups are selected, thereby eliminating groups that do not meet the condition. This filtering occurs after the calculation of groups and aggregation.

[{UNION | INTERSECT | EXCEPT} [ALL | DISTINCT] union_query]

UNION, INTERSECT, and EXCEPT combine multiple results together, with UNION combining the rows generated by the first query with those generated by the second. To eliminate duplicates, UNION constructs a hash table, which consumes memory. For better performance, it is recommended to use UNION ALL.
INTERSECT only returns the rows that exist in the results of both the first and second queries.
EXCEPT returns the rows in the first query result, excluding the rows found in the second query.

[ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]

The result set is sorted by one or more output expressions. When the clause contains multiple expressions, the result set is sorted according to the first expression. Subsequently, the second expression is applied to rows with matching values in the first expression, and so on.

Sample

WITH Clause

The WITH clause defines named relations used in the query. It allows for the flattening of nested queries or the simplification of subqueries. For instance, the following queries are equivalent:
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;
It can also follow multiple subqueries:
WITH
t1 AS (SELECT a, MAX(b) AS b FROM x GROUP BY a),
t2 AS (SELECT a, AVG(d) AS d FROM y GROUP BY a)
SELECT t1.*, t2.*
FROM t1
JOIN t2 ON t1.a = t2.a;

GROUP BY Clause

The GROUPBY clause divides the output of the SELECT statement into groups of rows with matching values. A simple GROUPBY clause can contain any expression composed of input columns, or it can be an ordinal that selects output columns by position:
SELECT count(*), nationkey FROM customer GROUP BY 2;
SELECT count(*), nationkey FROM customer GROUP BY nationkey;
SELECT count(*) FROM customer GROUP BY mktsegment;

GROUPING SETS

Grouping sets allow users to specify multiple lists of columns to be grouped. Columns not belonging to the given sub-list of grouping columns are set to null.
SELECT origin_state, origin_zip, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_state),
(origin_state, origin_zip),
(destination_state));
SELECT origin_state, NULL, NULL, sum(package_weight)
FROM shipping GROUP BY origin_state
UNION ALL
SELECT origin_state, origin_zip, NULL, sum(package_weight)
FROM shipping GROUP BY origin_state, origin_zip
UNION ALL
SELECT NULL, NULL, destination_state, sum(package_weight)
FROM shipping GROUP BY destination_state;

HAVING Clause

The HAVING clause is used in conjunction with aggregate functions and the GROUP BY clause to control which groups are selected. The HAVING clause eliminates groups that do not meet the given conditions.
SELECT count(*), mktsegment, nationkey,
CAST(sum(acctbal) AS bigint) AS totalbal
FROM customer
GROUP BY mktsegment, nationkey
HAVING sum(acctbal) > 5700000
ORDER BY totalbal DESC;

IN

The IN operator allows you to specify multiple values in the WHERE clause.
SELECT name
FROM nation
WHERE regionkey IN (SELECT regionkey FROM region)

EXISTS

The EXISTS operator is used to determine whether there are records in the query clause. If one or more records exist, it returns True; otherwise, it returns False.
SELECT column_name(s)
FROM table_name
WHERE EXISTS
(SELECT column_name FROM table_name WHERE condition)

USING

Simplified using the 'using' keyword.
The query must be an equijoin.
In an equijoin, the columns must have identical names and data types.
SELECT *
FROM table_1
JOIN table_2
USING (key_A, key_B)
SELECT *
FROM (
VALUES
(1, 3, 10),
(2, 4, 20)
) AS table_1 (key_A, key_B, y1)
LEFT JOIN (
VALUES
(1, 3, 100),
(2, 4, 200)
) AS table_2 (key_A, key_B, y2)
USING (key_A, key_B);

CROSS JOIN

A cross join returns the Cartesian product of two relations (all combinations).
SELECT *
FROM nation
CROSS JOIN region

LIMIT Clause

The LIMIT clause restricts the number of rows in the result set.
SELECT orderdate FROM orders LIMIT 5

ORDER BY Clause

The ORDER BY clause is used to sort the result set based on one or more output expressions.
Syntax: ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...]
SELECT name, age FROM person ORDER BY age
SELECT * FROM student
ORDER BY student_id
SELECT * FROM student
ORDER BY student_id,student_name

EXCEPT

The EXCEPT clause/operator is used to combine two SELECT statements and returns rows from the first SELECT statement that are not returned by the second SELECT statement. This means that EXCEPT only returns rows that are not available in the second SELECT statement.
The same rules apply when using the EXCEPT operator, as with the UNION operation.
SELECT * FROM (VALUES 13, 42)
EXCEPT
SELECT 13

INTERSECT

Returns distinct rows from two or more result sets from the SELECT statement.
SELECT * FROM (VALUES 13, 42)
INTERSECT
SELECT 1

UNION

Combine the result sets of two or more SELECT statements into one result set. To retain duplicate rows in the result set, use the UNION ALL operator.
SELECT 13
UNION
SELECT 42
SELECT id FROM a
UNION ALL
SELECT id FROM b;

TABLESAMPLE

BERNOULLI: Selects each row as a table sample with a probability equal to the sample percentage. When sampling a table using the Bernoulli method, all physical blocks of the table are scanned and certain rows are skipped (based on a comparison between the sampling percentage and a randomly calculated value at runtime). The probability of a row being included in the results is independent of any other row. This does not reduce the time required to read the sample table from the disk. Further processing of the sample output may affect the total query time.
SYSTEM: This sampling method divides the table into logical data segments and samples the table at this granularity. This sampling method either selects all rows from a specific data segment or skips it (based on the comparison between the sampling percentage and the randomly calculated value at runtime). The rows selected in system sampling will depend on the connector used. For example, when used with Hive, it depends on how the data is laid out on HDFS. This method cannot guarantee independent sampling probabilities.
SELECT *
FROM users TABLESAMPLE BERNOULLI (50);
SELECT *
FROM users TABLESAMPLE SYSTEM (75);

PIVOT Clause

Returns aggregated values based on specific columns.
SELECT * FROM person
PIVOT (
SUM(age) AS a, AVG(class) AS c
FOR name IN ('John' AS john, 'Mike' AS mike)
);

Lateral View Clause

LATERAL VIEW [ OUTER ] generator_function ( expression [ , ... ] ) [ table_alias ] AS column_alias [ , ... ]

Escaping

To escape a single quote, precede it with another single quote, as demonstrated in the following example:
Select 'dlc''test'
Specify an escape character when creating a table, for instance, by using the following method:
CREATE EXTERNAL TABLE IF NOT EXISTS csv_test_2222 (
_c0 STRING,
_c1 INTEGER,
_c2 INTEGER,
_c3 INTEGER
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'separatorChar' = '''',
'quoteChar' = ''''
)
STORED AS textfile
LOCATION 'cosn://dlc-nj-1258469122/csv/100M/