首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >专栏 >十五、Hive 窗口函数

十五、Hive 窗口函数

作者头像
IvanCodes
发布2025-09-28 11:44:40
发布2025-09-28 11:44:40
2900
代码可运行
举报
运行总次数:0
代码可运行

作者:IvanCodes 日期:2025年5月22日 专栏:Hive教程

Hive中的窗口函数为复杂数据分析提供了强大的支持。它们在不改变原始行数的前提下,对与当前行相关的“窗口”内数据进行计算。核心在于OVER()子句,它定义了窗口的范围(分区、排序、帧)。

思维导图
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
一、核心语法回顾
代码语言:javascript
代码运行次数:0
运行
复制
window_function([args]) OVER ([PARTITION BY cols] [ORDER BY cols [ASC|DESC]] [window_frame])
二、排名函数 (Ranking Functions) 用法与示例

用于为分区内的行分配名次。

  • ROW_NUMBER(): 唯一连续排名。 示例:为每个部门的员工按薪水从高到低编号。
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as rank_in_dept
FROM employees;
  • RANK(): 并列跳号排名。 示例:同上,但薪水相同者名次相同。
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank_in_dept
FROM employees;
  • DENSE_RANK(): 并列不跳号排名。 示例:同上,但排名更紧凑。
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank_in_dept
FROM employees;
  • NTILE(n): 将分区内数据平均分成n个桶,返回桶号。 示例:将每个部门的员工按薪水分为高、中、低三档。
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary,
NTILE(3) OVER (PARTITION BY department ORDER BY salary DESC) as salary_tier
FROM employees;
三、分析函数 (Analytic Functions) / 值函数 (Value Functions) 用法与示例

用于获取窗口内其他行的值。

  • LAG(col, offset, default_value): 取当前行在分区内排序后向上第offset行的col值。 示例:计算每位员工当月销售额与上月销售额的比较。
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_id, sale_month, sales_amount,
LAG(sales_amount, 1, 0) OVER (PARTITION BY emp_id ORDER BY sale_month) as prev_month_sales
FROM monthly_sales;
  • LEAD(col, offset, default_value): 取当前行在分区内排序后向下第offset行的col值。 示例:预测下一次购买时间。
代码语言:javascript
代码运行次数:0
运行
复制
SELECT user_id, purchase_date,
LEAD(purchase_date, 1) OVER (PARTITION BY user_id ORDER BY purchase_date) as next_purchase_date
FROM purchases;
  • FIRST_VALUE(col): 取当前窗口框架内 第一行的col值。 示例:显示每个部门中薪水最低的员工的薪水。
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary,
FIRST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lowest_salary_in_dept
FROM employees;
  • LAST_VALUE(col): 取当前窗口框架内 最后一行的col值。 获取分区真正最后一个值示例
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary,
LAST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as highest_salary_in_dept
FROM employees;
四、聚合函数作为窗口函数 (Aggregate Functions as Window Functions) 用法与示例

标准聚合函数 (SUM, AVG, COUNT, MAX, MIN) 可与 OVER() 结合。

  • SUM() OVER(...): 累积求和或分区总和。 示例1 (分区总和)
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary,
SUM(salary) OVER (PARTITION BY department) as total_dept_salary
FROM employees;

示例2 (累积求和)

代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary, emp_id,
SUM(salary) OVER (PARTITION BY department ORDER BY emp_id ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_salary_by_id
FROM employees;
  • AVG() OVER(...): 移动平均或分区平均。 示例 (3日移动平均销售额)
代码语言:javascript
代码运行次数:0
运行
复制
SELECT sale_date, sales_amount,
AVG(sales_amount) OVER (ORDER BY sale_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_avg_3day
FROM daily_sales;
  • COUNT() OVER(...): 分区内计数或累积计数。 示例
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department,
COUNT(*) OVER (PARTITION BY department) as num_employees_in_dept
FROM employees;
  • MAX() OVER(...) / MIN() OVER(...): 分区内最大/最小值。 示例
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary,
MAX(salary) OVER (PARTITION BY department) as max_salary_in_dept
FROM employees;
五、窗口帧 (Window Frame Clause) 的详细用法与示例

精确定义聚合计算的行范围,主要与 ORDER BY 配合。 {ROWS | RANGE} BETWEEN frame_start AND frame_end

  • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: 从分区开始到当前行。 示例 (默认累积行为):计算每个产品按日期顺序的累计销售额。
代码语言:javascript
代码运行次数:0
运行
复制
SELECT product_id, sale_date, sales_amount,
SUM(sales_amount) OVER (PARTITION BY product_id ORDER BY sale_date) as cumulative_sales_default_frame -- 默认即此帧
FROM product_sales;

等同于:

代码语言:javascript
代码运行次数:0
运行
复制
SELECT product_id, sale_date, sales_amount,
SUM(sales_amount) OVER (PARTITION BY product_id ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_sales_explicit_frame
FROM product_sales;
  • ROWS BETWEEN N PRECEDING AND CURRENT ROW: 当前行及前N行。 示例 (最近3次活动的平均时长)
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_id, activity_date, duration_minutes,
AVG(duration_minutes) OVER (PARTITION BY emp_id ORDER BY activity_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as avg_duration_last_3_activities
FROM employee_activity;
  • ROWS BETWEEN CURRENT ROW AND N FOLLOWING: 当前行及后N行。 示例 (当前及未来2次访问的总页面数)
代码语言:javascript
代码运行次数:0
运行
复制
SELECT session_id, start_time, pages_viewed,
SUM(pages_viewed) OVER (PARTITION BY user_id ORDER BY start_time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) as total_pages_current_and_next_2_sessions
FROM web_sessions;
  • ROWS BETWEEN N PRECEDING AND M FOLLOWING: 当前行前N行到后M行。 示例 (中心化的5日移动平均,当前行为中心)
代码语言:javascript
代码运行次数:0
运行
复制
SELECT sale_date, sales_amount,
AVG(sales_amount) OVER (ORDER BY sale_date ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) as centered_moving_avg_5day
FROM daily_sales;
  • ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: 整个分区。 示例 (用于 FIRST_VALUE/LAST_VALUE 获取分区首尾值)
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_name, department, salary,
FIRST_VALUE(emp_name) OVER (PARTITION BY department ORDER BY salary ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lowest_paid_emp_in_dept,
LAST_VALUE(emp_name) OVER (PARTITION BY department ORDER BY salary ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as highest_paid_emp_in_dept
FROM employees;

RANGE BETWEEN 的说明: 基于 ORDER BY 列的值来定义边界。Hive 对 RANGE 的支持,特别是涉及 INTERVAL 的复杂场景,可能不如其他数据库完善或直观,ROWS 通常更常用且行为更可预测。

总结: Hive 窗口函数通过灵活的分区、排序和窗口帧定义,极大增强了SQL在数据分析中的表达能力。掌握不同类型函数的特性和组合,是进行深度数据洞察的关键。


练习题

背景 Hive 表:

表名

字段

product_sales_hive

product_id INT, sale_date DATE, sales_amount DECIMAL(10,2), region STRING, category STRING

employee_activity_hive

emp_id INT, activity_date DATE, activity_type STRING, duration_minutes INT, department STRING

web_sessions_hive

session_id STRING, user_id INT, start_time TIMESTAMP, end_time TIMESTAMP, country STRING

(请自行准备样例数据)

题目:

  1. product_sales_hive 表中每个 category 内的产品,按 sales_amount 从高到低进行 DENSE_RANK() 排名。
  2. 计算 employee_activity_hive 表中每个员工 (emp_id) 的总活动时长 (duration_minutes),并将此总时长显示在该员工的每条活动记录旁。
  3. 对于 product_sales_hive 表,计算每个产品 (product_id) 在每个销售日 (sale_date) 的销售额与该产品上一个销售日的销售额的差额。
  4. 找出 employee_activity_hive 表中每个 department 内,activity_type 为 ‘Meeting’ 且 duration_minutes 最长的前2条记录 (如果并列,选择任意2条)。
  5. 计算 product_sales_hive 表中每个 regionsale_date 升序的累计销售总额。
  6. employee_activity_hive 表中 department = 'Engineering' 的员工,按其在 activity_date 介于 ‘2023-01-01’ 和 ‘2023-03-31’ 之间的总 duration_minutes 分为4个等级 (使用 NTILE(4))。
  7. 对于 web_sessions_hive 表,找出每个用户 (user_id) 当前会话 (session_id) 开始后,其下一个会话的开始时间。
  8. 计算 product_sales_hive 表中每个产品当前销售日期及其前后各1个销售日 (共3个销售日窗口) 的平均销售额,按 region 分区。
  9. employee_activity_hive 表中,找出每个 department 中,duration_minutes 最短的 activity_type (如果多个活动类型时长并列最短,显示一个即可,并显示该时长)。
  10. 对于 product_sales_hive 表,计算每次销售的 sales_amount 占该 product_id 在该 region 总销售额的百分比。
答案:
  1. 分类内产品销售额排名:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT product_id, category, sales_amount,
DENSE_RANK() OVER (PARTITION BY category ORDER BY sales_amount DESC) as rank_in_category
FROM product_sales_hive;
  1. 员工总活动时长:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_id, activity_date, activity_type, duration_minutes,
SUM(duration_minutes) OVER (PARTITION BY emp_id) as total_emp_duration
FROM employee_activity_hive;
  1. 产品日销售额与上一日差额:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT product_id, sale_date, sales_amount,
sales_amount - LAG(sales_amount, 1, 0.0) OVER (PARTITION BY product_id ORDER BY sale_date) as diff_from_prev_day_sales
FROM product_sales_hive;
  1. 部门会议时长Top2记录:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_id, activity_date, duration_minutes, department
FROM (
SELECT emp_id, activity_date, duration_minutes, department,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY duration_minutes DESC) as rn
FROM employee_activity_hive
WHERE activity_type = 'Meeting'
) ranked_meetings
WHERE rn <= 2;
  1. 区域累计销售总额:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT region, sale_date, sales_amount,
SUM(sales_amount) OVER (PARTITION BY region ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_region_sales
FROM product_sales_hive;
  1. 工程部员工Q1活动时长等级:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT emp_id, total_q1_duration,
NTILE(4) OVER (ORDER BY total_q1_duration DESC) as duration_tier
FROM (
SELECT emp_id, SUM(duration_minutes) as total_q1_duration
FROM employee_activity_hive
WHERE department = 'Engineering' AND activity_date BETWEEN '2023-01-01' AND '2023-03-31'
GROUP BY emp_id
) emp_q1_durations;
  1. 用户下一个会话开始时间:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT user_id, session_id, start_time,
LEAD(start_time, 1) OVER (PARTITION BY user_id ORDER BY start_time) as next_session_start_time
FROM web_sessions_hive;
  1. 产品区域内3日移动平均销售额:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT product_id, region, sale_date, sales_amount,
AVG(sales_amount) OVER (PARTITION BY product_id, region ORDER BY sale_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as moving_avg_3day_sales
FROM product_sales_hive;
  1. 部门最短活动类型及时长:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT department, activity_type, duration_minutes
FROM (
SELECT department, activity_type, duration_minutes,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY duration_minutes ASC, activity_type ASC) as rn -- activity_type ASC for tie-breaking
FROM employee_activity_hive
) ranked_activities
WHERE rn = 1;
  1. 销售额占产品区域总额百分比:
代码语言:javascript
代码运行次数:0
运行
复制
SELECT product_id, region, sale_date, sales_amount,
sales_amount * 100.0 / SUM(sales_amount) OVER (PARTITION BY product_id, region) as percentage_of_region_sales
FROM product_sales_hive;
本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
原始发表:2025-09-28,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 思维导图
  • 一、核心语法回顾
  • 二、排名函数 (Ranking Functions) 用法与示例
  • 三、分析函数 (Analytic Functions) / 值函数 (Value Functions) 用法与示例
  • 四、聚合函数作为窗口函数 (Aggregate Functions as Window Functions) 用法与示例
  • 五、窗口帧 (Window Frame Clause) 的详细用法与示例
  • 练习题
  • 答案:
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档