我有一个大表,它有100k行,主键的数据类型是NUMBER。在此列中填充数据的方式是使用随机数生成器。
所以我的问题是,有没有可能有一个SQL查询,可以帮助我用值的范围均匀地对表进行分区。例如:如果我的列值如下:
1
2
3
4
5
6
7
8
9
10
我想把它分成三个分区,然后我会得到这样的输出:
Range 1 1-3
Range 2 4-7
Range 3 8-10
发布于 2018-04-15 16:54:16
听起来您需要WIDTH_BUCKET()
函数。Find out more.
此查询将根据id
给出一个表的起始和结束范围,该表包含1250行,分为20个存储桶
with bkt as (
select id
, width_bucket(id, 1, 1251, 20) as id_bucket
from t23
)
select id_bucket
, min(id) as bkt_start
, max(id) as bkt_end
, count(*)
from bkt
group by id_bucket
order by 1
;
中间的两个参数指定最小和最大值;最后一个参数指定存储桶的数量。输出是最小和最大弓之间的行,尽可能均匀地分割到指定数量的桶中。小心使用min和max参数;我发现选择不当的边界可能会对拆分产生奇怪的影响。
发布于 2018-04-15 17:12:53
此解决方案不需要width_bucket
函数即可工作。虽然它更冗长,效率也更低,但它会尽可能平均地拆分数据,即使缺少一些ID值。
CREATE TABLE t AS
SELECT rownum AS id
FROM dual
CONNECT BY level <= 10;
WITH
data AS (
SELECT id, rownum as row_num
FROM t
),
total AS (
SELECT count(*) AS total_rows
FROM data
),
parts AS (
SELECT rownum as part_no, total.total_rows, total.total_rows / 3 as part_rows
FROM dual, total
CONNECT BY level <= 3
),
bounds AS (
SELECT parts.part_no,
parts.total_rows,
parts.part_rows,
COALESCE(LAG(data.row_num) OVER (ORDER BY parts.part_no) + 1, 1) AS start_row_num,
data.row_num AS end_row_num
FROM data
JOIN parts
ON data.row_num = ROUND(parts.part_no * parts.part_rows, 0)
)
SELECT bounds.part_no, d1.ID AS start_id, d2.ID AS end_id
FROM bounds
JOIN data d1
ON d1.row_num = bounds.start_row_num
JOIN data d2
ON d2.row_num = bounds.end_row_num
ORDER BY bounds.part_no;
PART_NO START_ID END_ID
---------- ---------- ----------
1 1 3
2 4 7
3 8 10
https://stackoverflow.com/questions/49838659
复制相似问题