前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >What are the types of sampling in hive ?

What are the types of sampling in hive ?

作者头像
一个会写诗的程序员
发布2021-12-16 11:02:32
3680
发布2021-12-16 11:02:32
举报

This recipe explains what are the types of sampling in hive.

Recipe Objective

In big data scenarios , when data volume is huge, we may need to find a subset of data to speed up data analysis. Here comes a technique to select and analyze a subset of data in order to identify patterns and trends in the data known as sampling.

代码语言:javascript
复制
## System requirements :

*   Install ubuntu in the virtual machine [click here](https://d35mpxyw7m7k7g.cloudfront.net/bigdata_30/Download_and_install_VM_Ubuntu_ISO.pdf)
*   Install single node hadoop machine [click here](https://d35mpxyw7m7k7g.cloudfront.net/bigdata_30/Install_of_Single_Node_Hadoop_Machine_in_Ubuntu_VM.pdf)
*   Install apache hive [click here](https://d35mpxyw7m7k7g.cloudfront.net/bigdata_30/Installation_of_HIVE.pdf)

## Step 1 : Prepare the dataset

Here we are using the employee related comma separated values (csv) dataset for the create hive table in local.

Data of Output looks as follows:

Before create a table open the hive shell and we need to create a database as follows :Open the hive shell as below

To create database using below queries :

Create database dezyre_db; use dezyre_db;

As follows below:

Step 2 : Create a Hive Table and Load the Data into the Table and verify the Data

Here we are going create a hive table for loading the data from this table to created bucketed tables, Use below to create a hive table:

CREATE TABLE employee ( employee_id int, company_id int, seniority int, salary int, join_date string, quit_date string, dept string ) ROW FORMAT DELIMITED fields terminated by ',' TBLPROPERTIES ("skip.header.line.count"="1");

the above query runs as follows :

Loading the data into hive table and verifying the data

load data local inpath '/home/bigdata/Downloads/empdata.csv' into table employee;

Verifying the data by running the select query as follows

代码语言:javascript
复制
                                                                                                                                                                                                     ![禅与计算机程序设计艺术](https://upload-images.jianshu.io/upload_images/1233356-9c7ce4644719efc6.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

In Hive, there are three ways of sampling data: Random sampling, Bucket table sampling, and Block sampling.

Step 3 : Sampling using Random function

Random sampling uses the RAND() function and LIMIT keyword to get the sampling of data as shown in the following example. The DISTRIBUTE and SORT keywords are used here to make sure the data is also randomly distributed among mappers and reducers efficiently. The ORDER BY RAND() statement can also achieve the same purpose, but the performance is not good.

代码语言:javascript
复制
                                                                                                                                                                                                     Here we are going run an example query using random funtion on the hive table as follows

                                                                                                                                                                                                     `hive>SELECT * FROM employee DISTRIBUTE BY RAND() SORT BY RAND() LIMIT 10;`

                                                                                                                                                                                                     Output of the above query :

                                                                                                                                                                                                     ![禅与计算机程序设计艺术](https://upload-images.jianshu.io/upload_images/1233356-187efb2b41dd3901.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

                                                                                                                                                                                                     ## Step 4: Create a Bucketed table

Here we are going to create bucketed table bucket with "clustered by" is as follows

CREATE TABLE emp_bucketed_tbl_only ( employee_id int, company_id int, dept string, seniority int, salary int, join_date string, quit_date string ) CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS;

the above query runs as follows

Step 5 : Load data into Bucketed table

Here we are going to insert data into the bucketed table which is without any partition:

INSERT OVERWRITE TABLE emp_bucketed_tbl_only SELECT * FROM employee;

When you load the data into the table i will performs map reduce job in the background as looks below

Output of the above query as follows:

Step 6 : Sampling using bucketing

Bucket table sampling is a special sampling optimized for bucket tables as shown in the following syntax and example. The colname value specifies the column where to sample the data. The RAND() function can also be used when sampling is on the entire rows. If the sample column is also the CLUSTERED BY column, the TABLESAMPLE statement will be more Efficient.

The below query will fetch data from the 1st bucket

hive> select employee_id, company_id,seniority,dept from emp_bucketed_tbl_only TABLESAMPLE(BUCKET 1 OUT OF 4 ON company_id);

Output of the above query :

We can specify the company_id column for sampling.The following query will produce the same result as the above query

hive> select employee_id, company_id,seniority,dept from emp_bucketed_tbl_only TABLESAMPLE(BUCKET 1 OUT OF 4 ON company_id);

Output of the above query :

Step 7 : Block sampling in hive

Block sampling allows Hive to randomly pick up N rows of data, percentage (n percentage) of data size, or N byte size of data. The sampling granularity is the HDFS block size.

Using percent :

hive>select * from emp_bucketed_tbl_only TABLESAMPLE(20 PERCENT);

Output of the above query :

Using Rows :

hive>select * from emp_bucketed_tbl_only TABLESAMPLE(2 ROWS);

Output of the above query :

本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
原始发表:2021/9/9 下午,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • Recipe Objective
  • Step 2 : Create a Hive Table and Load the Data into the Table and verify the Data
  • Step 3 : Sampling using Random function
  • Step 5 : Load data into Bucketed table
  • Step 6 : Sampling using bucketing
  • Step 7 : Block sampling in hive
相关产品与服务
数据库
云数据库为企业提供了完善的关系型数据库、非关系型数据库、分析型数据库和数据库生态工具。您可以通过产品选择和组合搭建,轻松实现高可靠、高可用性、高性能等数据库需求。云数据库服务也可大幅减少您的运维工作量,更专注于业务发展,让企业一站式享受数据上云及分布式架构的技术红利!
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档