Working with Hive Bucketed Tables

Published 2020-06-05 | Updated 2020-06-05 | Hive

Create a bucketed table:

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';
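The third line (into 4 buckets) sets how many buckets the table's data is split into. Rows are assigned to a bucket by hashing the clustered by column, here id.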
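
As a quick sanity check, the table metadata can be inspected once the table exists. This is only a sketch: the output normally includes a Num Buckets field and a Bucket Columns field, but the exact layout can differ between Hive versions.

```sql
-- Inspect the table definition; look for "Num Buckets" and "Bucket Columns"
-- to confirm that the clustered by ... into 4 buckets clause took effect.
desc formatted stu_buck;
```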

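Before inserting, the following properties need to be set. hive.enforce.bucketing tells Hive to honor the bucket definition when writing, and mapreduce.job.reduces=-1 lets Hive pick the number of reducers itself so that it matches the number of buckets. (From Hive 2.0 onward bucketing is always enforced, so the first setting is only needed on older versions.)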

hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1;

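To load data into a bucketed table, you also have to create an ordinary table and import from it with a subquery (insert ... select). A plain load data only copies the file into the table directory and does not distribute the rows into buckets.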

(1) Create an ordinary table with the same columns as the bucketed table

create table stu(id int, name string)
row format delimited fields terminated by '\t';

(2) Load data into the ordinary stu table

load data local inpath '/opt/module/datas/student.txt' into table stu;

Once the ordinary table has data, insert that data into the bucketed table:

insert into table stu_buck  select id, name from stu;
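
To see how the rows were actually spread out, the two queries below are a minimal sketch. The first uses Hive's INPUT__FILE__NAME virtual column to count rows per underlying file, where each file corresponds to one bucket. The second predicts a bucket index with pmod(hash(id), 4); this assumes the table uses the same hash as the hash() function, which may not hold on newer Hive versions that use a different bucketing hash, so treat its result as illustrative. The aliases file_name, row_cnt, and predicted_bucket are names chosen here.

```sql
-- Rows per bucket file; after a single insert, a 4-bucket table
-- should show up to 4 distinct files (empty buckets do not appear).
select INPUT__FILE__NAME as file_name, count(*) as row_cnt
from stu_buck
group by INPUT__FILE__NAME;

-- Predicted 0-based bucket index for each row of the source table.
select id, name, pmod(hash(id), 4) as predicted_bucket
from stu;
```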

Sampling a bucketed table:

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);

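Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname]).
Explanation: x is the bucket to start from (counting from 1), and y determines how much of the table is sampled: the number of buckets read is (total number of buckets) / y. The create statement's into 4 buckets gives the table 4 buckets and y is also 4, so 4 / 4 = 1 bucket is read. The query above therefore returns the data of exactly one bucket, the first one. Note that x must not be greater than y.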
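
The same rule can be tried with other values of x and y. The queries below are a sketch against the 4-bucket stu_buck table from above; the comments describe which buckets the rule predicts, not verified output.

```sql
-- y = 2: 4 / 2 = 2 buckets are read, starting at bucket 1 and stepping by y,
-- i.e. buckets 1 and 3.
select * from stu_buck tablesample(bucket 1 out of 2 on id);

-- x = 3, y = 4: 4 / 4 = 1 bucket is read, namely bucket 3.
select * from stu_buck tablesample(bucket 3 out of 4 on id);

-- y = 8: 4 / 8 = 1/2, so roughly half of bucket 1 is read.
select * from stu_buck tablesample(bucket 1 out of 8 on id);
```

Choosing y as a factor or a multiple of the bucket count keeps the sample aligned with the bucket files, so Hive can read only the matching buckets.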
