HDFS Heterogeneous Storage

Requirements

Hadoop has supported heterogeneous storage since version 2.4. It addresses the storage demands created by the explosive growth of data volumes and compute capacity: once hot data has been processed and has produced new data, the original data often turns cold, and as data keeps accumulating, tiered storage becomes urgent. Hot data that is frequently read or computed on needs to live on fast storage devices to guarantee performance, while data that has gone cold and is rarely used becomes archival data, which can be kept on high-capacity, lower-performance devices to cut storage costs. HDFS can place data on the appropriate tier according to configurable rules.

Storage Types & Storage Policies

Storage Types

  • RAM_DISK - memory-backed (RAM disk) file system
  • SSD - solid-state drive
  • DISK - ordinary spinning disk
  • ARCHIVE - archival storage
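For reference, a DataNode's volumes are tagged with one of these storage types by prefixing each directory in `dfs.datanode.data.dir` in hdfs-site.xml; untagged directories default to DISK. A minimal sketch of the syntax; the paths below are placeholders, not this cluster's real directories:

```xml
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- one entry per volume; the [TYPE] prefix declares its storage type -->
  <value>[SSD]file:///data/dn-ssd,[ARCHIVE]file:///data/dn-archive</value>
</property>
```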

Storage Policies

| Policy ID | Policy Name   | Block Placement (n replicas) | creationFallbacks | replicationFallbacks |
| --------- | ------------- | ---------------------------- | ----------------- | -------------------- |
| 15        | Lazy_Persist  | RAM_DISK: 1, DISK: n-1       | DISK              | DISK                 |
| 12        | All_SSD       | SSD: n                       | DISK              | DISK                 |
| 10        | One_SSD       | SSD: 1, DISK: n-1            | SSD, DISK         | SSD, DISK            |
| 7         | Hot (default) | DISK: n                      | < none >          | ARCHIVE              |
| 5         | Warm          | DISK: 1, ARCHIVE: n-1        | ARCHIVE, DISK     | ARCHIVE, DISK        |
| 2         | Cold          | ARCHIVE: n                   | < none >          | < none >             |

The policy names, from Lazy_Persist down to Cold, correspond to device access speed from fastest to slowest: the memory-backed file system is fastest, then SSD, then ordinary disk, and finally archival storage. These policies let us control where data is placed and thereby reduce storage cost.

creationFallbacks

The storage types to fall back to when the first replica of a block cannot be created on the preferred storage type.

replicationFallbacks

The storage types to fall back to for the remaining replicas of a block. A "fallback" occurs when the preferred storage type is unavailable: HDFS then steps down and uses the listed storage types in order.
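The fallback rule can be modeled with a short sketch. This is illustrative only, not the actual `BlockStoragePolicy` implementation in HDFS: for each replica, the preferred storage type is tried first, then each fallback in order.

```python
def choose_storage(preferred, fallbacks, available):
    """Pick the storage type for one replica: the preferred type if the
    cluster has usable space of that type, otherwise the first usable
    fallback. Returns None when nothing fits, in which case the write
    fails. Simplified model of HDFS's BlockStoragePolicy behavior."""
    for storage_type in [preferred] + list(fallbacks):
        if storage_type in available:
            return storage_type
    return None

# Warm wants a replica on ARCHIVE; before any ARCHIVE nodes join the
# cluster, it falls back to DISK:
print(choose_storage("ARCHIVE", ["DISK"], {"DISK", "SSD"}))  # DISK

# Cold has no fallbacks, so writing to a Cold directory before ARCHIVE
# nodes exist fails outright:
print(choose_storage("ARCHIVE", [], {"DISK", "SSD"}))        # None
```

The second call mirrors what the test below observes: writes into /cold throw an exception until ARCHIVE-typed DataNodes join the cluster.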

Validation in a Test Environment

Environment Setup

  • Replication factor: 2
  • DataNode information

| DataNode      | Storage Medium | Initial Capacity | Storage Type Set in HDFS |
| ------------- | -------------- | ---------------- | ------------------------ |
| 100.67.57.220 | SSD            | 100G             | DISK                     |
| 100.67.57.221 | SSD            | 100G             | DISK                     |
| 100.67.57.222 | SSD            | 100G             | DISK                     |
| 10.108.100.24 | Spinning disk  | 100G             | ARCHIVE                  |
| 10.108.100.71 | Spinning disk  | 100G             | ARCHIVE                  |

The initial cluster has only the three storage nodes 220, 221 and 222. Their storage type is not explicitly set, so it defaults to DISK (the physical disks are actually SSDs). Nodes 24 and 71 are newly added nodes whose physical disks are spinning disks; their storage type in HDFS is set to ARCHIVE.

  • Initial file information

bin/hadoop fs -ls / |awk '{print $8}'|xargs bin/hadoop fs -du -s -h

Under the default HDFS policy there is a /hot directory containing 1 GB of files.

  • Initial block distribution
[hadoop@100 /usr/local/service/40028/hadoop]$ bin/hdfs fsck /hot  -files -blocks -locations 

0. BP-983125464-100.67.159.132-1474351508701:blk_1073742694_1878 len=67108864 repl=2 [100.67.57.222:4028, 100.67.57.220:4028]
1. BP-983125464-100.67.159.132-1474351508701:blk_1073742695_1879 len=67108864 repl=2 [100.67.57.222:4028, 100.67.57.221:4028]
2. BP-983125464-100.67.159.132-1474351508701:blk_1073742696_1880 len=67108864 repl=2 [100.67.57.222:4028, 100.67.57.221:4028]
3. BP-983125464-100.67.159.132-1474351508701:blk_1073742697_1881 len=67108864 repl=2 [100.67.57.221:4028, 100.67.57.222:4028]
4. BP-983125464-100.67.159.132-1474351508701:blk_1073742698_1882 len=67108864 repl=2 [100.67.57.222:4028, 100.67.57.221:4028]
5. BP-983125464-100.67.159.132-1474351508701:blk_1073742699_1883 len=67108864 repl=2 [100.67.57.221:4028, 100.67.57.222:4028]
6. BP-983125464-100.67.159.132-1474351508701:blk_1073742700_1884 len=67108864 repl=2 [100.67.57.222:4028, 100.67.57.220:4028]
7. BP-983125464-100.67.159.132-1474351508701:blk_1073742701_1885 len=67108864 repl=2 [100.67.57.222:4028, 100.67.57.221:4028]
8. BP-983125464-100.67.159.132-1474351508701:blk_1073742702_1886 len=67108864 repl=2 [100.67.57.221:4028, 100.67.57.220:4028]
9. BP-983125464-100.67.159.132-1474351508701:blk_1073742703_1887 len=67108864 repl=2 [100.67.57.220:4028, 100.67.57.221:4028]
10. BP-983125464-100.67.159.132-1474351508701:blk_1073742704_1888 len=67108864 repl=2 [100.67.57.220:4028, 100.67.57.222:4028]
11. BP-983125464-100.67.159.132-1474351508701:blk_1073742705_1889 len=67108864 repl=2 [100.67.57.222:4028, 100.67.57.220:4028]
12. BP-983125464-100.67.159.132-1474351508701:blk_1073742706_1890 len=67108864 repl=2 [100.67.57.220:4028, 100.67.57.222:4028]
13. BP-983125464-100.67.159.132-1474351508701:blk_1073742707_1891 len=67108864 repl=2 [100.67.57.222:4028, 100.67.57.220:4028]
14. BP-983125464-100.67.159.132-1474351508701:blk_1073742708_1892 len=67108864 repl=2 [100.67.57.220:4028, 100.67.57.221:4028]
15. BP-983125464-100.67.159.132-1474351508701:blk_1073742709_1893 len=67108864 repl=2 [100.67.57.220:4028, 100.67.57.221:4028]

This confirms that the 16 blocks are evenly distributed across the three storage nodes 220, 221 and 222.
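Rather than counting replicas by eye, the fsck listing can be tallied with a small helper script. This is a convenience script of ours, not part of Hadoop:

```python
import re
from collections import Counter

# The trailing "[host:port, host:port]" replica-location list on each
# block line of `hdfs fsck <path> -files -blocks -locations` output.
LOCATIONS = re.compile(r"\[([^\]]+)\]")

def replica_counts(fsck_output):
    """Count replicas per DataNode from hdfs fsck output."""
    counts = Counter()
    for line in fsck_output.splitlines():
        if "len=" not in line or "repl=" not in line:
            continue  # skip non-block lines (headers, summary)
        match = LOCATIONS.search(line)
        if match:
            for location in match.group(1).split(","):
                counts[location.strip()] += 1
    return counts
```

Feeding it the listing above should show roughly equal counts for 100.67.57.220, 100.67.57.221 and 100.67.57.222.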

  • Set a different policy per directory

The /hot directory uses the default policy, so no change is needed.

Set the policy of the /warm directory to Warm:

[hadoop@100 /usr/local/service/40028/hadoop]$ bin/hdfs dfsadmin -setStoragePolicy /warm Warm
Set storage policy Warm on /warm

Set the policy of the /cold directory to Cold:

[hadoop@100 /usr/local/service/40028/hadoop]$ bin/hdfs dfsadmin -setStoragePolicy /cold Cold
Set storage policy Cold on /cold

At this point, before any DataNodes of storage type ARCHIVE have joined the cluster, writing data into the /cold directory throws an exception.

Cooling Data from Hot to Warm

Check the storage policy of the /warm directory:

[hadoop@100 /usr/local/service/40028/hadoop]$ bin/hdfs dfsadmin -getStoragePolicy /warm
The storage policy of /warm:
BlockStoragePolicy{WARM:5, storageTypes=[DISK, ARCHIVE], creationFallbacks=[DISK, ARCHIVE], replicationFallbacks=[DISK, ARCHIVE]}

Move the data into the /warm directory:

[hadoop@100 /usr/local/service/40028/hadoop]$ bin/hadoop fs -ls /hot /warm
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2016-09-27 14:29 /warm/data

Run the mover:

[hadoop@100 /usr/local/service/40028/hadoop]$ bin/hdfs mover /warm  /hot
16/09/28 10:24:20 INFO mover.Mover: namenodes = {hdfs://HDFS40028=null}
16/09/28 10:24:21 INFO net.NetworkTopology: Adding a new node: /default-rack/100.67.57.220:4028
16/09/28 10:24:21 INFO net.NetworkTopology: Adding a new node: /default-rack/100.67.57.222:4028
16/09/28 10:24:21 INFO net.NetworkTopology: Adding a new node: /default-rack/10.108.100.24:4000
16/09/28 10:24:21 INFO net.NetworkTopology: Adding a new node: /default-rack/100.67.57.221:4028
16/09/28 10:24:21 INFO net.NetworkTopology: Adding a new node: /default-rack/10.108.100.71:4000
16/09/28 10:24:46 INFO balancer.Dispatcher: Successfully moved blk_1073742694_1878 with size=67108864 from 100.67.57.220:4028:DISK to 10.108.100.24:4000:ARCHIVE through 100.67.57.220:4028
16/09/28 10:24:50 INFO balancer.Dispatcher: Successfully moved blk_1073742703_1887 with size=67108864 from 100.67.57.220:4028:DISK to 10.108.100.71:4000:ARCHIVE through 100.67.57.220:4028
16/09/28 10:24:50 INFO balancer.Dispatcher: Successfully moved blk_1073742702_1886 with size=67108864 from 100.67.57.221:4028:DISK to 10.108.100.71:4000:ARCHIVE through 100.67.57.221:4028
16/09/28 10:24:50 INFO balancer.Dispatcher: Successfully moved blk_1073742700_1884 with size=67108864 from 100.67.57.222:4028:DISK to 10.108.100.71:4000:ARCHIVE through 100.67.57.222:4028
16/09/28 10:24:52 INFO balancer.Dispatcher: Successfully moved blk_1073742697_1881 with size=67108864 from 100.67.57.222:4028:DISK to 10.108.100.24:4000:ARCHIVE through 100.67.57.222:4028
16/09/28 10:24:52 INFO balancer.Dispatcher: Successfully moved blk_1073742701_1885 with size=67108864 from 100.67.57.221:4028:DISK to 10.108.100.71:4000:ARCHIVE through 100.67.57.221:4028
16/09/28 10:24:52 INFO balancer.Dispatcher: Successfully moved blk_1073742698_1882 with size=67108864 from 100.67.57.221:4028:DISK to 10.108.100.24:4000:ARCHIVE through 100.67.57.221:4028
16/09/28 10:24:52 INFO balancer.Dispatcher: Successfully moved blk_1073742696_1880 with size=67108864 from 100.67.57.222:4028:DISK to 10.108.100.24:4000:ARCHIVE through 100.67.57.222:4028
16/09/28 10:24:53 INFO balancer.Dispatcher: Successfully moved blk_1073742695_1879 with size=67108864 from 100.67.57.221:4028:DISK to 10.108.100.24:4000:ARCHIVE through 100.67.57.221:4028
16/09/28 10:24:53 INFO balancer.Dispatcher: Successfully moved blk_1073742699_1883 with size=67108864 from 100.67.57.221:4028:DISK to 10.108.100.71:4000:ARCHIVE through 100.67.57.221:4028
16/09/28 10:25:21 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 60116ms (threshold=30000ms); ack: seqno: 1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 146874, targets: [100.67.57.221:4028, 100.67.57.220:4028]
Sep 28, 2016 10:25:29 AM Mover took 1mins, 8sec

Check the block distribution:

bin/hdfs fsck /warm  -files -blocks -locations

0. BP-983125464-100.67.159.132-1474351508701:blk_1073742694_1878 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.24:4000]
1. BP-983125464-100.67.159.132-1474351508701:blk_1073742695_1879 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.24:4000]
2. BP-983125464-100.67.159.132-1474351508701:blk_1073742696_1880 len=67108864 repl=2 [10.108.100.24:4000, 100.67.57.221:4028]
3. BP-983125464-100.67.159.132-1474351508701:blk_1073742697_1881 len=67108864 repl=2 [100.67.57.221:4028, 10.108.100.24:4000]
4. BP-983125464-100.67.159.132-1474351508701:blk_1073742698_1882 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.24:4000]
5. BP-983125464-100.67.159.132-1474351508701:blk_1073742699_1883 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.222:4028]
6. BP-983125464-100.67.159.132-1474351508701:blk_1073742700_1884 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.220:4028]
7. BP-983125464-100.67.159.132-1474351508701:blk_1073742701_1885 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.71:4000]
8. BP-983125464-100.67.159.132-1474351508701:blk_1073742702_1886 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.220:4028]
9. BP-983125464-100.67.159.132-1474351508701:blk_1073742703_1887 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.221:4028]
10. BP-983125464-100.67.159.132-1474351508701:blk_1073742704_1888 len=67108864 repl=2 [100.67.57.220:4028, 100.67.57.222:4028, 10.108.100.24:4000]
11. BP-983125464-100.67.159.132-1474351508701:blk_1073742705_1889 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.24:4000]
12. BP-983125464-100.67.159.132-1474351508701:blk_1073742706_1890 len=67108864 repl=2 [100.67.57.220:4028, 100.67.57.222:4028, 10.108.100.24:4000]
13. BP-983125464-100.67.159.132-1474351508701:blk_1073742707_1891 len=67108864 repl=2 [100.67.57.222:4028, 100.67.57.220:4028, 10.108.100.24:4000]
14. BP-983125464-100.67.159.132-1474351508701:blk_1073742708_1892 len=67108864 repl=2 [100.67.57.220:4028, 10.108.100.24:4000]
15. BP-983125464-100.67.159.132-1474351508701:blk_1073742709_1893 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.221:4028]

Each block now has one replica on a DISK (SSD) node and one on an ARCHIVE (spinning-disk) node, so half of the data sits on SSD and half on ordinary disks. (A few blocks briefly show an extra replica while the old copy awaits deletion.)

Cooling Data from Warm to Cold

Check the storage policy of the /cold directory:

[hadoop@100 /usr/local/service/40028/hadoop]$ bin/hdfs dfsadmin -getStoragePolicy /cold
The storage policy of /cold:
BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], replicationFallbacks=[]}

Move the data into the /cold directory:

[hadoop@100 /usr/local/service/40028/hadoop]$ bin/hadoop fs -mv /warm/data /cold
[hadoop@100 /usr/local/service/40028/hadoop]$ bin/hadoop fs -ls /warm /cold
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2016-09-27 14:29 /cold/data

Run the mover:

bin/hdfs mover /warm /cold

Check the block distribution:

bin/hdfs fsck /cold  -files -blocks -locations

0. BP-983125464-100.67.159.132-1474351508701:blk_1073742694_1878 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
1. BP-983125464-100.67.159.132-1474351508701:blk_1073742695_1879 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
2. BP-983125464-100.67.159.132-1474351508701:blk_1073742696_1880 len=67108864 repl=2 [10.108.100.24:4000, 10.108.100.71:4000]
3. BP-983125464-100.67.159.132-1474351508701:blk_1073742697_1881 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
4. BP-983125464-100.67.159.132-1474351508701:blk_1073742698_1882 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
5. BP-983125464-100.67.159.132-1474351508701:blk_1073742699_1883 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
6. BP-983125464-100.67.159.132-1474351508701:blk_1073742700_1884 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
7. BP-983125464-100.67.159.132-1474351508701:blk_1073742701_1885 len=67108864 repl=2 [10.108.100.24:4000, 10.108.100.71:4000]
8. BP-983125464-100.67.159.132-1474351508701:blk_1073742702_1886 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
9. BP-983125464-100.67.159.132-1474351508701:blk_1073742703_1887 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
10. BP-983125464-100.67.159.132-1474351508701:blk_1073742704_1888 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
11. BP-983125464-100.67.159.132-1474351508701:blk_1073742705_1889 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
12. BP-983125464-100.67.159.132-1474351508701:blk_1073742706_1890 len=67108864 repl=2 [10.108.100.71:4000, 10.108.100.24:4000]
13. BP-983125464-100.67.159.132-1474351508701:blk_1073742707_1891 len=67108864 repl=2 [10.108.100.24:4000, 10.108.100.71:4000]
14. BP-983125464-100.67.159.132-1474351508701:blk_1073742708_1892 len=67108864 repl=2 [100.67.57.220:4028, 10.108.100.24:4000]
15. BP-983125464-100.67.159.132-1474351508701:blk_1073742709_1893 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.221:4028]

Nearly all replicas now live on the cold (ARCHIVE) nodes; blocks 14 and 15 each still show one replica on a DISK node, presumably not yet migrated when fsck ran.

Warming Data from Cold to Warm

Move the data back into the /warm directory:

bin/hadoop fs -mv /cold/data /warm

Run the mover:

bin/hdfs mover

Check the block distribution:

bin/hdfs fsck /warm  -files -blocks -locations  

0. BP-983125464-100.67.159.132-1474351508701:blk_1073742694_1878 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.220:4028]
1. BP-983125464-100.67.159.132-1474351508701:blk_1073742695_1879 len=67108864 repl=2 [100.67.57.220:4028, 10.108.100.24:4000]
2. BP-983125464-100.67.159.132-1474351508701:blk_1073742696_1880 len=67108864 repl=2 [10.108.100.24:4000, 100.67.57.220:4028]
3. BP-983125464-100.67.159.132-1474351508701:blk_1073742697_1881 len=67108864 repl=2 [100.67.57.220:4028, 10.108.100.24:4000]
4. BP-983125464-100.67.159.132-1474351508701:blk_1073742698_1882 len=67108864 repl=2 [100.67.57.220:4028, 10.108.100.24:4000]
5. BP-983125464-100.67.159.132-1474351508701:blk_1073742699_1883 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.222:4028]
6. BP-983125464-100.67.159.132-1474351508701:blk_1073742700_1884 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.24:4000]
7. BP-983125464-100.67.159.132-1474351508701:blk_1073742701_1885 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.71:4000]
8. BP-983125464-100.67.159.132-1474351508701:blk_1073742702_1886 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.222:4028]
9. BP-983125464-100.67.159.132-1474351508701:blk_1073742703_1887 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.24:4000]
10. BP-983125464-100.67.159.132-1474351508701:blk_1073742704_1888 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.222:4028]
11. BP-983125464-100.67.159.132-1474351508701:blk_1073742705_1889 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.24:4000]
12. BP-983125464-100.67.159.132-1474351508701:blk_1073742706_1890 len=67108864 repl=2 [100.67.57.222:4028, 10.108.100.24:4000]
13. BP-983125464-100.67.159.132-1474351508701:blk_1073742707_1891 len=67108864 repl=2 [10.108.100.24:4000, 100.67.57.222:4028]
14. BP-983125464-100.67.159.132-1474351508701:blk_1073742708_1892 len=67108864 repl=2 [100.67.57.220:4028, 10.108.100.24:4000]
15. BP-983125464-100.67.159.132-1474351508701:blk_1073742709_1893 len=67108864 repl=2 [10.108.100.71:4000, 100.67.57.221:4028]

Each block again has one replica on an SSD (DISK) node and one on a spinning-disk (ARCHIVE) node, as the Warm policy dictates.

Original content, published on the Tencent Cloud+ Community with the author's authorization; reproduction without permission is prohibited.
