职男说: An Explanation of PG and PGP in Ceph

In Ceph there are two concepts that are easy to confuse:

PG (Placement Group)

PGP (Placement Group for Placement)

Data (objects) are ultimately written to the actual physical disks, the OSDs. The PG sits in between as a logical grouping, similar to the tablespace concept in a database: objects are grouped into PGs, and each PG is mapped to a set of OSDs.

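On a live cluster this object → PG → OSD chain can be inspected directly with ceph osd map, which takes a pool name and an object name (the placeholders below stand for whichever pool and object you want to look up):

# ceph osd map <pool> <object-name>

The command reports the PG that the object hashes into and the up/acting OSD set that PG currently maps to, without reading any data.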
Now look at a description of PGP:

PGP is Placement Group for Placement purpose, which should be kept equal to the total number of placement groups (pg_num). For a Ceph pool, if you increase the number of placement groups, that is, pg_num, you should also increase pgp_num to the same integer value as pg_num so that the cluster can start rebalancing. The undercover rebalancing mechanism can be understood in the following way. The pg_num value defines the number of placement groups, which are mapped to OSDs. When pg_num is increased for any pool, every PG of this pool splits into half, but they all remain mapped to their parent OSD. Until this time, Ceph does not start rebalancing. Now, when you increase the pgp_num value for the same pool, PGs start to migrate from the parent to some other OSD, and cluster rebalancing starts. In this way, PGP plays an important role in cluster rebalancing.

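Both values can be read back per pool at any time, which is handy when following along with the experiment below (substitute your own pool name):

# ceph osd pool get <pool> pg_num
# ceph osd pool get <pool> pgp_num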
To understand what this passage means in practice, we can run an experiment. The test environment consists of two servers with four OSDs each. Given the limits of the test environment, the three replicas cannot be spread evenly across three physical machines, so the eight OSDs are grouped logically into three hosts via crush_location:

[osd.0]
crush_location = host=node1 root=default
[osd.1]
crush_location = host=node2 root=default
[osd.2]
crush_location = host=node3 root=default
[osd.3]
crush_location = host=node3 root=default
[osd.4]
crush_location = host=node1 root=default
[osd.5]
crush_location = host=node2 root=default
[osd.6]
crush_location = host=node3 root=default
[osd.7]
crush_location = host=node2 root=default

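With these ceph.conf entries in place, the resulting host buckets can be checked straight from the CRUSH tree (assuming the OSDs have been restarted so they re-register their location, which is the default behaviour with osd crush update on start enabled):

# ceph osd tree

The tree should show osd.0 and osd.4 under node1, osd.1, osd.5 and osd.7 under node2, and osd.2, osd.3 and osd.6 under node3, matching the crush_location entries above.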
With this grouping, each PG will necessarily end up with one OSD from each of the following groups:

node1 -- [osd.0, osd.4]
node2 -- [osd.1, osd.5, osd.7]
node3 -- [osd.2, osd.3, osd.6]

Next, create a pool for testing, with pg_num set to 6 and a replica count of 3:

# ceph osd pool create testpool 6 6
# ceph osd pool set testpool size 3

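To double-check that the pool was created with the intended values, the pool list can be dumped in detail (just a sanity check):

# ceph osd pool ls detail

The line for testpool should show size 3, pg_num 6 and pgp_num 6.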
Check the status of the PGs, keeping only the PG id, the object count and the OSD set (the awk field numbers depend on the column layout of ceph pg dump and may differ between Ceph releases):

# ceph pg dump pgs | awk '{print $1,$2,$15}'

PG_STAT OBJECTS UP_PRIMARY
1.5 0 [5,6,0]
1.4 0 [3,7,0]
1.3 0 [3,1,0]
1.2 0 [7,4,3]
1.1 0 [5,6,4]
1.0 0 [7,0,3]

There are currently 6 PGs, each with 0 objects, and each is mapped to three OSDs. The OSD set in the brackets is always a combination of one OSD from each of the node1, node2 and node3 groups defined above.

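A single PG's placement can also be queried on its own, using one of the PG ids from the dump:

# ceph pg map 1.0

This prints the up and acting OSD sets for PG 1.0 and should agree with the corresponding row above.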
Write some data into the pool:

# rados -p testpool bench 10 write --no-cleanup

sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 279 263 1051.94 1052 0.0653432 0.0591477
2 16 554 538 1075.89 1100 0.0913849 0.0582301
3 16 808 792 1055.87 1016 0.0937684 0.0596275
4 16 1048 1032 1031.88 960 0.109827 0.0610142
5 16 1323 1307 1045.47 1100 0.0403374 0.0607608

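Before dumping the PG table again, the total written to the pool can be cross-checked at the pool level (purely a sanity check; the pool's object count should roughly equal the sum of the OBJECTS column below):

# ceph df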
# ceph pg dump pgs | awk '{print $1,$2,$15}'

PG_STAT OBJECTS UP_PRIMARY
1.5 624 [5,6,0]
1.4 610 [3,7,0]
1.3 1252 [3,1,0]
1.2 1354 [7,4,3]
1.1 625 [5,6,4]
1.0 631 [7,0,3]

In this result the PG ids and the OSDs they map to are unchanged, but data has been written (the OBJECTS column is no longer 0).

Now add two more PGs and observe the result:

# ceph osd pool set testpool pg_num 8

PG_STAT OBJECTS UP_PRIMARY
1.7 626 [3,1,0]
1.6 677 [7,4,3]
1.5 624 [5,6,0]
1.4 610 [3,7,0]
1.3 626 [3,1,0]   626+626 = 1252 (same as the original 1.3)
1.2 677 [7,4,3]   677+677 = 1354 (same as the original 1.2)
1.1 625 [5,6,4]
1.0 631 [7,0,3]

Two new PGs (1.6 and 1.7) have appeared. 1.6 was split off from 1.2: their object counts add up to exactly what 1.2 held before, yet the OSD set has not changed and is still [osd.7, osd.4, osd.3]. The same holds for 1.7, which was split off from 1.3.

Conclusion 1: when pg_num is increased, Ceph does not pull random objects out of each existing PG into the new PGs; instead it splits individual PGs to produce the new ones. In the example above only 2 of the original 6 PGs were split, and the other 4 kept their objects unchanged:

1.2 --> 1.6
1.3 --> 1.7

Splitting in place like this effectively avoids the performance problems that large-scale data migration would otherwise cause.

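One way to convince yourself that the split alone moves no data is to keep the cluster status on screen while pg_num is being changed (a rough sketch; watch is assumed to be installed on the admin node):

# watch -n 1 ceph -s

At this point the status should not report misplaced objects or recovery traffic for the pool; that only begins once pgp_num is raised in the next step.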
Next, change pgp_num so that both PGP and PG are 8:

# ceph osd pool set testpool pgp_num 8

# ceph pg dump pgs | awk '{print $1,$2,$15}'

PG_STAT OBJECTS UP_PRIMARY
1.7 626 [1,3,0]
1.6 677 [6,7,0]
1.5 624 [5,6,0]
1.4 610 [3,7,0]
1.3 626 [3,1,0]
1.2 677 [7,4,3]
1.1 625 [5,6,4]
1.0 631 [7,0,3]

Looking at 1.6 again, its object count has not changed, but its OSD placement has: the data has migrated from [7,4,3] to [6,7,0].

Conclusion 2: only when pgp_num changes does Ceph begin the real data rebalancing.

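The migration triggered by the pgp_num change can be followed with the cluster status, or with a brief per-PG dump (both are standard commands; the grep pattern assumes the pool id is 1, as in the output above):

# ceph -s
# ceph pg dump pgs_brief | grep '^1\.'

While PGs are moving, states such as remapped or backfilling will appear, and ceph -s will report misplaced objects until the rebalance finishes.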
Original link: http://kuaibao.qq.com/s/20180316G1416500?refer=cp_1026