Ceph version 12.2.8: one PG was stuck in the remapped state even though the cluster health was OK. The operations below were carried out to fix this remapped state.
[root@demohost cephuser]# ceph -s
cluster:
id: 21cc0dcd-06f3-4d5d-82c2-dbd411ef0ed9
health: HEALTH_OK
services:
mon: 3 daemons, quorum demohost-6,demohost-8,demohost-37
mgr: demohost-8(active), standbys: demohost-37, demohost-6
osd: 90 osds: 90 up, 90 in; 1 remapped pgs
rgw: 1 daemon active
data:
pools: 7 pools, 3712 pgs
objects: 1.30k objects, 1.68GiB
usage: 103GiB used, 415TiB / 415TiB avail
pgs: 3711 active+clean
1 active+clean+remapped
io:
client: 16.8KiB/s rd, 16op/s rd, 0op/s wr
Checking the details of the PG stuck in remapped: it involves OSDs 88, 48 and 18, with osd.88 as the primary; the up set is [88,48] while the acting set is [88,48,18].
[root@demohost cephuser]# ceph pg dump |grep remapped
dumped all
6.9c 0 0 0 0 0 0 0 0 active+clean+remapped 2018-09-20 11:27:59.251616 0'0 9784:16679 [88,48] 88 [88,48,18] 88 0'0 2018-09-18 23:17:18.531269 0'0 2018-09-17 23:00:02.496995 0
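The remapped flag means the up set and the acting set differ: here CRUSH currently maps the PG to only two OSDs ([88,48]), while the data is still being served from the previous three ([88,48,18]). A quicker way to look at just this mapping, instead of grepping the full dump, is the following (command only; its output is not captured here):
ceph pg map 6.9c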
Checking the OSD information showed that these OSDs had each gained an extra device class of ssd.
[root@demohost cephuser]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 90.00000 root default
......
-2 18.00000 mediagroup site1-ssd
-21 6.00000 media site1-rack1-ssd
-202 2.00000 host demohost-ssd
19 1.00000 osd.19 up 1.00000 1.00000
18 ssd 1.00000 osd.18 up 1.00000 1.00000 #SSD class
-203 2.00000 host demohost-36-ssd
28 1.00000 osd.28 up 1.00000 1.00000
29 1.00000 osd.29 up 1.00000 1.00000
-201 2.00000 host demohost-6-ssd
8 1.00000 osd.8 up 1.00000 1.00000
9 1.00000 osd.9 up 1.00000 1.00000
-22 6.00000 media site1-rack2-ssd
-205 2.00000 host demohost-40-ssd
49 1.00000 osd.49 up 1.00000 1.00000
48 ssd 1.00000 osd.48 up 1.00000 1.00000 #SSD class
-206 2.00000 host demohost-42-ssd
58 1.00000 osd.58 up 1.00000 1.00000
59 1.00000 osd.59 up 1.00000 1.00000
-204 2.00000 host demohost-8-ssd
38 1.00000 osd.38 up 1.00000 1.00000
39 1.00000 osd.39 up 1.00000 1.00000
-23 6.00000 media site1-rack3-ssd
-207 2.00000 host demohost-37-ssd
68 1.00000 osd.68 up 1.00000 1.00000
69 1.00000 osd.69 up 1.00000 1.00000
-208 2.00000 host demohost-38-ssd
78 1.00000 osd.78 up 1.00000 1.00000
79 1.00000 osd.79 up 1.00000 1.00000
-209 2.00000 host demohost-39-ssd
89 1.00000 osd.89 up 1.00000 1.00000
88 ssd 1.00000 osd.88 up 1.00000 1.00000 #SSD class
Checking the CRUSH device classes, an extra ssd class had appeared:
[root@demohost cephuser]# ceph osd crush class ls
[
"ssd"
]
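For context: once a device class exists, CRUSH maintains per-class shadow hierarchies, and any rule that selects by class only considers OSDs carrying that class, so a class that shows up on only a few OSDs is worth investigating. To list which OSDs carry the class and whether any rule references it, something like the following can be used (commands only, output omitted):
ceph osd crush class ls-osd ssd
ceph osd crush rule ls
ceph osd crush rule dump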
So I manually removed the device class from the affected OSDs and then restarted them, but after the restart the ssd class kept being re-added.
[root@demohost cephuser]# ceph osd crush rm-device-class 48
done removing class of osd(s): 48
[root@demohost cephuser]# ceph osd crush rm-device-class 88
done removing class of osd(s): 88
[root@demohost cephuser]# ceph osd crush rm-device-class 18
done removing class of osd(s): 18
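The restart that keeps bringing the class back is just a normal OSD service restart; assuming systemd-managed OSDs, it amounts to something like this on the OSD's host, followed by a quick check of the CLASS column:
systemctl restart ceph-osd@18
ceph osd tree | grep 'osd.18 '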
Looking into the options introduced in Luminous, I found a setting that automatically tags the OSD's device class on startup, so I disabled it in ceph.conf.
#ceph.conf
osd_class_update_on_start = false
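Once an OSD has been restarted with this setting in place, you can confirm the running daemon actually picked it up via its admin socket (run on the host where the OSD lives; a sketch assuming the default admin socket setup):
ceph daemon osd.18 config get osd_class_update_on_start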
Then I restarted OSD 18. The ssd class was no longer added back automatically, but the remapped state turned into undersized instead, which was really odd.
[root@demohost cephuser]# ceph -s
cluster:
id: 21cc0dcd-06f3-4d5d-82c2-dbd411ef0ed9
health: HEALTH_WARN
Degraded data redundancy: 1 pg undersized
services:
mon: 3 daemons, quorum demohost-6,demohost-8,demohost-37
mgr: demohost-8(active), standbys: demohost-37, demohost-6
osd: 90 osds: 90 up, 90 in
rgw: 1 daemon active
data:
pools: 7 pools, 3712 pgs
objects: 1.30k objects, 1.68GiB
usage: 103GiB used, 415TiB / 415TiB avail
pgs: 3711 active+clean
1 active+undersized
io:
client: 17.0KiB/s rd, 17op/s rd, 0op/s wr
[root@demohost cephuser]# ceph health detail
HEALTH_WARN Degraded data redundancy: 1 pg undersized
PG_DEGRADED Degraded data redundancy: 1 pg undersized
pg 6.9c is stuck undersized for 206.423798, current state active+undersized, last acting [88,48]
Since briefly restarting an OSD does not trigger a recalculation of the crushmap, I stopped the primary OSD 88 service first and looked for a way to make the cluster actively recompute the crushmap.
First, stop OSD 88.
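The stop itself was not captured above; assuming systemd-managed OSDs, it amounts to the following on the node hosting osd.88:
systemctl stop ceph-osd@88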
[root@demohost-40 supdev]# ceph -s
cluster:
id: 21cc0dcd-06f3-4d5d-82c2-dbd411ef0ed9
health: HEALTH_WARN
1 osds down
Degraded data redundancy: 10/3897 objects degraded (0.257%), 8 pgs degraded
services:
mon: 3 daemons, quorum demohost-6,demohost-8,demohost-37
mgr: demohost-8(active), standbys: demohost-37, demohost-6
osd: 90 osds: 89 up, 90 in; 1 remapped pgs
rgw: 1 daemon active
data:
pools: 7 pools, 3712 pgs
objects: 1.30k objects, 1.68GiB
usage: 103GiB used, 415TiB / 415TiB avail
pgs: 10/3897 objects degraded (0.257%)
3670 active+clean
33 active+undersized
8 active+undersized+degraded
1 active+undersized+remapped
io:
client: 24.5KiB/s rd, 24op/s rd, 0op/s wr
recovery: 0B/s, 2objects/s
Take OSD 88 out to trigger a recalculation of the crushmap:
[root@demohost-40 supdev]# ceph osd out 88
marked out osd.88.
[root@demohost-40 supdev]# ceph -s
cluster:
id: 21cc0dcd-06f3-4d5d-82c2-dbd411ef0ed9
health: HEALTH_OK
services:
mon: 3 daemons, quorum demohost-6,demohost-8,demohost-37
mgr: demohost-8(active), standbys: demohost-37, demohost-6
osd: 90 osds: 89 up, 89 in; 1 remapped pgs
rgw: 1 daemon active
data:
pools: 7 pools, 3712 pgs
objects: 1.30k objects, 1.68GiB
usage: 102GiB used, 414TiB / 414TiB avail
pgs: 3712 active+clean
io:
client: 8.92KiB/s rd, 8op/s rd, 0op/s wr
recovery: 0B/s, 0keys/s, 0objects/s
Then I started OSD 88 again and brought it back in, which finally completed the repair of the abnormal PG.
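As with the stop earlier, the daemon start is not captured; with systemd it would be, on the same node:
systemctl start ceph-osd@88
Once the OSD is up again, it is marked back in: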
[root@demohost-40 supdev]# ceph osd in 88
marked in osd.88.
[root@demohost-40 supdev]# ceph -s
cluster:
id: 21cc0dcd-06f3-4d5d-82c2-dbd411ef0ed9
health: HEALTH_OK
services:
mon: 3 daemons, quorum demohost-6,demohost-8,demohost-37
mgr: demohost-8(active), standbys: demohost-37, demohost-6
osd: 90 osds: 90 up, 90 in; 1 remapped pgs # it still shows 1 remapped pg here, which is yet another bug
rgw: 1 daemon active
data:
pools: 7 pools, 3712 pgs
objects: 1.30k objects, 1.68GiB
usage: 103GiB used, 415TiB / 415TiB avail
pgs: 3712 active+clean
io:
client: 13.0KiB/s rd, 12op/s rd, 0op/s wr
Looking back at the whole troubleshooting process: from Luminous onwards, CRUSH automatically generates a device class label based on the disk type, and rules can then be generated automatically per class. This was meant to simplify CRUSH configuration, but it plants a time bomb in scenarios where users define their own CRUSH layout. I therefore strongly recommend that anyone who needs custom CRUSH rules add osd_class_update_on_start = false to ceph.conf, to avoid the kind of trouble described in this post. In addition, the PG state accounting and display in Luminous still has a few bugs; they do not affect normal use, but they can confuse and even mislead people. As a colleague in the field said to me long ago: always keep a sense of awe towards storage and be careful with every operation, or you can lose your job in a matter of minutes.