Ceph APAC Summit: RGW Topics

The most substantial content at this Ceph APAC Summit was Red Hat's talk "Common Support Issues and How to Troubleshoot Them". I have pulled out the RGW portion to share here; it mainly covers how to handle OSD trouble caused by RGW buckets holding too many objects.

Symptom Description

Flapping OSDs when RGW buckets have millions of objects
● Possible causes
○ The first issue: when RGW buckets hold millions of objects, their bucket index shard RADOS objects become very large, with a huge number of OMAP keys stored in leveldb. Operations such as deep-scrub and bucket index listing then take a long time to complete, and this triggers OSDs to flap. If sharding is not used, the issue gets worse, because a single RADOS index object then holds all the OMAP keys.

RGW stores bucket index data as OMAP entries in leveldb on the nodes hosting the OSDs. Once a single bucket holds objects on the order of millions, operations like deep-scrub and bucket listing consume a huge amount of disk resources and push the corresponding OSDs into trouble. If the bucket index is not sharded (sharding splits a single bucket's index across multiple RADOS objects, and therefore across multiple OSDs), things get ugly quickly as the data volume grows.
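As a quick way to locate those index shard objects, here is a hedged sketch; the pool name default.rgw.buckets.index and the bucket name mybucket are assumptions for illustration:

    # find the bucket's id (the "id" field in the JSON output)
    radosgw-admin bucket stats --bucket=mybucket | grep '"id"'
    # index shard objects are named .dir.<bucket-id>.<shard-number>
    rados -p default.rgw.buckets.index ls | grep '^\.dir\.'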

○ The second issue: when you have a good amount of DELETEs, they leave loads of stale data in OMAP, which triggers leveldb compaction all the time. Compaction is single-threaded and not optimal for this kind of workload, and the OSD op threads hit their suicide timeout because the store is always compacting; hence the OSDs start flapping.

When RGW handles a large volume of DELETE requests, the underlying leveldb ends up compacting constantly (compaction is very punishing on disk performance). Worse, leveldb compaction is single-threaded, so the op threads easily hit the osd_op_thread_suicide_timeout ceiling and the OSD kills itself.

● Possible causes contd ...
○ The OMAP backend is leveldb in jewel and older clusters. Any luminous cluster that was upgraded from an older release still has leveldb as its OMAP backend.

In jewel and earlier, OMAP uses leveldb as its storage engine; if you upgrade from an older release to the latest luminous, the underlying OMAP backend remains leveldb.

○ All new luminous clusters default to rocksdb as the OMAP backend, which is great because rocksdb has multithreaded compaction (Ceph uses 8 compaction threads by default) and many other enhancements compared to leveldb.

Starting with the latest luminous, the OMAP storage engine is switched to rocksdb. rocksdb compacts with multiple threads (8 by default in Ceph), so its compaction is far more efficient than leveldb's.
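To check which OMAP backend an OSD is actually using, a small sketch (<id> is a placeholder for an OSD id; the second command relies on the superblock recording the backend name, which the conversion procedure later in this post edits with sed):

    # via the admin socket on the OSD host
    ceph daemon osd.<id> config get filestore_omap_backend
    # or inspect the on-disk superblock
    strings /var/lib/ceph/osd/ceph-<id>/superblock | grep -Eo 'leveldb|rocksdb'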

Temporary Workarounds

Temporary step 1: improve cluster stability by disabling deep-scrub, either for the whole cluster or for an individual pool.

○ The first temporary action should be setting the nodeep-scrub flag, either globally in the cluster with ceph osd set nodeep-scrub, or only on the RGW index pool with ceph osd pool set <pool-name> nodeep-scrub 1.
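For reference, both variants spelled out (the index pool name default.rgw.buckets.index is an assumption; check yours with ceph osd lspools):

    # disable deep-scrub cluster-wide
    ceph osd set nodeep-scrub
    # or only on the RGW index pool
    ceph osd pool set default.rgw.buckets.index nodeep-scrub 1
    # remember to re-enable once the issue is resolved
    ceph osd unset nodeep-scrub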


○ A second temporary step can be taken if OSDs still keep hitting the suicide timeout: increase the OSD op threads' normal and suicide timeout values, and if the filestore op threads are also hitting timeouts, increase their normal and suicide timeouts as well.

Temporary step 2: tune some op-related timeout parameters to reduce the chance of triggering OSD suicide, for example the following:

○ Add these options in the [osd.id] or [osd] section to keep them in place while troubleshooting, and use the ceph tell injectargs command to inject them at run time.

    osd_op_thread_timeout = 90                   # default is 15
    osd_op_thread_suicide_timeout = 2000         # default is 150

  If filestore op threads are also hitting timeouts:

    filestore_op_thread_timeout = 180            # default is 60
    filestore_op_thread_suicide_timeout = 2000   # default is 180

  The same can be done for the recovery threads:

    osd_recovery_thread_timeout = 120            # default is 30
    osd_recovery_thread_suicide_timeout = 2000   # default is 300
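A sketch of injecting the same values at run time without a restart (osd.* targets every OSD; substitute a single osd.<id> to limit the scope):

    ceph tell osd.* injectargs '--osd_op_thread_timeout 90 --osd_op_thread_suicide_timeout 2000'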

Temporary step 3: when the OMAP directory grows too large, manually trigger leveldb compaction on the affected OSDs to shrink their leveldb stores.

○ The third temporary step can be taken if OSDs have very large OMAP directories; verify with du -sh /var/lib/ceph/osd/ceph-$id/current/omap, then run a manual leveldb compaction on those OSDs.
■ ceph tell osd.$id compact, or
■ ceph daemon osd.$id compact, or
■ Add leveldb_compact_on_mount = true in the [osd.$id] or [osd] section and restart the OSD.
■ This makes sure leveldb is compacted before the OSD is brought back up/in, which really helps.
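A hedged sketch combining the check and the compaction; in practice you would target only the OSDs backing the index pool, one at a time, rather than the whole cluster:

    # size of each local OSD's OMAP directory (FileStore layout)
    du -sh /var/lib/ceph/osd/ceph-*/current/omap
    # trigger a manual compaction on every OSD, one by one
    for id in $(ceph osd ls); do ceph tell osd.$id compact; done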

Permanent Solutions

○ Calculate the bucket index shard RADOS object size
■ Count the OMAP keys in an index shard object:
● rados -p <rgw index pool name> listomapkeys <index-shard-object-name> | wc -l
■ Each OMAP key is about 200 bytes, so the object size is roughly:
● <count from above command> * 200 = <value in bytes>
○ If an index shard object is very big, say above 20 MB, consider resharding: the shard count was not set as recommended, or sharding is not used at all.

■ radosgw-admin bucket reshard is the command; more details can be found in the upstream documentation. This is an offline reshard tool.
■ Because of these issues, luminous now has dynamic resharding.
● http://docs.ceph.com/docs/master/radosgw/dynamicresharding
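Putting the estimate together, a sketch (pool and shard object names are assumptions for illustration; mind the caveat below about running listomapkeys against an already struggling cluster):

    # count the keys in one index shard object and estimate its size at 200 bytes per key
    keys=$(rados -p default.rgw.buckets.index listomapkeys .dir.<bucket-id>.0 | wc -l)
    echo "approx $((keys * 200 / 1024 / 1024)) MB"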

The approach is to count the OMAP entries of each index shard object in the index pool (better not to follow the presenter literally and enumerate them on a busy cluster; that can easily make a bad situation worse), estimate each OMAP object's size at 200 bytes per key, and manually reshard once a shard object exceeds 20 MB. Be aware that a reshard carries a risk of losing bucket metadata, so use it with caution; see the earlier article on this account about backing up bucket metadata. The talk also mentioned that the latest luminous supports dynamic resharding (adjusting the shard count on the fly based on a bucket's current object count). In practice this has sizeable pitfalls too: dynamic resharding is not transparent enough to users, and reads and writes to the bucket are blocked for a while during the reshard. From my own experience this feature is best left disabled; it is much better to design each bucket's shard count properly from day one and get it right in one step. For how to do that, see the earlier articles on this account (the 《RGW Bucket Shard设计与优化》 series).
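For reference, a minimal sketch of the offline reshard invocation (bucket name and shard count are assumptions; back up the bucket metadata first, as noted above):

    radosgw-admin bucket reshard --bucket=mybucket --num-shards=64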

Permanent Solutions contd ...
○ If the RGW index pool is not backed by SSD or NVMe OSDs, and those OSDs run above 80% disk utilization (disk bound) during leveldb compaction, consider migrating the index pool to a new CRUSH ruleset backed by SSDs or NVMe SSDs.
○ If the RGW index pool OSDs are always above 100% CPU (CPU bound) during leveldb compaction, consider converting the OMAP backend from leveldb to rocksdb.
○ Jewel still does not support rocksdb as the OMAP backend - jewel pull request 18010 will bring rocksdb support to jewel.

Beyond that, you can dedicate SSD or NVMe OSDs to the index pool; however, leveldb by design gets limited benefit from SSDs, so it is best to switch to rocksdb. Also note that jewel and earlier do not support switching the OMAP engine to rocksdb unless you apply the patch below: https://github.com/ceph/ceph/pull/18010

The talk also gave a detailed procedure for switching leveldb to rocksdb, reproduced below; for simplicity, though, changing the config and rebuilding the OSD is far less worrisome. The procedure is as follows:

Permanent Solutions contd ...
○ Once jewel gains rocksdb support (luminous already has it), these commands can be used to convert the OMAP backend from leveldb to rocksdb:
■ Stop the OSD
■ mv /var/lib/ceph/osd/ceph-<id>/current/omap /var/lib/ceph/osd/ceph-<id>/omap.orig
■ ulimit -n 65535
■ ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-<id>/omap.orig store-copy /var/lib/ceph/osd/ceph-<id>/current/omap 10000 rocksdb
■ ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-<id>/current/omap --command check
■ sed -i s/leveldb/rocksdb/g /var/lib/ceph/osd/ceph-<id>/superblock
■ chown ceph.ceph /var/lib/ceph/osd/ceph-<id>/current/omap -R
■ cd /var/lib/ceph/osd/ceph-<id>; rm -rf omap.orig
■ Start the OSD
○ If you do not want to go through the steps above, you can instead rebuild the OSD with filestore_omap_backend = "rocksdb".
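If you take the rebuild route instead, the relevant setting would look like this in ceph.conf (set it before re-creating the OSD so the new OMAP store is formatted with rocksdb):

    [osd]
    filestore_omap_backend = rocksdb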

Summary

In summary:
○ Have the RGW index pool backed by SSDs or NVMe.
○ Set a proper bucket index shard count to a sensible value from the start, considering future growth.
○ Have the RGW index pool OSDs use rocksdb with 8 compaction threads, rocksdb compression disabled, and rocksdb_cache_size tuned to your workload (1 GB is a starting point and can be increased).
○ If you still see index pool OSDs flapping during deep-scrub, keep the nodeep-scrub flag set on the index pool; the luminous pull request "luminous: osd: deep-scrub preemption" will fix this issue, and you can unset nodeep-scrub after upgrading to a fixed luminous version.
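A hedged ceph.conf sketch of that rocksdb tuning; filestore_rocksdb_options is the FileStore-era knob, and the exact rocksdb option string should be verified against your Ceph version:

    [osd]
    filestore_omap_backend = rocksdb
    # 8 compaction threads and compression off, per the recommendation above
    filestore_rocksdb_options = "max_background_compactions=8,compression=kNoCompression"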

Combining the slides with my own experience, the approach to this class of problems is roughly:

  1. Always use SSDs for the index pool.
  2. Plan the bucket index shard count ahead of time; see the earlier bucket index shard articles on this account.
  3. On jewel and earlier, if the hardware allows, consider switching leveldb to rocksdb, and consider disabling deep-scrub during business peaks. For a brand-new cluster on the L release, drop FileStore and use BlueStore as the default storage engine.

All in all, bucket index performance needs SSD backing, and large-scale clusters must get the initial design right; once the data volume has grown, it is very hard to patch things up after the fact.

Originally published on the WeChat public account Ceph对象存储方案 (cephbook).

Original publication date: 2018-04-01
