Analysis of OSD Auto-start at Boot under Bluestore

How OSD Startup at Boot Works

Bluestore is now managed by the official ceph-volume tool, which replaces the earlier ceph-disk. Recall that ceph-disk worked by setting the corresponding attributes on the xfs filesystem and then relying on custom udev rules to trigger startup. No matter how Ceph versions change, the basic idea behind OSD auto-start stays the same: label the block device, then have a boot-time service trigger activation. In short:

Filestore: tag the OSD with xfs (attr) -> triggered by ceph-disk
Bluestore: tag the OSD with LVM (tag) -> triggered by ceph-volume

So as long as you understand the LVM tag mechanism, you can quickly sort out most OSD auto-start troubleshooting.

  • The Filestore OSD startup mechanism is covered at http://www.zphj1987.com/2016/12/26/manage-ceph-osd-journal-uuid/
  • udev rules: https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules
  • LVM primer: https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/7/html/logical_volume_manager_administration/

LVM Basic Architecture

A simple LVM logical volume is made up of the following pieces:

  • The underlying physical storage unit of an LVM logical volume is a block device, such as a partition or a whole disk. That device is initialized as an LVM physical volume (PV).
  • To create an LVM logical volume, physical volumes are combined into a volume group (VG). This produces a pool of disk space out of which LVM logical volumes (LVs) can be allocated, much like partitioning a disk. The logical volumes are then used by filesystems and applications such as databases (a minimal creation sketch follows this list).
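
Below is a minimal sketch (not from the original post) of the same PV -> VG -> LV steps, driven from Python the way ceph-volume drives LVM itself; the device and names /dev/sdb, vg_demo and lv_demo are purely illustrative assumptions:

import subprocess

def run(cmd):
    print('$ ' + ' '.join(cmd))
    subprocess.check_call(cmd)

run(['pvcreate', '/dev/sdb'])                                    # initialize the block device as a PV
run(['vgcreate', 'vg_demo', '/dev/sdb'])                         # pool the PV into a VG
run(['lvcreate', '-l', '100%FREE', '-n', 'lv_demo', 'vg_demo'])  # allocate an LV from the VG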

A Brief Introduction to LVM Tags

  • Tag information is stored in the PV metadata.
  • A tag can be a string of up to 1024 characters. Tags may not start with a hyphen. The allowed characters are [A-Za-z0-9_+.-]; starting with Red Hat Enterprise Linux 6.1 the list was extended to also allow the '/', '=', '!', ':', '#' and '&' characters (a small tag-parsing sketch follows this list).
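
To make the tag format concrete, here is a hedged sketch (not from the post) that reads the tags off a single LV with lvs -o lv_tags and turns the comma-separated key=value list into a dict, which is roughly the shape ceph-volume works with internally:

import subprocess

def read_lv_tags(lv_path):
    # Query only the lv_tags column; LVM returns the tags as a
    # comma-separated list of key=value strings.
    out = subprocess.check_output(
        ['lvs', '--noheadings', '-o', 'lv_tags', lv_path]).decode().strip()
    tags = {}
    for item in out.split(','):
        if '=' in item:
            key, value = item.split('=', 1)
            tags[key] = value
    return tags

# e.g. read_lv_tags('/dev/ceph-.../osd-block-...') might return
# {'ceph.osd_id': '1', 'ceph.type': 'block', ...}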

How Bluestore Applies Tags

Tagging happens during the prepare phase of OSD initialization. The meaning of each tag is annotated directly in the source below:

#ceph-12.2.5/src/ceph-volume/ceph_volume/devices/lvm/prepare.py
def prepare(self, args):
        # FIXME we don't allow re-using a keyring, we always generate one for the
        # OSD, this needs to be fixed. This could either be a file (!) or a string
        # (!!) or some flags that we would need to compound into a dict so that we
        # can convert to JSON (!!!)
        secrets = {'cephx_secret': prepare_utils.create_key()}
        cephx_lockbox_secret = ''
        encrypted = 1 if args.dmcrypt else 0
        cephx_lockbox_secret = '' if not encrypted else prepare_utils.create_key()
        if encrypted:
            secrets['dmcrypt_key'] = encryption_utils.create_dmcrypt_key()
            secrets['cephx_lockbox_secret'] = cephx_lockbox_secret
        cluster_fsid = conf.ceph.get('global', 'fsid')
        osd_fsid = args.osd_fsid or system.generate_uuid()
        crush_device_class = args.crush_device_class
        if crush_device_class:
            secrets['crush_device_class'] = crush_device_class
        # reuse a given ID if it exists, otherwise create a new ID
        self.osd_id = prepare_utils.create_id(osd_fsid, json.dumps(secrets), osd_id=args.osd_id)
        tags = {
            'ceph.osd_fsid': osd_fsid, # the OSD's uuid; upstream now creates the OSD id via `ceph osd new UUID`
            'ceph.osd_id': self.osd_id, # the OSD's numeric id
            'ceph.cluster_fsid': cluster_fsid, # the Ceph cluster's fsid
            'ceph.cluster_name': conf.cluster, # cluster name, "ceph" by default
            'ceph.crush_device_class': crush_device_class, # storage media class, hdd and ssd by default
        }
        if args.filestore:
            if not args.journal:
                raise RuntimeError('--journal is required when using --filestore')
            data_lv = self.get_lv(args.data)
            if not data_lv:
                data_lv = self.prepare_device(args.data, 'data', cluster_fsid, osd_fsid)
            tags['ceph.data_device'] = data_lv.lv_path # block device path of the LV
            tags['ceph.data_uuid'] = data_lv.lv_uuid # UUID of the LV
            tags['ceph.cephx_lockbox_secret'] = cephx_lockbox_secret # lockbox encryption key
            tags['ceph.encrypted'] = encrypted # whether encryption is enabled
            tags['ceph.vdo'] = api.is_vdo(data_lv.lv_path) # whether the device is a VDO device
            journal_device, journal_uuid, tags = self.setup_device('journal', args.journal, tags)
            tags['ceph.type'] = 'data' # 'data' is the Filestore data type
            data_lv.set_tags(tags)
            prepare_filestore(
                data_lv.lv_path,
                journal_device,
                secrets,
                tags,
                self.osd_id,
                osd_fsid,
            )
        elif args.bluestore:
            block_lv = self.get_lv(args.data)
            if not block_lv:
                block_lv = self.prepare_device(args.data, 'block', cluster_fsid, osd_fsid)
            tags['ceph.block_device'] = block_lv.lv_path
            tags['ceph.block_uuid'] = block_lv.lv_uuid
            tags['ceph.cephx_lockbox_secret'] = cephx_lockbox_secret
            tags['ceph.encrypted'] = encrypted
            tags['ceph.vdo'] = api.is_vdo(block_lv.lv_path)
            wal_device, wal_uuid, tags = self.setup_device('wal', args.block_wal, tags)
            db_device, db_uuid, tags = self.setup_device('db', args.block_db, tags)
            tags['ceph.type'] = 'block' # 'block' is the Bluestore data type
            block_lv.set_tags(tags) # apply the LVM tags to the LV
            prepare_bluestore(
                block_lv.lv_path,
                wal_device,
                db_device,
                secrets,
                tags,
                self.osd_id,
                osd_fsid,
            )
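
The set_tags(tags) calls in both branches are what actually persist this dict onto the LV. A hedged sketch of what that boils down to (the real implementation lives in ceph_volume/api/lvm.py; this is the idea, not the verbatim code):

import subprocess

def set_tags(lv_path, tags):
    # one `lvchange --addtag key=value` per entry; LVM stores every tag
    # as a flat "key=value" string in the volume group metadata
    for key, value in tags.items():
        subprocess.check_call(
            ['lvchange', '--addtag', '{}={}'.format(key, value), lv_path])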

The end result can be inspected with the following command:

[root@demo cephuser]# /usr/sbin/lvs --noheadings --readonly --separator="   " -o lv_tags,lv_path,lv_name,vg_name,lv_uuid
     /dev/centos/root   root   centos   fq7hUc-BtCA-AeIk-VvBD-Mxad-7uix-ZtK4I6
     /dev/centos/swap   swap   centos   r7oGD8-Tf15-v4ir-eOR3-LeDg-qqrf-gblNWI
  ceph.block_device=/dev/ceph-508f2cbf-23e3-4b62-a6e1-7143b00d08dd/osd-block-903b1698-9271-4040-8c8f-cd09417a7aaa,ceph.block_uuid=261j3z-Sk9A-LVQS-dFrB-YnvD-ZcaW-x8IdUf,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=21cc0dcd-06f3-4d5d-82c2-dbd411ef0ed9,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=903b1698-9271-4040-8c8f-cd09417a7aaa,ceph.osd_id=1,ceph.type=block,ceph.vdo=0   /dev/ceph-508f2cbf-23e3-4b62-a6e1-7143b00d08dd/osd-block-903b1698-9271-4040-8c8f-cd09417a7aaa   osd-block-903b1698-9271-4040-8c8f-cd09417a7aaa   ceph-508f2cbf-23e3-4b62-a6e1-7143b00d08dd   261j3z-Sk9A-LVQS-dFrB-YnvD-ZcaW-x8IdUf

OSD Auto-start via the ceph-volume Service

Looking at /usr/lib/systemd/system/ceph-volume@.service, the ceph-volume unit contains the following:

[Unit]
Description=Ceph Volume activation: %i
After=local-fs.target
Wants=local-fs.target

[Service]
Type=oneshot
KillMode=none
Environment=CEPH_VOLUME_TIMEOUT=10000
ExecStart=/bin/sh -c 'timeout $CEPH_VOLUME_TIMEOUT /usr/sbin/ceph-volume-systemd  %i' # invokes ceph-volume-systemd
TimeoutSec=0

[Install]
WantedBy=multi-user.target
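
As a hedged illustration (not from the post), the %i instance name is the 'lvm' sub-command joined with the OSD id and fsid; using the values from the lvs output shown earlier, the enabled unit would look roughly like this:

# values taken from the lvs tag output above (ceph.osd_id=1, ceph.osd_fsid=903b...)
osd_id = '1'
osd_fsid = '903b1698-9271-4040-8c8f-cd09417a7aaa'
instance = 'lvm-{}-{}'.format(osd_id, osd_fsid)   # this is what %i expands to
print('ceph-volume@{}.service'.format(instance))
# -> ceph-volume@lvm-1-903b1698-9271-4040-8c8f-cd09417a7aaa.service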

Looking at ceph-volume-systemd, you can see that it calls ceph-volume trigger, which in turn reads the LV tags. The code is as follows:

def main(args=None):
...
    command = ['ceph-volume', sub_command, 'trigger', extra_data]
    tries = os.environ.get('CEPH_VOLUME_SYSTEMD_TRIES', 30)
    interval = os.environ.get('CEPH_VOLUME_SYSTEMD_INTERVAL', 5)
    while tries > 0:
        try:
            # don't log any output to the terminal, just rely on stderr/stdout
            # going to logging
            process.run(command, terminal_logging=False)
            logger.info('successfully triggered activation for: %s', extra_data)
            break
        except RuntimeError as error:
            logger.warning(error)
            logger.warning('failed activating OSD, retries left: %s', tries)
            tries -= 1
            time.sleep(interval)

The Trigger class is implemented as follows:

class Trigger(object):
    help = 'systemd helper to activate an OSD'
    def __init__(self, argv):
        self.argv = argv
    @decorators.needs_root
    def main(self):
        sub_command_help = dedent("""
        ** DO NOT USE DIRECTLY **
        This tool is meant to help the systemd unit that knows about OSDs.
        Proxy OSD activation to ``ceph-volume lvm activate`` by parsing the
        input from systemd, detecting the UUID and ID associated with an OSD::
            ceph-volume lvm trigger {SYSTEMD-DATA}
        The systemd "data" is expected to be in the format of::
            {OSD ID}-{OSD UUID}
        The lvs associated with the OSD need to have been prepared previously,
        so that all needed tags and metadata exist.
        """)
        parser = argparse.ArgumentParser(
            prog='ceph-volume lvm trigger',
            formatter_class=argparse.RawDescriptionHelpFormatter,
            description=sub_command_help,
        )
        parser.add_argument(
            'systemd_data',
            metavar='SYSTEMD_DATA',
            nargs='?',
            help='Data from a systemd unit containing ID and UUID of the OSD, like asdf-lkjh-0'
        )
        if len(self.argv) == 0:
            print(sub_command_help)
            return
        args = parser.parse_args(self.argv)
        osd_id = parse_osd_id(args.systemd_data)
        osd_uuid = parse_osd_uuid(args.systemd_data)
        Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main() # instantiate Activate and run it
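
parse_osd_id and parse_osd_uuid simply split the systemd data described in the docstring above ({OSD ID}-{OSD UUID}). A simplified, hedged sketch of the idea (the real helpers also validate both pieces):

def parse_osd_id(systemd_data):
    # '1-903b1698-9271-4040-8c8f-cd09417a7aaa' -> '1'
    return systemd_data.split('-', 1)[0]

def parse_osd_uuid(systemd_data):
    # '1-903b1698-9271-4040-8c8f-cd09417a7aaa' -> '903b1698-9271-4040-8c8f-cd09417a7aaa'
    return systemd_data.split('-', 1)[1]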

Following the Activate implementation through, the function that does the real work is activate_bluestore. Here is the code:

def activate_bluestore(lvs, no_systemd=False):
    # find the osd
    osd_lv = lvs.get(lv_tags={'ceph.type': 'block'}) # find the LV whose type tag marks it as Bluestore data
    if not osd_lv:
        raise RuntimeError('could not find a bluestore OSD to activate')
    is_encrypted = osd_lv.tags.get('ceph.encrypted', '0') == '1' # whether encryption is enabled
    dmcrypt_secret = None
    osd_id = osd_lv.tags['ceph.osd_id'] # get the OSD id
    conf.cluster = osd_lv.tags['ceph.cluster_name'] # get the cluster name
    osd_fsid = osd_lv.tags['ceph.osd_fsid'] # get the OSD fsid
    # mount the OSD directory on tmpfs, then set up the block, block.db and block.wal symlinks in turn
    osd_path = '/var/lib/ceph/osd/%s-%s' % (conf.cluster, osd_id)
    if not system.path_is_mounted(osd_path):
        # mkdir -p and mount as tmpfs
        prepare_utils.create_osd_path(osd_id, tmpfs=True)
    # XXX This needs to be removed once ceph-bluestore-tool can deal with
    # symlinks that exist in the osd dir
    for link_name in ['block', 'block.db', 'block.wal']: 
        link_path = os.path.join(osd_path, link_name)
        if os.path.exists(link_path):
            os.unlink(os.path.join(osd_path, link_name))
    # encryption is handled here, before priming the OSD dir
    if is_encrypted:
        osd_lv_path = '/dev/mapper/%s' % osd_lv.lv_uuid
        lockbox_secret = osd_lv.tags['ceph.cephx_lockbox_secret']
        encryption_utils.write_lockbox_keyring(osd_id, osd_fsid, lockbox_secret)
        dmcrypt_secret = encryption_utils.get_dmcrypt_key(osd_id, osd_fsid)
        encryption_utils.luks_open(dmcrypt_secret, osd_lv.lv_path, osd_lv.lv_uuid)
    else:
        osd_lv_path = osd_lv.lv_path
    db_device_path = get_osd_device_path(osd_lv, lvs, 'db', dmcrypt_secret=dmcrypt_secret)
    wal_device_path = get_osd_device_path(osd_lv, lvs, 'wal', dmcrypt_secret=dmcrypt_secret)
    # Once symlinks are removed, the osd dir can be 'primed again.
    process.run([
        'ceph-bluestore-tool', '--cluster=%s' % conf.cluster,
        'prime-osd-dir', '--dev', osd_lv_path,
        '--path', osd_path])
    # always re-do the symlink regardless if it exists, so that the block,
    # block.wal, and block.db devices that may have changed can be mapped
    # correctly every time
    process.run(['ln', '-snf', osd_lv_path, os.path.join(osd_path, 'block')])
    system.chown(os.path.join(osd_path, 'block'))
    system.chown(osd_path)
    if db_device_path:
        destination = os.path.join(osd_path, 'block.db')
        process.run(['ln', '-snf', db_device_path, destination])
        system.chown(db_device_path)
    if wal_device_path:
        destination = os.path.join(osd_path, 'block.wal')
        process.run(['ln', '-snf', wal_device_path, destination])
        system.chown(wal_device_path)
    if no_systemd is False:
        # enable the ceph-volume unit for this OSD; once the directory is mounted, start the OSD service via systemd
        systemctl.enable_volume(osd_id, osd_fsid, 'lvm')
        # start the OSD
        systemctl.start_osd(osd_id)
    terminal.success("ceph-volume lvm activate successful for osd ID: %s" % osd_id)
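
The two systemctl helpers at the end are thin wrappers. A hedged sketch of roughly what they amount to (simplified, not the verbatim ceph_volume/systemd/systemctl.py source):

import subprocess

def enable_volume(osd_id, osd_fsid, device_type='lvm'):
    # enable the per-OSD activation unit so it fires again on the next boot
    subprocess.check_call(
        ['systemctl', 'enable',
         'ceph-volume@{}-{}-{}'.format(device_type, osd_id, osd_fsid)])

def start_osd(osd_id):
    # start the OSD daemon itself
    subprocess.check_call(['systemctl', 'start', 'ceph-osd@{}'.format(osd_id)])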

To sum up the flow:

  • prepare creates (or reuses) the LV and writes the ceph.* tags onto it.
  • activate reads those tags back, mounts the OSD directory on tmpfs, primes it with ceph-bluestore-tool, re-creates the block / block.db / block.wal symlinks, enables the ceph-volume@lvm-{id}-{fsid} unit and starts ceph-osd@{id}.
  • On every boot, the enabled ceph-volume@ unit runs ceph-volume-systemd, which retries ceph-volume lvm trigger {id}-{uuid}; trigger proxies back to activate, so the OSD comes up automatically.

Manual Maintenance of Tags

Add an LV tag

[root@demo cephuser]# lvchange --addtag ceph.cephx_lockbox_secret= /dev/mapper/ceph--508f2cbf--23e3--4b62--a6e1--7143b00d08dd-osd--block--903b1698--9271--4040--8c8f--cd09417a7aaa
  Logical volume ceph-508f2cbf-23e3-4b62-a6e1-7143b00d08dd/osd-block-903b1698-9271-4040-8c8f-cd09417a7aaa changed.

Delete an LV tag

[root@demo cephuser]# lvchange --deltag ceph.block_device /dev/mapper/ceph--508f2cbf--23e3--4b62--a6e1--7143b00d08dd-osd--block--903b1698--9271--4040--8c8f--cd09417a7aaa
  Logical volume ceph-508f2cbf-23e3-4b62-a6e1-7143b00d08dd/osd-block-903b1698-9271-4040-8c8f-cd09417a7aaa changed.

View LV tags

[root@demo cephuser]# /usr/sbin/lvs --noheadings --readonly  -o lv_tags
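
Since the whole auto-start chain hinges on these tags, here is a hedged troubleshooting sketch (not from the post) that spots a Bluestore OSD LV which has lost tags; the required-key list below is taken from the prepare code earlier in this article:

import subprocess

REQUIRED = [
    'ceph.osd_id', 'ceph.osd_fsid', 'ceph.cluster_fsid', 'ceph.cluster_name',
    'ceph.block_device', 'ceph.block_uuid', 'ceph.type',
]

def check_osd_lv(lv_path):
    out = subprocess.check_output(
        ['lvs', '--noheadings', '-o', 'lv_tags', lv_path]).decode().strip()
    present = {item.split('=', 1)[0] for item in out.split(',') if '=' in item}
    missing = [key for key in REQUIRED if key not in present]
    if missing:
        print('missing tags on {}: {}'.format(lv_path, ', '.join(missing)))
    else:
        print('{}: all required tags present'.format(lv_path))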

Originally published on the WeChat official account Ceph对象存储方案 (cephbook)

Original publication date: 2018-06-14
