
QEMU live migration code analysis

惠伟 · published 2021-02-24

I started studying live migration to understand why it is slow and why it often fails. When a physical host needs a kernel upgrade, every VM on it has to be migrated away one by one; this is slow and time-consuming, and sometimes the migration simply reports failure.

A live-migrated VM is still paused briefly; the pause is just short enough that the impact is small and the service appears uninterrupted. Keep this in mind and don't be misled by marketing claims.

live migration usage

On dst, start qemu-kvm with the extra option -incoming tcp:0:6666 so that it waits for the incoming migration.

On src, enter the qemu monitor and run migrate tcp:$ip:6666; progress can be checked with info migrate.

For post copy, first run migrate_set_capability postcopy-ram on (on both sides), then run migrate_start_postcopy once the migration is underway.

In an OpenStack environment nova-compute drives qemu through libvirt; the same monitor commands can be issued with virsh qemu-monitor-command <domain> --hmp <command>. A full example follows.
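
For example, a complete pre-copy migration with an optional switch to post-copy could look like this (a sketch only; port 6666 and the destination address 192.168.0.2 are placeholders):

# on dst: start qemu-kvm listening for the incoming migration
qemu-kvm <same options as on src> -incoming tcp:0:6666

# on src, in the qemu monitor (directly, or via virsh qemu-monitor-command --hmp):
(qemu) migrate_set_capability postcopy-ram on    # only if post-copy may be wanted; set on dst as well
(qemu) migrate -d tcp:192.168.0.2:6666
(qemu) info migrate                              # watch progress and dirty-page statistics
(qemu) migrate_start_postcopy                    # optional: switch the running migration to post-copy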

live migration principles

qemu has two concepts, save_vm and load_vm, which are used by migration, snapshots and so on. save_vm writes the qemu CPU state, memory and device state into an fd; the fd can be a local file, a socket, etc. load_vm does the opposite and restores the saved information into the virtual machine. Live migration simply starts a VM on dst and restores everything the src VM sends into the corresponding places. Because memory is large and takes a long time to send, the point at which it is sent distinguishes pre_copy from post_copy. pre_copy sends memory first; once a threshold is reached, the src VM is stopped, the CPU state and device state are sent, and dst receives them and starts running. post_copy sends the CPU state and device state first and stops the src VM; dst marks all pages as missing and starts running, qemu catches the resulting page faults and requests the pages from src, src puts those pages into its send queue, and once dst receives a requested page it tells the kernel that the page fault has been handled and execution continues.

In postcopy mode src still keeps streaming pages to dst in the background; only when dst cannot wait for a particular page does that page jump the queue and get sent first. Overall, live migration has many states, many threads (data handed between them needs protection and synchronization) and a lot of interaction with kvm (dirty page logging and userfault), so it is easy to get wrong and leaves plenty of room for optimization.

Basics

Migration has to handle CPU state, RAM and device state. CPU state is a pile of registers, the stack and so on, i.e. the state defined by the VMCS. RAM is the bulk of it, including ROM, PCI memory, DIMMs and so on, and is accessed page by page. Devices are the most varied: registers, queues and whatnot differ wildly from device to device, so each device has to implement its own save and load functions and register them with the migration framework.

typedef struct SaveStateEntry {
    QTAILQ_ENTRY(SaveStateEntry) entry;
    char idstr[256];
    int instance_id;
    int alias_id;
    int version_id;
    /* version id read from the stream */
    int load_version_id;
    int section_id;
    /* section id read from the stream */
    int load_section_id;
    SaveVMHandlers *ops;
    const VMStateDescription *vmsd;
    void *opaque;
    CompatEntry *compat;
    int is_ram;
} SaveStateEntry;

typedef struct SaveState {
    QTAILQ_HEAD(, SaveStateEntry) handlers;
    int global_section_id;
    uint32_t len;
    const char *name;
    uint32_t target_page_bits;
} SaveState;

static SaveState savevm_state = {
    .handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
    .global_section_id = 0,
};
typedef struct SaveVMHandlers {
    /* This runs inside the iothread lock.  */
    SaveStateHandler *save_state;

    void (*save_cleanup)(void *opaque);
    int (*save_live_complete_postcopy)(QEMUFile *f, void *opaque);
    int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);

    /* This runs both outside and inside the iothread lock.  */
    bool (*is_active)(void *opaque);
    bool (*has_postcopy)(void *opaque);

    /* is_active_iterate
     * If it is not NULL then qemu_savevm_state_iterate will skip iteration if
     * it returns false. For example, it is needed for only-postcopy-states,
     * which needs to be handled by qemu_savevm_state_setup and
     * qemu_savevm_state_pending, but do not need iterations until not in
     * postcopy stage.
     */
    bool (*is_active_iterate)(void *opaque);

    /* This runs outside the iothread lock in the migration case, and
     * within the lock in the savevm case.  The callback had better only
     * use data that is local to the migration thread or protected
     * by other locks.
     */
    int (*save_live_iterate)(QEMUFile *f, void *opaque);

    /* This runs outside the iothread lock!  */
    int (*save_setup)(QEMUFile *f, void *opaque);
    void (*save_live_pending)(QEMUFile *f, void *opaque,
                              uint64_t threshold_size,
                              uint64_t *res_precopy_only,
                              uint64_t *res_compatible,
                              uint64_t *res_postcopy_only);
    /* Note for save_live_pending:
     * - res_precopy_only is for data which must be migrated in precopy phase
     *     or in stopped state, in other words - before target vm start
     * - res_compatible is for data which may be migrated in any phase
     * - res_postcopy_only is for data which must be migrated in postcopy phase
     *     or in stopped state, in other words - after source vm stop
     *
     * Sum of res_precopy_only, res_compatible and res_postcopy_only is the
     * whole amount of pending data.
     */


    LoadStateHandler *load_state;
    int (*load_setup)(QEMUFile *f, void *opaque);
    int (*load_cleanup)(void *opaque);
} SaveVMHandlers;

static SaveVMHandlers savevm_ram_handlers = {
    .save_setup = ram_save_setup,
    .save_live_iterate = ram_save_iterate,
    .save_live_complete_postcopy = ram_save_complete,
    .save_live_complete_precopy = ram_save_complete,
    .has_postcopy = ram_has_postcopy,
    .save_live_pending = ram_save_pending,
    .load_state = ram_load,
    .save_cleanup = ram_save_cleanup,
    .load_setup = ram_load_setup,
    .load_cleanup = ram_load_cleanup,
};

void ram_mig_init(void)
{
    qemu_mutex_init(&XBZRLE.lock);
    register_savevm_live(NULL, "ram", 0, 4, &savevm_ram_handlers, &ram_state);
}
struct VMStateField {
    const char *name;
    const char *err_hint;
    size_t offset;
    size_t size;
    size_t start;
    int num;
    size_t num_offset;
    size_t size_offset;
    const VMStateInfo *info;
    enum VMStateFlags flags;
    const VMStateDescription *vmsd;
    int version_id;
    bool (*field_exists)(void *opaque, int version_id);
};

struct VMStateDescription {
    const char *name;
    int unmigratable;
    int version_id;
    int minimum_version_id;
    int minimum_version_id_old;
    MigrationPriority priority;
    LoadStateHandler *load_state_old;
    int (*pre_load)(void *opaque);
    int (*post_load)(void *opaque, int version_id);
    int (*pre_save)(void *opaque);
    bool (*needed)(void *opaque);
    VMStateField *fields;
    const VMStateDescription **subsections;
};

static const VMStateDescription vmstate_e1000;
static void e1000_class_init(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);

    dc->vmsd = &vmstate_e1000;
}
int vmstate_register_with_alias_id(DeviceState *dev, int instance_id,
                                   const VMStateDescription *vmsd,
                                   void *base, int alias_id,
                                   int required_for_version,
                                   Error **errp);

/* Returns: 0 on success, -1 on failure */
static inline int vmstate_register(DeviceState *dev, int instance_id,
                                   const VMStateDescription *vmsd,
                                   void *opaque)
{
    return vmstate_register_with_alias_id(dev, instance_id, vmsd,
                                          opaque, -1, 0, NULL);
}

The global variable savevm_state is a linked list: RAM and every device hang their save/load callbacks on a list node, and migration just walks the list and invokes them. RAM and devices are handled differently: RAM uses SaveVMHandlers, while devices use VMStateDescription. VMStateDescriptions can be nested and implement save/load for the basic data types.
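
As a minimal sketch of the device side (FooState, foo_realize and the FOO() cast macro are hypothetical, not from the QEMU tree; the VMSTATE_* macros and vmstate_register are real):

#include "migration/vmstate.h"

/* hypothetical device state: two registers and a ring index */
typedef struct FooState {
    uint32_t ctrl;
    uint32_t status;
    int32_t  ring_index;
} FooState;

static const VMStateDescription vmstate_foo = {
    .name = "foo-device",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32(ctrl, FooState),    /* each macro fills one VMStateField: offset, size, VMStateInfo */
        VMSTATE_UINT32(status, FooState),
        VMSTATE_INT32(ring_index, FooState),
        VMSTATE_END_OF_LIST()
    },
};

/* either set dc->vmsd in class_init (as e1000 does above) or register explicitly;
 * both paths end up as a SaveStateEntry on savevm_state.handlers */
static void foo_realize(DeviceState *dev, Error **errp)
{
    FooState *foo = FOO(dev);
    vmstate_register(dev, 0, &vmstate_foo, foo);
}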

pre_copy

pre_copy handles RAM first. It starts by marking all RAM dirty and then sends RAM in a loop while the vCPUs keep running and writing to it; each iteration fetches from kvm the pages the CPUs have dirtied, until a condition is reached. Then the CPUs are stopped, the remaining RAM is sent, and finally the CPU and device state.

migrate_fd_connect
{   // create the cleanup bottom half, scheduled when migration finishes
    s->cleanup_bh = qemu_bh_new(migrate_fd_cleanup, s);
    // create the migration worker thread
    qemu_thread_create(migration_thread)
}

migration_thread
{
    qemu_savevm_state_setup
    {
        ram_save_setup
        {
            ram_init_all->ram_init_bitmaps
                          {
                              ram_list_init_bitmaps
                              memory_global_dirty_log_start
                              migration_bitmap_sync->kvm_physical_sync_dirty_bitmap
                          }              
            compress_threads_save_setup  // create the compression threads
         }
    }
    while(true) 
    {
        qemu_savevm_state_pending->ram_save_pending->migration_bitmap_sync
        migration_iteration_run
        {
             if (!reached_threshold)   // not converged yet: keep iterating
                 qemu_savevm_state_iterate
                     ram_save_iterate->ram_find_and_save_block
              else
                  qemu_savevm_state_complete_precopy->ram_save_complete
                  // plus the other devices' state
                  vmstate_save_state
               
        }
    }

    migration_iteration_finish->qemu_bh_schedule(s->cleanup_bh);
}

migrate_fd_cleanup
{
    qemu_savevm_state_cleanup->ram_save_cleanup
    stop the migration_thread
}


process_incoming_migration_co
{
    qemu_loadvm_state
     {
          qemu_loadvm_state_setup->ram_load_setup->compress_threads_load_setup
          // create the do_data_decompress threads, the counterpart of compress_threads_save_setup
          vmstate_load_state
          qemu_loadvm_state_main
          {
               case:    qemu_loadvm_section_start_full->vmstate_load_state
               case:    qemu_loadvm_section_part_end->vmstate_load_state
          }
          qemu_loadvm_state_cleanup
      }
    process_incoming_migration_bh
}
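
The condition that ends the while loop above is roughly "the data still pending can be sent within the allowed downtime". A simplified sketch of that check (the function and parameter names are mine; the real logic lives in migration_update_counters/migration_iteration_run and also deals with post-copy and rate limiting):

#include <stdbool.h>
#include <stdint.h>

/* assuming bandwidth is measured in bytes per millisecond (qemu derives it from
 * the bytes transferred during the last interval) and downtime_limit is in ms */
static bool precopy_can_complete(uint64_t pending_bytes,
                                 double bandwidth_bytes_per_ms,
                                 uint64_t downtime_limit_ms)
{
    /* how much data can still be sent while the guest is stopped
     * without exceeding the configured downtime */
    uint64_t threshold_size = (uint64_t)(bandwidth_bytes_per_ms * downtime_limit_ms);

    /* below the threshold: stop the vCPUs, send the last round of dirty RAM,
     * then the CPU and device state */
    return pending_bytes <= threshold_size;
}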

post_copy

Why is post_copy needed? Because pre_copy may never converge: the VM runs fast and keeps producing dirty pages, while the fd is comparatively slow and never finishes sending them, so the completion condition is never reached. post_copy sends the CPU and device state first and stops the source, then transfers RAM gradually; whenever dst finds a page missing, it requests that page from src.

migrate_fd_connect
{   // create the cleanup bottom half, scheduled when migration finishes
    s->cleanup_bh = qemu_bh_new(migrate_fd_cleanup, s);
    /****************************************/
    // for postcopy: create the thread that handles page requests coming back from dst
    open_return_path_on_source->qemu_thread_create(source_return_path_thread)
    /***************************************/
    // create the migration worker thread
    qemu_thread_create(migration_thread)
}
migration_thread
{
    /******************************/
    qemu_savevm_send_open_return_path
    qemu_savevm_send_ping
    qemu_savevm_send_postcopy_advise
    /******************************/
    qemu_savevm_state_setup
    {
        ram_save_setup
        {
            ram_init_all->ram_init_bitmaps
                          {
                              ram_list_init_bitmaps
                              memory_global_dirty_log_start
                              migration_bitmap_sync->kvm_physical_sync_dirty_bitmap
                          }              
            compress_threads_save_setup  // create the compression threads
         }
    }
    while(true) 
    {
        qemu_savevm_state_pending->ram_save_pending->migration_bitmap_sync
        
        migration_iteration_run
        {
             if (!reached_threshold && !post_copy)
                 /******************************/
                 if (postcopy_start()&&first)
                     return;
                 /*****************************/
                 qemu_savevm_state_iterate
                     ram_save_iterate
              else
                  migration_completion
                  {   if pre_copy
                          qemu_savevm_state_complete_precopy->ram_save_complete
                          // plus the other devices' state
                          vmstate_save_state
                      /******************************/
                      else if post_copy
                          qemu_savevm_state_complete_postcopy->ram_save_complete
                      /******************************/
                  }
        }
    }
    migration_iteration_finish->qemu_bh_schedule(s->cleanup_bh);
}

migrate_fd_cleanup
{
    qemu_savevm_state_cleanup->ram_save_cleanup
    stop the migration_thread
}


process_incoming_migration_co
{
    qemu_loadvm_state
     {
          qemu_loadvm_state_setup->ram_load_setup->compress_threads_load_setup
          // create the do_data_decompress threads, the counterpart of compress_threads_save_setup
          vmstate_load_state
          qemu_loadvm_state_main
          {
               case:    qemu_loadvm_section_start_full->vmstate_load_state
               case:    qemu_loadvm_section_part_end->vmstate_load_state
               /******************************/
               // this case is only taken with post_copy
               case:    loadvm_process_command
               {
                    case:    loadvm_handle_cmd_packaged
                    {
                        // recursion: only the first two cases run in the nested call
                        qemu_loadvm_state_main
                    }
                    case:    loadvm_postcopy_handle_advise
                    // used to receive the pages returned for the page requests
                    case:    loadvm_postcopy_handle_listen
                    {
                        // receive page faults from the kernel, then send page requests to src
                        qemu_thread_create(postcopy_ram_fault_thread)
                        // receive the pages sent by src
                        qemu_thread_create(postcopy_ram_listen_thread)
                    }
                    case:    loadvm_postcopy_handle_run->loadvm_postcopy_handle_run_bh
                    // src starts the page-sending thread and stops running; the dst vCPUs start executing
                    case:    loadvm_postcopy_ram_handle_discard
               }
               /******************************/
          }
          qemu_loadvm_state_cleanup
      }
    process_incoming_migration_bh
}
postcopy_ram_listen_thread
{ 
    // only the first two cases run here
    qemu_loadvm_state_main
    qemu_loadvm_state_cleanup
}

Compared with pre_copy, post_copy adds the following steps.

src:

source_return_path_thread
qemu_savevm_send_open_return_path
qemu_savevm_send_ping
qemu_savevm_send_postcopy_advise
postcopy_start
qemu_savevm_state_complete_postcopy

dst:

loadvm_process_command  // synchronizes with src; src tells dst which step to enter next
postcopy_ram_fault_thread (uses the USERFAULT mechanism so qemu learns about page faults)->migrate_send_rp_req_pages
postcopy_ram_listen_thread
// restore RAM; tell the kernel via userfaultfd that the page fault has been handled (plus other fds to notify other processes sharing the memory, etc.)
ram_load->ram_load_postcopy->postcopy_place_page->qemu_ufd_copy_ioctl
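
The USERFAULT mechanism is the kernel's userfaultfd interface. A minimal standalone sketch of the idea behind postcopy_ram_fault_thread and qemu_ufd_copy_ioctl (not the actual QEMU code):

#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* register [addr, addr+len) so that faults on missing pages are delivered
 * to user space instead of being filled by the kernel */
static int uffd_register(void *addr, size_t len)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);
    return uffd;
}

/* wait for one fault, then resolve it by copying the page content in
 * (in postcopy the content is first requested from src over the return path);
 * UFFDIO_COPY places the page and wakes the faulting vCPU thread */
static void handle_one_fault(int uffd, void *page_from_src, size_t page_size)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    poll(&pfd, 1, -1);

    struct uffd_msg msg;
    if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
        msg.event != UFFD_EVENT_PAGEFAULT) {
        return;
    }

    uint64_t fault_addr = msg.arg.pagefault.address & ~(uint64_t)(page_size - 1);

    struct uffdio_copy copy = {
        .dst  = fault_addr,
        .src  = (uintptr_t)page_from_src,
        .len  = page_size,
        .mode = 0,
    };
    ioctl(uffd, UFFDIO_COPY, &copy);
}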

pre_copy vs post_copy

pre_copy may fail to converge and simply gives up after a time limit; the root cause is that the vCPUs produce dirty pages too quickly, which is why people came up with making the vCPUs run more slowly (see the monitor commands below). The problem with post_copy is that some failures cannot be recovered from: if a pre_copy migration fails, the VM just resumes on the src node, but post_copy has already modified state on dst that can no longer be synchronized back to src, so a failure there is unrecoverable.
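
"Making the vCPUs run more slowly" is the auto-converge capability: when the migration is not converging, qemu throttles the vCPUs harder and harder so the guest dirties memory more slowly. A typical setup from the monitor (the throttle values shown are qemu's defaults, the downtime value is just illustrative):

(qemu) migrate_set_capability auto-converge on
(qemu) migrate_set_parameter cpu-throttle-initial 20      # initial vCPU throttling percentage
(qemu) migrate_set_parameter cpu-throttle-increment 10    # added each time convergence still fails
(qemu) migrate_set_parameter downtime-limit 500           # allowed downtime in milliseconds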

Migration failure cases

qemu version mismatch

pre_copy failure cases on src:

sending on the fd fails
failure to converge, leading to a timeout
the save function registered by any device fails
vm stop fails, e.g. pending data cannot be written out to disk
the vm is paused in the middle of the migration

pre_copy failure cases on dst:

receiving data from the fd fails
the pre_load or post_load of any device fails (returns non-zero somewhere); I have seen virtio-net report an error here because the backend did not support a feature

post_copy failure cases on src:

all of the pre_copy src cases above
the length of the data to send exceeds the maximum allowed length

post_copy failure cases on dst:

a received command is invalid, or its data length is wrong
opening the socket fails
post_copy is not supported, i.e. the kernel cannot deliver page faults to user space
page size mismatch between the two hosts
post_copy has a fixed sequence of steps, and src sends the steps to dst in the wrong order
a notify call fails

Migration also has many properties and capabilities, and merely turning one of them on or off can make a migration fail; the monitor commands below show how to inspect them.
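
Before starting a migration it is worth dumping the current settings on both sides from the monitor; a capability or parameter mismatch between src and dst is then easy to spot:

(qemu) info migrate_capabilities
(qemu) info migrate_parameters
(qemu) info migrate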

A real-world case

A virtual machine migration failed, with the following log:

qemuMonitorIORead:610 : Unable to read from monitor: Connection reset by peer
qemuProcessReportLogError:1912 : internal error: qemu unexpectedly closed the monitor: 2019-12-04 06:50:14.598+0000: Domain id=17 is tainted: host-cpu
qemu-kvm: get_pci_config_device: Bad config data: i=0x34 read: 98 device: 40 cmask: ff wmask: 0 w1cmask:0
qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
qemu-kvm: load of migration failed: Invalid argument

Let's look at the code that prints this log.

static const VMStateDescription vmstate_virtio_net = {
    .name = "virtio-net",
    .minimum_version_id = VIRTIO_NET_VM_VERSION,
    .version_id = VIRTIO_NET_VM_VERSION,
    .fields = (VMStateField[]) {
        VMSTATE_VIRTIO_DEVICE,
        VMSTATE_END_OF_LIST()
    },
    .pre_save = virtio_net_pre_save,
};
#define VMSTATE_VIRTIO_DEVICE \
    {                                         \
        .name = "virtio",                     \
        .info = &virtio_vmstate_info,         \
        .flags = VMS_SINGLE,                  \
    }

const VMStateInfo  virtio_vmstate_info = {
    .name = "virtio",
    .get = virtio_device_get,
    .put = virtio_device_put,
};

vmstate_load_state->(field->info->get)/virtio_device_get->virtio_load->load_config/virtio_pci_load_config->pci_device_load->vmstate_load_state->get_pci_config_device

static int get_pci_config_device(QEMUFile *f, void *pv, size_t size,
                                 VMStateField *field)
{
    PCIDevice *s = container_of(pv, PCIDevice, config);
    PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(s);
    uint8_t *config;
    int i;

    assert(size == pci_config_size(s));
    config = g_malloc(size);

    qemu_get_buffer(f, config, size);
    for (i = 0; i < size; ++i) {
        if ((config[i] ^ s->config[i]) &
            s->cmask[i] & ~s->wmask[i] & ~s->w1cmask[i]) {
            error_report("%s: Bad config data: i=0x%x read: %x device: %x "
                         "cmask: %x wmask: %x w1cmask:%x", __func__,
                         i, config[i], s->config[i],
                         s->cmask[i], s->wmask[i], s->w1cmask[i]);
            g_free(config);
            return -EINVAL;
        }
    }
    memcpy(s->config, config, size);
    /* the full function also updates BAR mappings etc. before returning */
    g_free(config);
    return 0;
}

So it is basically certain that the virtio-net-pci config space differs between src and dst (offset 0x34 is the PCI capabilities pointer, so the two devices expose different capability lists). The qemu versions are the same and the command lines are the same, so only the backend can be to blame: the virtio-net backend here is in-kernel vhost, and virtio features have to be negotiated between qemu and kernel vhost, so the mismatch is caused by the two physical machines running different kernel versions.

Summary

There is a lot of page-handling code and I don't fully understand the details yet; I'll write about them once I do.

Open-source projects come with little documentation, and explaining this code to someone is genuinely hard. I'm not good at writing docs, so this is just a quick note to reinforce my own impression of the code; it only scratches the surface. To really understand the code you have to keep grinding away at it, there is no shortcut.
