前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >一个线上GTID搭建主从复制的问题

一个线上GTID搭建主从复制的问题

作者头像
AsiaYe
发布2019-11-06 17:04:09
8650
发布2019-11-06 17:04:09
举报
文章被收录于专栏:DBA随笔DBA随笔
一个线上GTID搭建主从复制的问题

今天上午,我给一台单实例节点成功挂载了一个NFS备份机,挂载完成之后,尝试给这个单实例节点利用xtrabackup的方式搭建一套主从环境,在搭建的过程中出现了一点儿问题,这里记录下来,以供日后回溯。

首先,对操作的环境进行介绍:

主服务器:

IP:192.168.10.106

端口:4306

内存:8G

硬盘:300G

系统:Centos 6.5

数据量:50G

从服务器:

IP:192.168.10.107

端口:4306

内存:8G

硬盘:300G

系统:Centos 6.5

使用Xtrabackup进行备份的过程还比较顺利,这里简单写出来过程:

1.首先,找到my.cnf文件进行备份。

需要注意的是:备份时候加上--slave-info可以将备份时候的master节点log文件编号和log文件pos写入xtrabackup_binlog_info文件中

命令如下:

代码语言:javascript
复制
time innobackupex --defaults-file=/data/mysql_4306/my.cnf
 -udba_admin -p --host=127.0.0.1 -P4306 
--slave-info /data/backup/mysql/

2.应用--apply-log,它的作用是通过回滚未提交的事务及同步已经提交的事务至数据文件使数据文件处于一致性状态。

命令如下:

代码语言:javascript
复制
innobackupex --apply-log /data/backup/mysql/--10_16--/

3.通过--copy-back把备份好的数据复制到新实例的data目录下。

命令如下:

代码语言:javascript
复制
innobackupex --defaults-file=/data/mysql_4306/my.cnf 
 --copy-back --rsync /data/backup/mysql/--10_16--/

4.最后使用chown命令更改所有文件的属主,然后启动mysql实例。

问题发现及尝试解决

在这个过程中都没有出现明显的问题,实例启动之后,因为主从节点的GTID都是打开的,如下:

代码语言:javascript
复制
mysql--dba_admin(none) ::>>show variables like '%gtid%';
+----------------------------------+-----------+
| Variable_name                    | Value     |
+----------------------------------+-----------+
| binlog_gtid_simple_recovery      | ON        |
| enforce_gtid_consistency         | ON        |
| gtid_executed_compression_period |       |
| gtid_mode                        | ON        |
| gtid_next                        | AUTOMATIC |
| gtid_owned                       |           |
| gtid_purged                      |           |
| session_track_gtids              | OFF       |
+----------------------------------+-----------+
 rows in set (0.00 sec)

所以我就没有使用master_log+log_position的方法来搭建主从,转而使用GTID的方法搭建主从,但是却遇到了一个错误:

代码语言:javascript
复制
mysql--dba_admin >>change master to master_host='192.168.10.106',
    -> master_user='dba_repl',
    -> master_password='replsafe',
    -> master_port=,
    -> master_auto_position=;

mysql--dba_admin >>start slave;
Query OK,  rows affected (0.00 sec)

mysql--dba_admin >>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.106
                  Master_User: dba_repl
                  Master_Port: 4306
                Connect_Retry: 60
              Master_Log_File: mysqlbin.000145
          Read_Master_Log_Pos: 585019798
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 365
        Relay_Master_Log_File: mysqlbin.000145
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                   Last_Errno: 1062
                   Last_Error: Coordinator stopped because there were error(s)
 in the worker(s). The most recent failure being: Worker 0 failed executing transaction 
'9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557301' at master log mysqlbin.000145,
 end_log_pos 2455. See error log and/or performance_schema.replication_applier_status_by_worker 
table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 
              Relay_Log_Space: 585020176
              Until_Condition: None 
        Seconds_Behind_Master: NULL
               Last_SQL_Error: Coordinator stopped because there were error(s)
 in the worker(s). The most recent failure being: Worker 0 failed executing transaction 
'9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557301' at master log mysqlbin.000145, end_log_pos 2455.
 See error log and/or performance_schema.replication_applier_status_by_worker table for
 more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 116
                  Master_UUID: 9ae1a9f9-6984-11e8-a0a3-005056a80cd7
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 190410 18:54:04
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300-67958355
            Executed_Gtid_Set: 9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67557300
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

可以看到,IO线程没有问题,SQL线程出现问题,可以初步判定这个问题是GTID的问题,接着查看相关慢日志,可以看到:

代码语言:javascript
复制
--T10::31.179244Z  [Note] Slave SQL thread for channel '' initialized,
 starting replication in log 'FIRST' at position , 
relay log '/data/mysql_4306/log/slave-relay-bin.000001' position: 
--T10::31.298977Z  [ERROR] Slave SQL for channel '': Worker  failed
 executing transaction '9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300' 
at master log mysqlbin.000145, end_log_pos ; Could not execute Write_rows event 
on table cmec_log.log; Duplicate entry '60578649' for key 'PRIMARY',
 Error_code: ; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST,
 end_log_pos 1246, Error_code: 1062
2019-04-10T10:37:31.299032Z 15 [Warning] Slave SQL for channel '': 
... The slave coordinator and worker threads are stopped, possibly leaving data 
in inconsistent state. A restart should restore consistency automatically, 
although using non-transactional storage for data or info tables or DDL queries could
 lead to problems. In such cases you have to examine your data 
(see documentation for details). Error_code: 1756
2019-04-10T10:37:31.299049Z 15 [Note] Slave SQL thread for channel '' exiting,
 replication stopped in log 'mysqlbin.000145' at position 194

从错误中可以看到,出现了duplicate key问题,也就是主键冲突,我们的主要目的就是要解决主键冲突的问题。因为采用GTID的方式,所以我们要去查找目前执行到了master的哪一个position,然后用这个position反推出来它的GTID,从上面的复制信息上可以看到:

代码语言:javascript
复制
                   Last_Errno: 
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 0 failed executing transaction '9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557301' at master log mysqlbin.000145, end_log_pos 2455. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 
              Relay_Log_Space: 585020176
              Until_Condition: None

目前错误是卡在了master日志的194这个位置处,然后我们登陆到192.168.10.106上,使用mysqlbinlog来分析它的binlog内容,找到194这个位置:

代码语言:javascript
复制
# at 
#  :: server id   end_log_pos  CRC32 0x4b02e8d1        Previous-GTIDs
# ae1a9f9--11e8-a0a3-a80cd7:-
# at 
#  :: server id   end_log_pos  CRC32 0xc3399b70        GTID    last_committed=        sequence_number=
SET @@SESSION.GTID_NEXT= '9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300'/*!*/;
# at 
#  :: server id   end_log_pos  CRC32 0x8463964e        Query   thread_id=        exec_time=     error_code=
SET TIMESTAMP=/*!*/;
SET @@session.pseudo_thread_id=/*!*/;
SET @@session.foreign_key_checks=, @@session.sql_auto_is_null=, @@session.unique_checks=, @@session.autocommit=/*!*/;
SET @@session.sql_mode=/*!*/;
SET @@session.auto_increment_increment=, @@session.auto_increment_offset=/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=,@@session.collation_connection=,@@session.collation_server=/*!*/;
SET @@session.lc_time_names=/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN

可以发现这个位置有一个@@SESSION.GTID_NEXT,将这个GTID_NEXT的值'9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300'拷贝出来,在192.168.10.107上跳过这个全局事务ID,操作如下:

代码语言:javascript
复制
stop slave;
set session gtid_next='9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300';
begin;
commit;
start slave;

之所以使用上面的方法,可以参照3月5日的公众号文章。

上述操作完成后,然后重新启动主从复制,发现问题变成了

代码语言:javascript
复制
mysql--dba_admin>>show slave status\G
*************************** . row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.106
                  Master_User: dba_repl
                  Master_Port: 
                Connect_Retry: 
              Master_Log_File: mysqlbin.000145
          Read_Master_Log_Pos: 
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 
        Relay_Master_Log_File: mysqlbin.000145
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                   Last_Errno: 
                   Last_Error: Coordinator stopped because there were error(s)
 in the worker(s). The most recent failure being: Worker 0 failed executing
 transaction '9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557301' at master log 
mysqlbin.000145, end_log_pos 2455. See error log and/or 
performance_schema.replication_applier_status_by_worker table for more details 
about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 
              Relay_Log_Space: 585134887
              Until_Condition: None
               Until_Log_File: 

可以发现,已经跳过了194位置的错误,问题重新变成了1277位置处的问题,查看错误日志,和之前的错误日志一样,依旧是duplicated key错误,说明我们这种方法不能从根本上解决问题,所以必须重新审视这个问题。

新的解决思路

查看备份时候--slave-info生成的文件xtrabackup_binlog_info,发现里面保存了进行备份时候的GTID,又查看了当前192.168.10.107上应用的GTID编号,如下:

9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67936001

而刚才pos=194的GTID编号是:

9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300

再查看当前的从库的GTID,如下:

代码语言:javascript
复制
mysql--dba_admin@127.0.0.1:(none) ::>>show variables like '%gtid%';
+----------------------------------+-------------------------------------------------+
| Variable_name                    | Value                                           |
+----------------------------------+-------------------------------------------------+
| binlog_gtid_simple_recovery      | ON                                              |
| enforce_gtid_consistency         | ON                                              |
| gtid_executed_compression_period |                                             |
| gtid_mode                        | ON                                              |
| gtid_next                        | AUTOMATIC                                       |
| gtid_owned                       |                                                 |
| gtid_purged                      | ae1a9f9--11e8-a0a3-a80cd7:- |
| session_track_gtids              | OFF                                             |
+----------------------------------+-------------------------------------------------+
 rows in set (0.01 sec)

也就是:

9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67557299

由于这三个GTID的前面部分一样,只看后面部分:

备份文件中的:1-67936001

错误pos194处的:67557300

从库gtid_purge值:1-67557299

到这里,已经基本上能确定问题的原因了:

从库中GTID_PURGED的值太小,导致它会去执行那些已经执行过的事务,所以就会产生duplicated key错误。

关于gtid_executed和gtid_purged参数,下面给出解释:

  • gtid_executed变量(show global variables):MySQL数据库已经执行了哪些GTID事务,处于内存中。show slave status中的executed_gtid_set也取自这里。
  • gtid_purged变量(show global variables):由于BINLOG文件的删除(如purge binary logfiles或者超过expire_logs_days设置)已经丢失的GTID事务,同时在搭建备库的我们使用set global gtid_purged变量来提示MySQL哪些GTID事务我已经执行过了。

解决方案:

将当前从库中的gtid_purged值增大至备份文件中的值。

执行过程如下:

代码语言:javascript
复制
mysql--dba_admin>>set global gtid_purged='9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67936001';
ERROR  (HY000): @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.
mysql--dba_admin>>select @@GLOBAL.GTID_EXECUTED;
+-------------------------------------------------+
| @@GLOBAL.GTID_EXECUTED                          |
+-------------------------------------------------+
| ae1a9f9--11e8-a0a3-a80cd7:- |
+-------------------------------------------------+
 row in set (0.00 sec)

mysql--dba_admin>>reset master;
Query OK,  rows affected (0.00 sec)

mysql--dba_admin>>select @@GLOBAL.GTID_EXECUTED;
+------------------------+
| @@GLOBAL.GTID_EXECUTED |
+------------------------+
|                        |
+------------------------+
 row in set (0.00 sec)


mysql--dba_admin>>show variables like '%gtid%';
+----------------------------------+-----------+
| Variable_name                    | Value     |
+----------------------------------+-----------+
| binlog_gtid_simple_recovery      | ON        |
| enforce_gtid_consistency         | ON        |
| gtid_executed_compression_period |       |
| gtid_mode                        | ON        |
| gtid_next                        | AUTOMATIC |
| gtid_owned                       |           |
| gtid_purged                      |           |
| session_track_gtids              | OFF       |
+----------------------------------+-----------+
 rows in set (0.00 sec)

mysql--dba_admin>>set global gtid_purged='9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67936001';
Query OK,  rows affected (0.00 sec)

mysql--dba_admin>>show variables like '%gtid%';
+----------------------------------+-------------------------------------------------+
| Variable_name                    | Value                                           |
+----------------------------------+-------------------------------------------------+
| binlog_gtid_simple_recovery      | ON                                              |
| enforce_gtid_consistency         | ON                                              |
| gtid_executed_compression_period |                                             |
| gtid_mode                        | ON                                              |
| gtid_next                        | AUTOMATIC                                       |
| gtid_owned                       |                                                 |
| gtid_purged                      | ae1a9f9--11e8-a0a3-a80cd7:- |
| session_track_gtids              | OFF                                             |
+----------------------------------+-------------------------------------------------+
 rows in set (0.00 sec)

可以发现,这个值是不能直接修改的,必须先通过reset master将gtid_purged置为空值,然后才能进行修改,修改完之后,重新进行change master to,如下:

代码语言:javascript
复制
mysql--dba_admin>>change master to master_host='192.168.10.106',
    -> master_user='dba_repl',
    -> master_password='replsafe',
    -> master_port=,
    -> master_auto_position=;
Query OK,  rows affected,  warning (0.12 sec)

再通过show slave status命令查看相关的复制状态,可以看到双Yes的状态了:

代码语言:javascript
复制
mysql--dba_admin@127.0.0.1:(none) ::>>show slave status\G
*************************** . row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.106
                  Master_User: dba_repl
                  Master_Port: 
                Connect_Retry: 
              Master_Log_File: mysqlbin.000145
          Read_Master_Log_Pos: 
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 
        Relay_Master_Log_File: mysqlbin.000145
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

到这里,这个问题算是解决了,所以大家用xtrabackup进行拷贝的时候,如果用log number+log pos的方法复制,不会出现什么大问题,如果使用gtid的方式,需要提前注意下gtid_purged里面的值,否则可能会有一些问题。

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2019-04-10,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 DBA随笔 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
相关产品与服务
云数据库 MySQL
腾讯云数据库 MySQL(TencentDB for MySQL)为用户提供安全可靠,性能卓越、易于维护的企业级云数据库服务。其具备6大企业级特性,包括企业级定制内核、企业级高可用、企业级高可靠、企业级安全、企业级扩展以及企业级智能运维。通过使用腾讯云数据库 MySQL,可实现分钟级别的数据库部署、弹性扩展以及全自动化的运维管理,不仅经济实惠,而且稳定可靠,易于运维。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档