今天上午,我给一台单实例节点成功挂载了一个NFS备份机,挂载完成之后,尝试给这个单实例节点利用xtrabackup的方式搭建一套主从环境,在搭建的过程中出现了一点儿问题,这里记录下来,以供日后回溯。
首先,对操作的环境进行介绍:
主服务器:
IP:192.168.10.106
端口:4306
内存:8G
硬盘:300G
系统:Centos 6.5
数据量:50G
从服务器:
IP:192.168.10.107
端口:4306
内存:8G
硬盘:300G
系统:Centos 6.5
使用Xtrabackup进行备份的过程还比较顺利,这里简单写出来过程:
1.首先,找到my.cnf文件进行备份。
需要注意的是:备份时候加上--slave-info可以将备份时候的master节点log文件编号和log文件pos写入xtrabackup_binlog_info文件中。
命令如下:
time innobackupex --defaults-file=/data/mysql_4306/my.cnf
-udba_admin -p --host=127.0.0.1 -P4306
--slave-info /data/backup/mysql/
2.应用--apply-log,它的作用是通过回滚未提交的事务及同步已经提交的事务至数据文件使数据文件处于一致性状态。
命令如下:
innobackupex --apply-log /data/backup/mysql/--10_16--/
3.通过--copy-back把备份好的数据复制到新实例的data目录下。
命令如下:
innobackupex --defaults-file=/data/mysql_4306/my.cnf
--copy-back --rsync /data/backup/mysql/--10_16--/
4.最后使用chown命令更改所有文件的属主,然后启动mysql实例。
问题发现及尝试解决
在这个过程中都没有出现明显的问题,实例启动之后,因为主从节点的GTID都是打开的,如下:
mysql--dba_admin(none) ::>>show variables like '%gtid%';
+----------------------------------+-----------+
| Variable_name | Value |
+----------------------------------+-----------+
| binlog_gtid_simple_recovery | ON |
| enforce_gtid_consistency | ON |
| gtid_executed_compression_period | |
| gtid_mode | ON |
| gtid_next | AUTOMATIC |
| gtid_owned | |
| gtid_purged | |
| session_track_gtids | OFF |
+----------------------------------+-----------+
rows in set (0.00 sec)
所以我就没有使用master_log+log_position的方法来搭建主从,转而使用GTID的方法搭建主从,但是却遇到了一个错误:
mysql--dba_admin >>change master to master_host='192.168.10.106',
-> master_user='dba_repl',
-> master_password='replsafe',
-> master_port=,
-> master_auto_position=;
mysql--dba_admin >>start slave;
Query OK, rows affected (0.00 sec)
mysql--dba_admin >>show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.106
Master_User: dba_repl
Master_Port: 4306
Connect_Retry: 60
Master_Log_File: mysqlbin.000145
Read_Master_Log_Pos: 585019798
Relay_Log_File: slave-relay-bin.000002
Relay_Log_Pos: 365
Relay_Master_Log_File: mysqlbin.000145
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1062
Last_Error: Coordinator stopped because there were error(s)
in the worker(s). The most recent failure being: Worker 0 failed executing transaction
'9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557301' at master log mysqlbin.000145,
end_log_pos 2455. See error log and/or performance_schema.replication_applier_status_by_worker
table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos:
Relay_Log_Space: 585020176
Until_Condition: None
Seconds_Behind_Master: NULL
Last_SQL_Error: Coordinator stopped because there were error(s)
in the worker(s). The most recent failure being: Worker 0 failed executing transaction
'9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557301' at master log mysqlbin.000145, end_log_pos 2455.
See error log and/or performance_schema.replication_applier_status_by_worker table for
more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 116
Master_UUID: 9ae1a9f9-6984-11e8-a0a3-005056a80cd7
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 190410 18:54:04
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300-67958355
Executed_Gtid_Set: 9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67557300
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
可以看到,IO线程没有问题,SQL线程出现问题,可以初步判定这个问题是GTID的问题,接着查看相关慢日志,可以看到:
--T10::31.179244Z [Note] Slave SQL thread for channel '' initialized,
starting replication in log 'FIRST' at position ,
relay log '/data/mysql_4306/log/slave-relay-bin.000001' position:
--T10::31.298977Z [ERROR] Slave SQL for channel '': Worker failed
executing transaction '9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300'
at master log mysqlbin.000145, end_log_pos ; Could not execute Write_rows event
on table cmec_log.log; Duplicate entry '60578649' for key 'PRIMARY',
Error_code: ; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST,
end_log_pos 1246, Error_code: 1062
2019-04-10T10:37:31.299032Z 15 [Warning] Slave SQL for channel '':
... The slave coordinator and worker threads are stopped, possibly leaving data
in inconsistent state. A restart should restore consistency automatically,
although using non-transactional storage for data or info tables or DDL queries could
lead to problems. In such cases you have to examine your data
(see documentation for details). Error_code: 1756
2019-04-10T10:37:31.299049Z 15 [Note] Slave SQL thread for channel '' exiting,
replication stopped in log 'mysqlbin.000145' at position 194
从错误中可以看到,出现了duplicate key问题,也就是主键冲突,我们的主要目的就是要解决主键冲突的问题。因为采用GTID的方式,所以我们要去查找目前执行到了master的哪一个position,然后用这个position反推出来它的GTID,从上面的复制信息上可以看到:
Last_Errno:
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 0 failed executing transaction '9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557301' at master log mysqlbin.000145, end_log_pos 2455. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos:
Relay_Log_Space: 585020176
Until_Condition: None
目前错误是卡在了master日志的194这个位置处,然后我们登陆到192.168.10.106上,使用mysqlbinlog来分析它的binlog内容,找到194这个位置:
# at
# :: server id end_log_pos CRC32 0x4b02e8d1 Previous-GTIDs
# ae1a9f9--11e8-a0a3-a80cd7:-
# at
# :: server id end_log_pos CRC32 0xc3399b70 GTID last_committed= sequence_number=
SET @@SESSION.GTID_NEXT= '9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300'/*!*/;
# at
# :: server id end_log_pos CRC32 0x8463964e Query thread_id= exec_time= error_code=
SET TIMESTAMP=/*!*/;
SET @@session.pseudo_thread_id=/*!*/;
SET @@session.foreign_key_checks=, @@session.sql_auto_is_null=, @@session.unique_checks=, @@session.autocommit=/*!*/;
SET @@session.sql_mode=/*!*/;
SET @@session.auto_increment_increment=, @@session.auto_increment_offset=/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=,@@session.collation_connection=,@@session.collation_server=/*!*/;
SET @@session.lc_time_names=/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
可以发现这个位置有一个@@SESSION.GTID_NEXT,将这个GTID_NEXT的值'9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300'拷贝出来,在192.168.10.107上跳过这个全局事务ID,操作如下:
stop slave;
set session gtid_next='9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300';
begin;
commit;
start slave;
之所以使用上面的方法,可以参照3月5日的公众号文章。
上述操作完成后,然后重新启动主从复制,发现问题变成了
mysql--dba_admin>>show slave status\G
*************************** . row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.106
Master_User: dba_repl
Master_Port:
Connect_Retry:
Master_Log_File: mysqlbin.000145
Read_Master_Log_Pos:
Relay_Log_File: slave-relay-bin.000002
Relay_Log_Pos:
Relay_Master_Log_File: mysqlbin.000145
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno:
Last_Error: Coordinator stopped because there were error(s)
in the worker(s). The most recent failure being: Worker 0 failed executing
transaction '9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557301' at master log
mysqlbin.000145, end_log_pos 2455. See error log and/or
performance_schema.replication_applier_status_by_worker table for more details
about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos:
Relay_Log_Space: 585134887
Until_Condition: None
Until_Log_File:
可以发现,已经跳过了194位置的错误,问题重新变成了1277位置处的问题,查看错误日志,和之前的错误日志一样,依旧是duplicated key错误,说明我们这种方法不能从根本上解决问题,所以必须重新审视这个问题。
新的解决思路
查看备份时候--slave-info生成的文件xtrabackup_binlog_info,发现里面保存了进行备份时候的GTID,又查看了当前192.168.10.107上应用的GTID编号,如下:
9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67936001
而刚才pos=194的GTID编号是:
9ae1a9f9-6984-11e8-a0a3-005056a80cd7:67557300
再查看当前的从库的GTID,如下:
mysql--dba_admin@127.0.0.1:(none) ::>>show variables like '%gtid%';
+----------------------------------+-------------------------------------------------+
| Variable_name | Value |
+----------------------------------+-------------------------------------------------+
| binlog_gtid_simple_recovery | ON |
| enforce_gtid_consistency | ON |
| gtid_executed_compression_period | |
| gtid_mode | ON |
| gtid_next | AUTOMATIC |
| gtid_owned | |
| gtid_purged | ae1a9f9--11e8-a0a3-a80cd7:- |
| session_track_gtids | OFF |
+----------------------------------+-------------------------------------------------+
rows in set (0.01 sec)
也就是:
9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67557299
由于这三个GTID的前面部分一样,只看后面部分:
备份文件中的:1-67936001
错误pos194处的:67557300
从库gtid_purge值:1-67557299
到这里,已经基本上能确定问题的原因了:
从库中GTID_PURGED的值太小,导致它会去执行那些已经执行过的事务,所以就会产生duplicated key错误。
关于gtid_executed和gtid_purged参数,下面给出解释:
解决方案:
将当前从库中的gtid_purged值增大至备份文件中的值。
执行过程如下:
mysql--dba_admin>>set global gtid_purged='9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67936001';
ERROR (HY000): @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.
mysql--dba_admin>>select @@GLOBAL.GTID_EXECUTED;
+-------------------------------------------------+
| @@GLOBAL.GTID_EXECUTED |
+-------------------------------------------------+
| ae1a9f9--11e8-a0a3-a80cd7:- |
+-------------------------------------------------+
row in set (0.00 sec)
mysql--dba_admin>>reset master;
Query OK, rows affected (0.00 sec)
mysql--dba_admin>>select @@GLOBAL.GTID_EXECUTED;
+------------------------+
| @@GLOBAL.GTID_EXECUTED |
+------------------------+
| |
+------------------------+
row in set (0.00 sec)
mysql--dba_admin>>show variables like '%gtid%';
+----------------------------------+-----------+
| Variable_name | Value |
+----------------------------------+-----------+
| binlog_gtid_simple_recovery | ON |
| enforce_gtid_consistency | ON |
| gtid_executed_compression_period | |
| gtid_mode | ON |
| gtid_next | AUTOMATIC |
| gtid_owned | |
| gtid_purged | |
| session_track_gtids | OFF |
+----------------------------------+-----------+
rows in set (0.00 sec)
mysql--dba_admin>>set global gtid_purged='9ae1a9f9-6984-11e8-a0a3-005056a80cd7:1-67936001';
Query OK, rows affected (0.00 sec)
mysql--dba_admin>>show variables like '%gtid%';
+----------------------------------+-------------------------------------------------+
| Variable_name | Value |
+----------------------------------+-------------------------------------------------+
| binlog_gtid_simple_recovery | ON |
| enforce_gtid_consistency | ON |
| gtid_executed_compression_period | |
| gtid_mode | ON |
| gtid_next | AUTOMATIC |
| gtid_owned | |
| gtid_purged | ae1a9f9--11e8-a0a3-a80cd7:- |
| session_track_gtids | OFF |
+----------------------------------+-------------------------------------------------+
rows in set (0.00 sec)
可以发现,这个值是不能直接修改的,必须先通过reset master将gtid_purged置为空值,然后才能进行修改,修改完之后,重新进行change master to,如下:
mysql--dba_admin>>change master to master_host='192.168.10.106',
-> master_user='dba_repl',
-> master_password='replsafe',
-> master_port=,
-> master_auto_position=;
Query OK, rows affected, warning (0.12 sec)
再通过show slave status命令查看相关的复制状态,可以看到双Yes的状态了:
mysql--dba_admin@127.0.0.1:(none) ::>>show slave status\G
*************************** . row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.106
Master_User: dba_repl
Master_Port:
Connect_Retry:
Master_Log_File: mysqlbin.000145
Read_Master_Log_Pos:
Relay_Log_File: slave-relay-bin.000002
Relay_Log_Pos:
Relay_Master_Log_File: mysqlbin.000145
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
到这里,这个问题算是解决了,所以大家用xtrabackup进行拷贝的时候,如果用log number+log pos的方法复制,不会出现什么大问题,如果使用gtid的方式,需要提前注意下gtid_purged里面的值,否则可能会有一些问题。