Restarting MySQL master-master replication

February 14, 2010 software mysql replication

Summary

To fix a broken MySQL (5.0+) replication, you have two options: the easy way and the right way. The easy way involves skipping problematic statements using `SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;` but this is usually only a temporary solution. The right way is to rebuild the replication setup, which involves choosing a primary database (Good Server), dumping its data with `mysqldump`, transferring and importing this dump on the problematic server (Rebuilding Server), and then properly configuring replication settings using commands like `stop slave`, `change master`, and `start slave`. Finally, ensure continuous operation of both servers and monitor replication status using tools like Monit to prevent future issues.

If your MySQL (5.0+) replication is broken, there’s two ways to fix it: The easy way, and the right way.

Run commands starting with $ on Unix. Run commands starting with mysql> in the MySQL client.

The easy way: Skip the problem

If you hit both databases at the same time, with the same INSERT, they will create their own record, and try and replicate to the other, which already has that record, causing a duplicate error.

In a simple case like that, you just want to skip the offending statement:

mysql>SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;

More details on skipping MySQL duplicate errors

Most of the time, you skip one statement, and replication breaks again straight away, because there’s a whole queue of problem statements coming up.

The right way: Rebuild

If you are not sure that you can skip the duplicate, or if replication has been broken long enough that your two servers are out of synch, pick one database to be the master, and rebuild the other from a copy of that master.

First make sure your site is only using the master server. Stop any processes that might modify data on the server you need to rebuild.

We have two database servers:

Good Server: The good one, with the correct data.
Rebuilding Server: The one we are fixing. All it’s data will be erased with the Good Server data.

You’ll need a unix command line and SQL command line on both servers. If you don’t already use it, after you’ve got this fixed, I recommend checking out Screen.

1. On the Good Server

Dump data from the Good Server. The master-data switch adds a statement at the end of the file to start replication.

The quick switch makes dumping large tables use a lot less memory, so on a VPS it’s much faster.

All tables will locked during the dump.

Replace ‘my_database’ with your database name.

$ mysqldump --add-drop-table --master-data
    --quick -u root -p my_database > my_database.sql
$ bzip2 my_database.sql

2. On the Rebuilding Server

Copy the dump onto the Rebuilding Server. Replace ‘myuser’, ‘good-server’ and ‘my_database’ as appropriate.

$ scp myuser@good-server:my_database.sql.bz2 .
$ bunzip my_database.sql

Load the dump. This can take a few minutes for a large database.

mysql> stop slave;
$ mysql -u root -p my_database < my_database.sql
mysql> show slave status\G

You should see Slave_IO_Running: Yes and Slave_SQL_Running: Yes. The master-data switch to mysqldump, in step 1, started replication at the right place for us. How nice. I love MySQL.

The \G means show vertical instead of the usual horizontal. It works with any MySQL command.

Now you have statements flowing Good Server –> Rebuilding Server. Next we need to get data going the other way.

mysql> flush tables with read lock;
mysql> show master status;

Make a note of the File and Position rows.

3. On the Good Server

Set the slave here to be in synch with Rebuilding Server. Use the file name and log position from the previous step.

mysql> stop slave;
mysql> change master to master_log_file='mysql-bin.000044',
master_log_pos=132059667;
mysql> start slave;
mysql> show slave status\G

4. On the Rebuilding Server

mysql> unlock tables;

Start the processes that point to the Rebuilding Server’s database. Done!

Monitor replication

The database is done, but you’re not done yet.

If you don’t already have good monitoring, take 15 minutes to setup Monit on all your production servers.

Monit is easy to setup (a world away from the monster that is Nagios), and will save again and again.

Once you have Monit working, take another 15 mins to monitor MySQL replication with Monit.

Now you’re done :-)

Graham King