Monday, 22 January 2018

Migrating MySQL to PostgreSQL with minimal downtime

When I started writing pg_chameleon I had in mind a tool making simple to migrate from different dbms engines to PostgreSQL.

The name was inspired by the Culture Club's Karma chameleon song which suggested me the idea of a tool capable to move data between different database technologies.

Then tool's scope changed becoming a replica system rather a migration system but I've kept the name as I think the idea is still valid.

However, during the development of the now discontinued version 1 I needed to migrate a MySQL database rather replicating it. Therefore I added a command to detach the replica, allowing the migration to happen with very minimal downtime.


Prerequisites


pg_chameleon requires python to be at least 3.3; in this example we'll use python 3.6.

The the tables during the initial copy are locked one by one.
So, probably, the best way to setup the MySQL-PostgreSQL replica is to use a cascading MySQL as a source in order to minimise the downtime.

This example use the test database sakila as MySQL source.
The instructions on how to install it are available here.

Setup the virtual environment

Let's start configuring a virtual environment.
In a shell run the following commands.

thedoctor@deadpool:~$ python3 -m venv migration
thedoctor@deadpool:~$ source migration/bin/activate





Then let's upgrade pip as a good measure

(migration) thedoctor@deadpool:~$ pip install pip --upgrade
Requirement already up-to-date: pip in ./migration/lib/python3.6/site-packages


Now let's install pg_chameleon.

(migration) thedoctor@deadpool:~$ pip install pg_chameleon
Collecting pg_chameleon
  Downloading pg_chameleon-2.0.2.tar.gz (58kB)
    100% |████████████████████████████████| 61kB 1.2MB/s
...

...
...
  Running setup.py install for mysql-replication ... done
  Running setup.py install for PyYAML ... done
  Running setup.py install for tabulate ... done
  Running setup.py install for rollbar ... done
  Running setup.py install for pg-chameleon ... done
Successfully installed PyMySQL-0.8.0 PyYAML-3.12 argparse-1.4.0 certifi-2018.1.18 chardet-3.0.4 daemonize-2.4.7 idna-2.6 mysql-replication-0.15 pg-chameleon-2.0.2 psycopg2-2.7.3.2 requests-2.18.4 rollbar-0.13.17 six-1.11.0 tabulate-0.8.2 urllib3-1.22
(migration) thedoctor@deadpool:~$



As last step let's create the configuration directory running the set_configuration_files command.

(migration) thedoctor@deadpool:~$ chameleon set_configuration_files
creating directory /home/thedoctor/.pg_chameleon
creating directory /home/thedoctor/.pg_chameleon/configuration/
creating directory /home/thedoctor/.pg_chameleon/logs/
creating directory /home/thedoctor/.pg_chameleon/pid/
copying configuration  example in /home/thedoctor/.pg_chameleon/configuration//config-example.yml


MySQL setup

Now let's configure the databases for the migration.
The MySQL server's configuration depends on if we are using the master or a cascading replica.

If we are using the master, then the my.cnf file should contain the following rows under the [mysqld] entry.

binlog_format= ROW
log-bin = mysql-bin
server-id = 1
binlog-row-image=FULL

In the case of a cascading replica the only difference is the server-id which should be different from the value 1 and unique within the replication cluster.
A reasonable value could be a number composed by the date and a counter, similarly to the DNS serial numbering (e.g. 201801221).

We also need to set the parameter log-slave-updates = 'on'.
This parameter will tell the slave to log the changes into its local binary log location. The cascading replica configuration will be something like this.

binlog_format= ROW
log-bin = mysql-bin
server-id =201801221
binlog-row-image=FULL
log-slave-updates = On

Appliying these settings requires the database restart .


Database users setup

Now we are ready to setup the users on MySQL and postgres.
Let's give them the same name usr_replica. 

On MySQL the user needs the full access to the replica database  plus the RELOAD, REPLICATION CLIENT and REPLICATION SLAVE on the cluster.

CREATE USER usr_replica ;
SET PASSWORD FOR usr_replica=PASSWORD('replica');
GRANT ALL ON sakila.* TO 'usr_replica';
GRANT RELOAD ON *.* to 'usr_replica';
GRANT REPLICATION CLIENT ON *.* to 'usr_replica';
GRANT REPLICATION SLAVE ON *.* to 'usr_replica';
FLUSH PRIVILEGES;


On PostgreSQL we shall create a normal user and a database which owner is the replication user.

CREATE USER usr_replica WITH PASSWORD 'replica';
CREATE DATABASE db_replica WITH OWNER usr_replica;



 Configuring pg_chameleon


After the configuration directory is created we need to step into ~/.pg_chameleon/confiuration and copy the config-example.yml in a new configuration file.

(migration) thedoctor@deadpool:~$ cd .pg_chameleon/configuration/
(migration) thedoctor@deadpool:~/.pg_chameleon/configuration$ ls
config-example.yml
(migration) thedoctor@deadpool:~/.pg_chameleon/configuration$ cp config-example.yml migration.yml
(migration) thedoctor@deadpool:~/.pg_chameleon/configuration$


The configuration file's details are described here.
In a minimal configuration the parts we need to setup are the PostgreSQL database connection and the mysql source setup.


pg_conn:
    host: "destination_host"
    port: "5432"
    user: "usr_replica"
    password: "replica"
    database: "db_replica"
    charset: "utf8"



The config-example.yml comes with the mysql and pgsql sources.
As we are configuring only the mysql part we'll remove the pgsql entry.

sources:
    mysql:
        db_conn:
            host: "source_host"
            port: "3306"
            user: "usr_replica"
            password: "replica"
            charset: 'utf8'
            connect_timeout: 10
        schema_mappings:
                sakila: sch_sakila
        limit_tables:
        skip_tables:
        grant_select_to:
            - usr_migration
        lock_timeout: "120s"
        my_server_id: 201801222
        replica_batch_size: 10000
        replay_max_rows: 10000
        batch_retention: '1 day'
        copy_max_memory: "300M"
        copy_mode: 'file' 
        out_dir: /tmp
        sleep_loop: 1
        on_error_replay: continue
        type: mysql

The part to pay attention is the schema_mappings key.
In this area is possible to map multiple schemas from mysql to postgresql.
Once the source is registered the schema mapping is stored in the replica catalogue and no longer read from the configuration file.

Any change of schema_mappings made into the configuration file shall be applied to the replica catalogue with the command update_schema_mappings.

Another section that needs configuration is the type_override.
The given example overrides the tinyint(1) into boolean for all the tables.

type_override:
    "tinyint(1)":
        override_to: boolean
        override_tables:
            - "*"


As this is quite the norm on the MySQL databases it could be the need to limit this override to specific tables.
With a boolean override like this not valid values will be skipped during the initial copy and any row with invalid data will cause the replica for the affected table to stop.

If there's no need for type override is possible to disable leaving the type_override: key without entries.

The entry grant_select_to lists the database users or roles which have usage and select privileges over  the replicated schema.

Configuring the replica

Now we are set for the replica so, after activating the virtualenv we shall install the replica catalogue and add the mysql source.

chameleon create_replica_schema
chameleon add_source --config migration --source mysql

If the log_dest is set to file you'll see the output logged in the directory log_dir.

You can check the source status with the command

chameleon show_status --config migration

  Source id  Source name    Type    Status    Consistent    Read lag    Last read    Replay lag    Last replay
-----------  -------------  ------  --------  ------------  ----------  -----------  ------------  -------------
          1  mysql          mysql   ready     Yes           N/A                      N/A

== Schema mappings ==
Origin schema    Destination schema
---------------  --------------------
sakila           sch_sakila

== Replica status ==
---------------------  -
Tables not replicated  0
Tables replicated      0
All tables             0
Replayed rows
Replayed DDL
Skipped rows
---------------------  -


Now we are ready to initialise the replica with the command init_replica.

chameleon init_replica --config migration --source mysql
Init replica process for source mysql started.

When displaying the source status the column name will change to initialising and then to initialised. If the rollbar_key and rollbar_env are set then a message to rollbar will be sent when the process is complete.

To start the replica just run

chameleon start_replica --config migration --source mysql
Starting the replica process for source mysql
chameleon show_status --config default --source mysql
  Source id  Source name    Type    Status    Consistent    Read lag    Last read    Replay lag    Last replay
-----------  -------------  ------  --------  ------------  ----------  -----------  ------------  -------------
          1  mysql          mysql   running   No            N/A                      N/A

== Schema mappings ==
Origin schema    Destination schema
---------------  --------------------
sakila           sch_sakila

== Replica status ==
---------------------  --
Tables not replicated   0
Tables replicated      20
All tables             20
Replayed rows
Replayed DDL
Skipped rows
---------------------  --


Migrating the database

Before migrating the database to PostgreSQL we should wait for the replica to reach consistent status and there are no tables with the replica disabled listed in the show_status output.

After stopping the writes on mysql we should wait that the replay lag drops to zero. This will ensures all the changes have been replayed on PostgreSQL.
Without writes on the MySQL source the read lag will increase as no events will come trough the replica.

We are then ready to detach the replica.
We are using the debug option to get a verbose output on the standard output.
chameleon detach_replica --config default --source mysql --debug
Detaching the replica will remove any reference for the source mysql.
 Are you sure? YES/No

YES
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (608): Changing the autocommit flag to True
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (2598): Collecting schema mappings for source mysql
2018-01-22 12:28:25 MainProcess INFO mysql_lib.py (297): retrieving foreign keys metadata for schemas 'sakila'
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (2598): Collecting schema mappings for source mysql
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.actor_actor_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.address_address_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.category_category_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.city_city_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.country_country_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.customer_customer_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.film_film_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.inventory_inventory_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.language_language_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.payment_payment_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.rental_rental_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.staff_staff_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.store_store_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.test_partition_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.test_partition2_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.test_tin4y_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.test_tiny_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_address_city_0 on table sch_sakila.address
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_city_country_1 on table sch_sakila.city
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_customer_address_2 on table sch_sakila.customer
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_customer_store_3 on table sch_sakila.customer
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_language_4 on table sch_sakila.film
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_language_ori_5 on table sch_sakila.film
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_actor_actor_6 on table sch_sakila.film_actor
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_actor_film_7 on table sch_sakila.film_actor
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_category_cat_8 on table sch_sakila.film_category
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_category_fil_9 on table sch_sakila.film_category
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_inventory_film_10 on table sch_sakila.inventory
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_inventory_store_11 on table sch_sakila.inventory
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_payment_customer_12 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_payment_rental_13 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_payment_staff_14 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_rental_customer_15 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_rental_inventory_16 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_rental_staff_17 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_staff_address_18 on table sch_sakila.staff
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_staff_store_19 on table sch_sakila.staff
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_store_address_20 on table sch_sakila.store
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_store_staff_21 on table sch_sakila.store
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_address_city_0 on table sch_sakila.address
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_city_country_1 on table sch_sakila.city
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_customer_address_2 on table sch_sakila.customer
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_customer_store_3 on table sch_sakila.customer
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_language_4 on table sch_sakila.film
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_language_ori_5 on table sch_sakila.film
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_actor_actor_6 on table sch_sakila.film_actor
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_actor_film_7 on table sch_sakila.film_actor
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_category_cat_8 on table sch_sakila.film_category
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_category_fil_9 on table sch_sakila.film_category
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_inventory_film_10 on table sch_sakila.inventory
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_inventory_store_11 on table sch_sakila.inventory
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_payment_customer_12 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_payment_rental_13 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_payment_staff_14 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_rental_customer_15 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_rental_inventory_16 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_rental_staff_17 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_staff_address_18 on table sch_sakila.staff
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_staff_store_19 on table sch_sakila.staff
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_store_address_20 on table sch_sakila.store
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_store_staff_21 on table sch_sakila.store
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (1787): Deleting the source mysql
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (628): There is already a database connection active.


The procedure resets all the sequences using the max values of the depending fields.
All the foreign keys are generated from the MySQL information schema and created as invalid. A second run validates them. If any error occurs the key is left invalid but at least is enforced.

Finally the source is dropped from the replica catalogue.

Wrap up

With this method is possible to keep the writes on MySQL until the last moment and then switch to PostgreSQL after the detach is finished.

The sequence reset and the foreign key validation are the biggest pain in this process but is very likely the time required is lesser than the time used for dumping the database from MySQL and reloading it into PostgreSQL.

Currently there is no functionality for setting automatically the write permissions on the detached schema.


The foreign keys are created without any clause as I haven't figured out how to get that from the information_schema.

Probably I will add both improvements it in a future minor release.

That's all.
Thank you for reading.