
Monday, 22 January 2018

Migrating MySQL to PostgreSQL with minimal downtime

When I started writing pg_chameleon I had in mind a tool that would make it simple to migrate from different DBMS engines to PostgreSQL.

The name was inspired by the Culture Club song Karma Chameleon, which suggested the idea of a tool capable of moving data between different database technologies.

The tool's scope then changed, becoming a replica system rather than a migration system, but I've kept the name as I think the idea is still valid.

However, during the development of the now discontinued version 1, I needed to migrate a MySQL database rather than replicate it. Therefore I added a command to detach the replica, allowing the migration to happen with very minimal downtime.


Prerequisites


pg_chameleon requires Python 3.3 or later; in this example we'll use Python 3.6.

During the initial copy the tables are locked one by one.
So, probably, the best way to set up the MySQL-PostgreSQL replica is to use a cascading MySQL replica as the source, in order to minimise the downtime.

This example uses the test database sakila as the MySQL source.
The instructions on how to install it are available here.
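
If you want to follow along, loading sakila into MySQL boils down to a few commands; a sketch, assuming the archive is still published at the URL below (it was at the time of writing):

wget https://downloads.mysql.com/docs/sakila-db.tar.gz
tar xzf sakila-db.tar.gz
mysql -u root -p < sakila-db/sakila-schema.sql
mysql -u root -p < sakila-db/sakila-data.sql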

Setup the virtual environment

Let's start configuring a virtual environment.
In a shell run the following commands.

thedoctor@deadpool:~$ python3 -m venv migration
thedoctor@deadpool:~$ source migration/bin/activate





Then let's upgrade pip for good measure.

(migration) thedoctor@deadpool:~$ pip install pip --upgrade
Requirement already up-to-date: pip in ./migration/lib/python3.6/site-packages


Now let's install pg_chameleon.

(migration) thedoctor@deadpool:~$ pip install pg_chameleon
Collecting pg_chameleon
  Downloading pg_chameleon-2.0.2.tar.gz (58kB)
    100% |████████████████████████████████| 61kB 1.2MB/s
...

...
...
  Running setup.py install for mysql-replication ... done
  Running setup.py install for PyYAML ... done
  Running setup.py install for tabulate ... done
  Running setup.py install for rollbar ... done
  Running setup.py install for pg-chameleon ... done
Successfully installed PyMySQL-0.8.0 PyYAML-3.12 argparse-1.4.0 certifi-2018.1.18 chardet-3.0.4 daemonize-2.4.7 idna-2.6 mysql-replication-0.15 pg-chameleon-2.0.2 psycopg2-2.7.3.2 requests-2.18.4 rollbar-0.13.17 six-1.11.0 tabulate-0.8.2 urllib3-1.22
(migration) thedoctor@deadpool:~$



As a last step, let's create the configuration directory by running the set_configuration_files command.

(migration) thedoctor@deadpool:~$ chameleon set_configuration_files
creating directory /home/thedoctor/.pg_chameleon
creating directory /home/thedoctor/.pg_chameleon/configuration/
creating directory /home/thedoctor/.pg_chameleon/logs/
creating directory /home/thedoctor/.pg_chameleon/pid/
copying configuration  example in /home/thedoctor/.pg_chameleon/configuration//config-example.yml


MySQL setup

Now let's configure the databases for the migration.
The MySQL server's configuration depends on whether we are using the master or a cascading replica.

If we are using the master, then the my.cnf file should contain the following rows under the [mysqld] entry.

binlog_format= ROW
log-bin = mysql-bin
server-id = 1
binlog-row-image=FULL

In the case of a cascading replica the only difference is the server-id, which should be different from 1 and unique within the replication cluster.
A reasonable value could be a number composed of the date and a counter, similar to the DNS serial numbering (e.g. 201801221).

We also need to set the parameter log-slave-updates = 'on'.
This parameter tells the slave to log the changes it receives into its own binary log. The cascading replica configuration will look something like this.

binlog_format= ROW
log-bin = mysql-bin
server-id =201801221
binlog-row-image=FULL
log-slave-updates = On

Applying these settings requires a database restart.
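
After the restart it's worth double checking that the settings are active from the MySQL console:

SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'log_slave_updates';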


Database users setup

Now we are ready to set up the users on MySQL and PostgreSQL.
Let's give both the same name, usr_replica.

On MySQL the user needs full access to the database to replicate, plus the RELOAD, REPLICATION CLIENT and REPLICATION SLAVE privileges on the cluster.

CREATE USER usr_replica ;
SET PASSWORD FOR usr_replica=PASSWORD('replica');
GRANT ALL ON sakila.* TO 'usr_replica';
GRANT RELOAD ON *.* to 'usr_replica';
GRANT REPLICATION CLIENT ON *.* to 'usr_replica';
GRANT REPLICATION SLAVE ON *.* to 'usr_replica';
FLUSH PRIVILEGES;


On PostgreSQL we shall create a normal user and a database owned by the replication user.

CREATE USER usr_replica WITH PASSWORD 'replica';
CREATE DATABASE db_replica WITH OWNER usr_replica;
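
A quick connection test from the shell confirms the user can reach the new database (destination_host is a placeholder, as in the configuration shown later):

psql -h destination_host -U usr_replica -d db_replica -c 'SELECT current_user, current_database();'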



Configuring pg_chameleon


After the configuration directory is created we need to step into ~/.pg_chameleon/configuration and copy config-example.yml to a new configuration file.

(migration) thedoctor@deadpool:~$ cd .pg_chameleon/configuration/
(migration) thedoctor@deadpool:~/.pg_chameleon/configuration$ ls
config-example.yml
(migration) thedoctor@deadpool:~/.pg_chameleon/configuration$ cp config-example.yml migration.yml
(migration) thedoctor@deadpool:~/.pg_chameleon/configuration$


The configuration file's details are described here.
In a minimal configuration the parts we need to set up are the PostgreSQL connection and the mysql source.


pg_conn:
    host: "destination_host"
    port: "5432"
    user: "usr_replica"
    password: "replica"
    database: "db_replica"
    charset: "utf8"



The config-example.yml comes with the mysql and pgsql sources.
As we are configuring only the mysql part we'll remove the pgsql entry.

sources:
    mysql:
        db_conn:
            host: "source_host"
            port: "3306"
            user: "usr_replica"
            password: "replica"
            charset: 'utf8'
            connect_timeout: 10
        schema_mappings:
                sakila: sch_sakila
        limit_tables:
        skip_tables:
        grant_select_to:
            - usr_migration
        lock_timeout: "120s"
        my_server_id: 201801222
        replica_batch_size: 10000
        replay_max_rows: 10000
        batch_retention: '1 day'
        copy_max_memory: "300M"
        copy_mode: 'file' 
        out_dir: /tmp
        sleep_loop: 1
        on_error_replay: continue
        type: mysql

The part to pay attention to is the schema_mappings key.
In this area it is possible to map multiple schemas from mysql to postgresql.
Once the source is registered, the schema mapping is stored in the replica catalogue and is no longer read from the configuration file.

Any change to schema_mappings made in the configuration file must be applied to the replica catalogue with the command update_schema_mappings.
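
With our setup the command follows the same pattern as the other commands (check the command line reference for the details):

chameleon update_schema_mappings --config migration --source mysql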

Another section that needs configuration is type_override.
The given example overrides tinyint(1) to boolean for all the tables.

type_override:
    "tinyint(1)":
        override_to: boolean
        override_tables:
            - "*"


As tinyint(1) is quite often used this way on MySQL databases, there may be the need to limit this override to specific tables.
With a boolean override like this, invalid values will be skipped during the initial copy, and any row with invalid data will cause the replica for the affected table to stop.
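
For example, the following MySQL snippet is perfectly legal but generates a value which can't be replayed on a boolean column (t_flags is a made-up table):

CREATE TABLE t_flags
(
  id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  is_active TINYINT(1) NOT NULL,
  PRIMARY KEY (id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
insert into t_flags (is_active) values (1); -- replays fine as true
insert into t_flags (is_active) values (3); -- legal in MySQL, breaks the boolean replay on PostgreSQL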

If there's no need for a type override, it is possible to disable it by leaving the type_override: key without entries.

The entry grant_select_to lists the database users or roles which get usage and select privileges on the replicated schema.

Configuring the replica

Now we are set for the replica. So, after activating the virtualenv, we shall install the replica catalogue and add the mysql source.

chameleon create_replica_schema
chameleon add_source --config migration --source mysql

If log_dest is set to file, you'll find the output logged in the log_dir directory.

You can check the source status with the command

chameleon show_status --config migration

  Source id  Source name    Type    Status    Consistent    Read lag    Last read    Replay lag    Last replay
-----------  -------------  ------  --------  ------------  ----------  -----------  ------------  -------------
          1  mysql          mysql   ready     Yes           N/A                      N/A

== Schema mappings ==
Origin schema    Destination schema
---------------  --------------------
sakila           sch_sakila

== Replica status ==
---------------------  -
Tables not replicated  0
Tables replicated      0
All tables             0
Replayed rows
Replayed DDL
Skipped rows
---------------------  -


Now we are ready to initialise the replica with the command init_replica.

chameleon init_replica --config migration --source mysql
Init replica process for source mysql started.

When displaying the source status, the Status column will change to initialising and then to initialised. If rollbar_key and rollbar_env are set, a message will be sent to rollbar when the process is complete.

To start the replica just run

chameleon start_replica --config migration --source mysql
Starting the replica process for source mysql
chameleon show_status --config migration --source mysql
  Source id  Source name    Type    Status    Consistent    Read lag    Last read    Replay lag    Last replay
-----------  -------------  ------  --------  ------------  ----------  -----------  ------------  -------------
          1  mysql          mysql   running   No            N/A                      N/A

== Schema mappings ==
Origin schema    Destination schema
---------------  --------------------
sakila           sch_sakila

== Replica status ==
---------------------  --
Tables not replicated   0
Tables replicated      20
All tables             20
Replayed rows
Replayed DDL
Skipped rows
---------------------  --


Migrating the database

Before migrating the database to PostgreSQL we should wait for the replica to reach a consistent status and check that no tables with the replica disabled are listed in the show_status output.

After stopping the writes on MySQL we should wait for the replay lag to drop to zero. This ensures all the changes have been replayed on PostgreSQL.
Without writes on the MySQL source the read lag will increase, as no events will come through the replica.
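
A simple way to monitor the lag is to poll the status until the replay lag reaches zero, for example:

watch -n 10 chameleon show_status --config migration --source mysql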

We are then ready to detach the replica.
We are using the debug option to get verbose output on the standard output.

chameleon detach_replica --config migration --source mysql --debug
Detaching the replica will remove any reference for the source mysql.
 Are you sure? YES/No

YES
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (608): Changing the autocommit flag to True
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (2598): Collecting schema mappings for source mysql
2018-01-22 12:28:25 MainProcess INFO mysql_lib.py (297): retrieving foreign keys metadata for schemas 'sakila'
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (2598): Collecting schema mappings for source mysql
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.actor_actor_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.address_address_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.category_category_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.city_city_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.country_country_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.customer_customer_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.film_film_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.inventory_inventory_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.language_language_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.payment_payment_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.rental_rental_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.staff_staff_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.store_store_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.test_partition_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.test_partition2_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.test_tin4y_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (711): resetting the sequence  sch_sakila.test_tiny_id_seq
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_address_city_0 on table sch_sakila.address
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_city_country_1 on table sch_sakila.city
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_customer_address_2 on table sch_sakila.customer
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_customer_store_3 on table sch_sakila.customer
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_language_4 on table sch_sakila.film
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_language_ori_5 on table sch_sakila.film
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_actor_actor_6 on table sch_sakila.film_actor
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_actor_film_7 on table sch_sakila.film_actor
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_category_cat_8 on table sch_sakila.film_category
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_film_category_fil_9 on table sch_sakila.film_category
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_inventory_film_10 on table sch_sakila.inventory
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_inventory_store_11 on table sch_sakila.inventory
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_payment_customer_12 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_payment_rental_13 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_payment_staff_14 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_rental_customer_15 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_rental_inventory_16 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_rental_staff_17 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_staff_address_18 on table sch_sakila.staff
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_staff_store_19 on table sch_sakila.staff
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_store_address_20 on table sch_sakila.store
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (740): creating invalid foreign key fk_store_staff_21 on table sch_sakila.store
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_address_city_0 on table sch_sakila.address
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_city_country_1 on table sch_sakila.city
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_customer_address_2 on table sch_sakila.customer
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_customer_store_3 on table sch_sakila.customer
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_language_4 on table sch_sakila.film
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_language_ori_5 on table sch_sakila.film
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_actor_actor_6 on table sch_sakila.film_actor
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_actor_film_7 on table sch_sakila.film_actor
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_category_cat_8 on table sch_sakila.film_category
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_film_category_fil_9 on table sch_sakila.film_category
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_inventory_film_10 on table sch_sakila.inventory
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_inventory_store_11 on table sch_sakila.inventory
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_payment_customer_12 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_payment_rental_13 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_payment_staff_14 on table sch_sakila.payment
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_rental_customer_15 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_rental_inventory_16 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_rental_staff_17 on table sch_sakila.rental
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_staff_address_18 on table sch_sakila.staff
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_staff_store_19 on table sch_sakila.staff
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_store_address_20 on table sch_sakila.store
2018-01-22 12:28:25 MainProcess INFO pg_lib.py (750): validating fk_store_staff_21 on table sch_sakila.store
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (1787): Deleting the source mysql
2018-01-22 12:28:25 MainProcess DEBUG pg_lib.py (628): There is already a database connection active.


The procedure resets all the sequences using the max values of the corresponding fields.
All the foreign keys are generated from the MySQL information schema and created as invalid. A second pass validates them. If any error occurs the key is left invalid, but it is at least enforced for new data.

Finally the source is dropped from the replica catalogue.

Wrap up

With this method it is possible to keep the writes on MySQL until the last moment and then switch to PostgreSQL after the detach has finished.

The sequence reset and the foreign key validation are the biggest pain in this process, but the time required is very likely less than the time needed to dump the database from MySQL and reload it into PostgreSQL.

Currently there is no functionality for automatically setting the write permissions on the detached schema.
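
Therefore, after the detach, the grants for the application users must be set manually; a minimal sketch, assuming a hypothetical application user usr_app:

GRANT USAGE ON SCHEMA sch_sakila TO usr_app;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA sch_sakila TO usr_app;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA sch_sakila TO usr_app;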


The foreign keys are created without any ON UPDATE or ON DELETE clauses, as I haven't figured out how to get those from the information_schema.

I will probably add both improvements in a future minor release.

That's all.
Thank you for reading.

Monday, 20 November 2017

pg_chameleon v2.0.0.alpha2 is out

It has been a while since I wrote a post on this blog.
I've been quite busy coding pg_chameleon's version 2.
I finally managed to release alpha1 on the 11th of November, but that release had several issues which prevented users from running real tests.
However, after a week of debugging, I've released pg_chameleon v2.0.0.alpha2, which is now quite usable.


For a complete command line reference and a usage example click here.

The documentation now has a dedicated page explaining the configuration file's details.

Please note this is a pre-release and, despite the changelog showing many fixes, there is still a lot of work to do.
Do not use it in production.

So, why pg_chameleon v2, and why can't version 1 be fixed?

Well, pg_chameleon started as an experiment, and version 1 was built without any clear idea of the architecture or of what the final result should be.

This left version 1 with several issues, making it very difficult to improve without dismantling the existing logic.

Building up from a fresh start required less time than reorganising the original in the correct way.
The code is now more modular, but still not very pythonic.
My bad, I'm not a developer after all.

The major changes from version 1 are the following.

  • Python 3 only development
  • Supports the replica from multiple MySQL schemas within a single MySQL instance. The source and target schema names can be different.
  • Conservative approach to the replica. Tables which generate errors are automatically excluded from the replica.
  • Daemonised init_replica process.
  • Daemonised replica process with two separated subprocess, one for the read and one for the replay.
  • Soft replica initialisation. The tables are locked when needed and stored with their log coordinates. The replica daemons will put the database in a consistent status gradually.
  • Rollbar integration for a simpler error detection.

Version 2 improves the display of the show_status command when a source name is specified.

There is also an error log in the replica schema which saves the exception's data during the replay phase.

The error log can be queried with the new command show_errors.

To install version 2 please run:

python3 -m venv venv
source venv/bin/activate
pip install pip --upgrade 
pip install pg_chameleon==2.0.0a2


Release notes: https://github.com/the4thdoctor/pg_chameleon/releases/tag/v2.0.0.alpha2
Documentation: http://www.pgchameleon.org/documents_v2/

Please report any issue on github.



Wednesday, 23 August 2017

Happy birthday pg_chameleon

Today marks one year since I started working seriously on pg_chameleon.
With this commit I changed the project's license to the 2-clause BSD and the project's scope, evolving it into a MySQL to PostgreSQL replica system.

Initially this change was just a try. I needed to synchronise the data between MySQL and PostgreSQL, and at that time the only option I had was to use the MySQL foreign data wrapper, possibly copying the data locally every now and then. However, because the previous implementation relied on a MySQL replica, this approach wasn't really feasible.

If you are curious about the background story and how we scaled the analytics database at Transferwise, you can read it here.

I developed pg_chameleon in my spare time. I like to think of it as my little commute project.

The first test on large datasets happened during the amazing days of the pgconf eu 2016. I remember how incredibly slow the process was, taking an unacceptable amount of time: four days to copy a 600GB database. I found the bottlenecks during the nights between the conference days and built a faster implementation.

I also had to cope with the SQL dialect conversion. The solution is still in progress.

Initially I decided to use an existing library, but after a few failures I realised that sqlparse didn't fit my needs.
So I took the occasion to learn how to use regular expressions, doubling my problems at the same time.

In May I presented the project at the Estonian PostgreSQL User Group and the video is available here.

Currently the project is at version 1.6, which improves the replay speed and comes with a better status view showing the replay lag alongside the read lag.

The upcoming release 1.7 will add an optional threaded mode for the replica, where the read and replay processes will run independently.

This version will also add support for the type override during the schema init and the DDL replay. This change will make it simpler to use pg_chameleon as a migration tool (e.g. converting tinyint(1) into boolean).

However, the current replay implementation can result in a broken replica if incompatible data is pushed into the fields (e.g. inserting a value >1 into a tinyint(1) will throw a type error on postgres if the column is boolean). I'm working on a solution.

I've also started the development of version 2 but I haven't yet seriously kicked off the coding. The reason is that I'm still learning a lot of things thanks to the feedback I'm getting via github.
I will start version 2 soon, and hopefully I will release the first alpha by the beginning of next year.

However I'm very happy to see pg_chameleon gaining popularity.

If my work helps even just one person to move from MySQL to PostgreSQL, I'll feel satisfied.

So, happy birthday to pg_chameleon, my little pet project.

Monday, 19 June 2017

pg_chameleon 1.3 out

Bugfix emergency release v1.3.1

I discovered a regression when running init_replica, caused by wrong handling of a missing time in the master's coordinates. Sorry about that.

 


After another round of bug fixes I've released version 1.3 of my pet project pg_chameleon.

The package is available on pypi as usual.

The changelog is available here.

If you have any questions or issues to discuss, I've created a community on gitter.

Please join!

I've also added a RELEASE_NOTES file to explain the changes. As this release changes the relay storage, if you plan to upgrade please have a look at that file.

The initial implementation for the relay data was to have two log tables t_log_replica_1 and t_log_replica_2 with the replica process accessing one table at a time.

This approach allows autovacuum to take care of the unused partition while the other is written.

The method worked fine with only one replica worker. However, as the flip-flop between the tables is calculated independently for each source, this could cause unwanted bloat on the log tables if several sources are replicating together.

In this scenario autovacuum will struggle to truncate the empty space at the table's end.

pg_chameleon version 1.3 implements per-source log tables. Each source has a dedicated pair of tables, still inherited from the root partition t_log_replica.


The schema upgrade happens automatically at the first run.
Because this one involves a data copy it could take more time than usual. If the process seems frozen, do not stop it, otherwise you may lose your replica setup.

These are the upgrade steps you should follow; a consolidated sketch follows the list.
  • Stop all the replica sources. The show_status command must show all the rows in stopped status
  • Take a backup of the schema sch_chameleon with pg_dump
  • Upgrade pg_chameleon with pip install pg_chameleon --upgrade
  • Run chameleon.py upgrade_schema --config <your_config> --debug
  • When the upgrade is finished start the replica process as usual
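
Put together, the procedure could look like this; a sketch, where the pg_dump connection options and the configuration name depend on your setup:

# stop all the replica processes first, then backup the replica catalogue
pg_dump -h <your_host> -U <your_user> -d <your_database> -n sch_chameleon -Fc -f sch_chameleon.dump
pip install pg_chameleon --upgrade
chameleon.py upgrade_schema --config <your_config> --debug
# when the upgrade is finished, start the replica process as usual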
If something goes wrong in the upgrade process you should restore the sch_chameleon backup, then downgrade the installation to pg_chameleon 1.2 and start the replica as usual.

Saturday, 17 June 2017

Amazing pgug Estonia

On the 30th of May I had the honour of talking at the second Estonian PostgreSQL User Group meeting.

My talk was about my pet project pg_chameleon and it went very well.

The audience of about 40 people was highly interested in the topic, and the questions came from all sides.

I'm impressed by the Estonian PostgreSQL User Group.
Their organisational skills are amazing, the members' interest is very high, and they are fantastic friends.

If you are curious about the talk there is the recording on EEPUG's youtube channel.



Saturday, 13 May 2017

pg_chameleon v1.1 out

Last week I announced the stable release of pg_chameleon as I felt quite confident the tool worked properly.

However, Murphy's law is always ready to teach us we live in an interesting universe. A couple of days after the release I noticed a strange replica issue on a server whose data was seldom modified. For some reason, at specific moments of the day, the inserts replayed on the postgres tables failed with a primary key violation. It took me a while to reproduce the bug, as this is an interesting corner case. The following sequence of statements on MySQL causes the replica to break down on version 1.0.

DROP TABLE IF EXISTS test;
CREATE TABLE test(
  id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value1 VARCHAR(45) NOT NULL,
  PRIMARY KEY  (id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TEMPORARY TABLE tmp_test(
  id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value1 VARCHAR(45) NOT NULL,
  PRIMARY KEY  (id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
insert into tmp_test (value1) values('blah'),('blah');
insert into test (value1) values('blah');
DROP TEMPORARY TABLE if exists tmp_test ;

This is caused by the batch not being marked as closed when the query is received. The batch was closed only when the query was parsed. The CREATE TEMPORARY TABLE statements are not parsed (by design) and therefore the batch was not closed correctly. The subsequent row images generated by the inserts were read at each loop, causing a replay issue because of the primary key violation.

I also took the occasion to complete the docstrings on the last library, sql_util.py. Now the classes are fully documented.

The release 1.1 is already available on pypi.

I've also created  a google group for help and general discussions.

I'd really love to hear the users' stories.

Sunday, 16 April 2017

pg_chameleon 1.0 RC1 available now


In 2013 I started my pet project pg_chameleon to vent my frustration, only to find the entire thing even more frustrating.

It's fascinating how time and circumstances can change one's perspective, revamping a passion for development that never left me.

This philosophical introduction is to say that I still can't believe I've released the RC1 of my pet project pg_chameleon.

The release is already available on pypi and includes several bug fixes and performance improvements. In particular, pulling the row changes from MySQL is now 10 times faster, because I'm skipping the implicit BEGIN issued by MySQL when generating a row image.

The display of sources and status is now nicer with the tabulate library.

I've also changed the logging file handler to use the timed rotation, which automatically rotates the replica logs on a daily basis. The log retention is configurable using the parameter log_days_keep.

There is also a new command, detach_replica, which creates a standalone PostgreSQL database with the sequences correctly set and the foreign keys created as invalid. The detach process then tries to validate the foreign keys in a second pass.

Pypi: https://pypi.python.org/pypi/pg_chameleon/
Documentation: http://pythonhosted.org/pg_chameleon/
Github: https://github.com/the4thdoctor/pg_chameleon

From this release on I won't add new features; I'll focus only on bug fixes.
 

Changelog from v1.0 beta 2.

  • add support for primary key defined as column constraint
  • fix regression if null constraint is omitted in a primary key column
  • add foreign key generation to detach replica. keys are added invalid and a validation attempt is performed.
  • add support for truncate table
  • add parameter out_dir to set the output destination for the csv files during init_replica
  • add set tables_only to table_limit when streaming the mysql replica
  • force a close batch on rotate event if binlog changes without any row events collected
  • fixed replica regression with python 3.x and empty binary data
  • added event_update in hexlify strategy
  • add tabulate for nice display for sources/status
  • logs are rotated on a daily basis
  • removed parameter log_append
  • add parameter log_days_keep to specify how many days to keep the logs
  • feature freeze

Monday, 20 March 2017

pg_chameleon beta 1 out

After fixing a few bugs I decided to release the first pg_chameleon beta.

The package details with the changelog are available here http://www.pgdba.co.uk/p/mysql-to-postgresql-replica.html

These are the notable changes.

  • Switch to psycopg2 2.7.x - Thanks to Daniele's work the requirement now installs without needing the Python or PostgreSQL headers, making the install in a virtualenv much simpler.
  • Installing system-wide or in a virtualenv follows the same rules. The configuration files and the package files are installed in python's site-packages/pg_chameleon.
  • Configuration directory created at first run. If not present, the $HOME/.pg_chameleon/ directory is created when chameleon.py is executed for the first time. The script creates the directory and the required config dir with the config-example.yaml file.
  • The write_batch function now uses copy_expert in order to speed up the batch load. The fallback to inserts is still present.

The release is already available on pypi https://pypi.python.org/pypi/pg_chameleon/

The documentation is available here http://pythonhosted.org/pg_chameleon/

Please report any issue on the project's github page https://github.com/the4thdoctor/pg_chameleon

However, if you'd like to get in touch you can ping me on twitter @4thdoctor_scarf or, if you prefer to chat, on irc there is a dedicated channel irc.freenode.net #pgchameleon (you can find me logged in at night).

Thursday, 2 March 2017

pg_chameleon alpha4 available on pypi

Over the last weekend I finally worked through Python's setuptools and released alpha4 of pg_chameleon 1.0.
This is the final alpha version before version 1.0.

The system can now be installed in a virtualenv just by typing
 pip install pg_chameleon

It's important that pip is up to date, so if there's any error upgrade pip with

pip install pip --upgrade

The first install creates the directory .pg_chameleon in the user's home, with 4 subdirectories.

  • config is where the configuration files are stored. Use config-example.yaml as a template for the other configuration files. If you already have a config file from a previous test you can copy it there and the system should work. You will need to update the paths of the logs and pid directories: the previous values with relative paths will no longer work. You can either use an absolute path or provide the home alias ~. Check config-example.yaml for the latter configuration.
  • pid is where the replica pid file is created. It can be changed in the configuration file.
  • logs is where the replication logs are created. It can be changed in the configuration file.
  • sql stores the sql service files and upgrade files; you can ignore it.

It's also possible to install pg_chameleon system-wide (if you have root privileges). In that case the user's directory is not created. An example structure is placed in /usr/local/etc/pg_chameleon.

You should copy that directory into the user's home dir (with the . prefix).
Using the general directory directly is not a good idea for security reasons.

The wrapper script is installed as chameleon.py.

Usage and examples are in the documentation.

pypi: https://pypi.python.org/pypi/pg_chameleon/1.0a4
github: https://github.com/the4thdoctor/pg_chameleon
documentation: http://pg-chameleon.readthedocs.io/en/v1.0-alpha.4/

Please test!

Thank you.

Friday, 3 February 2017

Slides from the fosdem pgday 2017

Thanks everybody for listening. And massive thanks to the PostgreSQL Europe staff for organising an amazing conference.

The slides from my talk are available for download here.

The open source replica system is on github https://github.com/the4thdoctor/pg_chameleon and hopefully Transferwise will also release the customised obfuscator soon.

 


Sunday, 29 January 2017

See you in Brussels

Next week I will present at the Fosdem PGDay 2017.
This is the first time I travel to Brussels, the first time I visit the Fosdem, and the presentation is almost brand new. I gave the same presentation at the last Brighton PostgreSQL meetup and this helped me to tailor the slides in a better way.

During the talk I will explain the first project I was assigned to after joining Transferwise in July 2016, where I helped the analyst move a struggling MySQL database, used for business intelligence, to a faster PostgreSQL 9.5 database.

I can't wait to meet the PostgreSQL friends and visit the Fosdem.

So, see you in Brussels!


Monday, 28 November 2016

pg chameleon 1.0 alpha1

As mentioned in the presentation I gave at the Brighton PostgreSQL Meetup, pg chameleon is a commute project.
In general I work on it when I travel, and thanks to the pgconf and other requirements I have travelled a lot recently.

This is the presentation's recording (sorry the audio is suboptimal).


So I was able to build a minimum viable product which seems to work properly.

The alpha1 release is available for download and testing here

https://github.com/the4thdoctor/pg_chameleon/releases/tag/1.0-alpha.1

The system comes with the following limitations.

Installation in virtualenv

To work properly, the tool should be installed in a virtualenv, with the requirements installed via pip. I'm planning to make the package available via pip once it reaches the RC level.

No daemon yet

The script should be executed in a screen session to keep it running. Currently there's no respawning of the process on failure, nor a failure detector.

psycopg2 requires python and postgresql dev files

psycopg2's pip installation requires the Python development files and the PostgreSQL development headers.
Please refer to your distribution to fulfil those requirements.

DDL replica limitations

DDL and DML mixed in the same transaction are not decoded in the right order. This can result in a replica breakage, caused by a wrong jsonb descriptor, if the DML changes the data on the same table modified by the DDL. I know about the issue and I'm working on a solution.
Test please!
Please submit the issues you find.
Bear in mind this is an alpha release. If you use the software in production, keep an eye on the process to ensure the data is correctly replicated.

Saturday, 12 November 2016

presenting pg chameleon at the brighton postgresql meetup

Back from Estonia after the amazing European PGConf 2016 I'm about to give a talk on the MySQL to PostgreSQL replication project I'm developing.

But first things first. I want to thank all the organisers of the PGConf Europe. The talks were super interesting, the location absolutely stunning, and the old and new friends I met reminded me how amazing the PostgreSQL community is. Unfortunately I missed the social events (did somebody say a DBA can have a social life? :P).


The meetup will be in Brighton, UK, on Friday the 18th of November. There will be a chat starting at 19.00, with pizza and beer offered by our sponsor.
The talk will start around 19.30 and will be streamed on youtube here.

Here's the talk description.

pg_chameleon is a lightweight replication system written in python. The tool connects to the mysql replication protocol and replicates the data in PostgreSQL.
The tool's author will talk about the history, the logic behind the functions available, and will give an interactive usage example.

If you are in Brighton and want to join us please RSVP here.

Saturday, 22 October 2016

If you close your eyes you can smell Estonia

It's almost time for the pgconf 2016. This year the conference is in Tallinn, the capital of Estonia. The conference is packed with super interesting talks which I really look forward to hearing.

I'll also present the talk life on a rollercoaster, which tells the story of the last 4 years of a DBA (just guess who he is :P) dealing with large PostgreSQL installations.

I decided to give the talk a narrative and, hopefully, entertaining form, in order to avoid boring the audience to death.

I will also submit a lightning talk for the pg_chameleon tool I'm developing.

I'm pretty excited to meet old friends from the PostgreSQL community.

Well, that's all for now.

See you in Tallinn very soon :)

Wednesday, 5 October 2016

An evening with Psycopg2

The next meetup at Brighton will be Friday 14th October at 19.30.

Daniele Varrazzo will kindly talk about the most popular PostgreSQL driver in the python universe.
Psycopg is the most used PostgreSQL driver for the Python programming language.

As a driver it stays at a relatively low level, allowing the use of all the features offered by the database. Behind the scenes, it does its best to convert the rich Python data model into the likewise rich PostgreSQL data model:
we will take a look at the adaptation mechanism and how to customize it to add new data types or change the default behaviour.

We will also take a look at how psycopg allows Python to tap into the PostgreSQL publish/subscribe mechanism and how this can be used to build reactive and distributed applications.
The presentation's live stream will be available at this link

http://www.youtube.com/channel/UCgr5Iv9eJYN2Piy9oYkn6jw/live

Please RSVP here.

See you in Brighton!

Tuesday, 20 September 2016

Mighty morphin power elephant

Back in 2013 I started playing with sqlalchemy to create a simple extractor from heterogeneous systems, to be pushed into postgresql.
I decided to give the project a name which recalled the transformation, and I called it pg_chameleon.

To be honest, I didn't like sqlalchemy. Like any other ORM it adds an interface to the data layer, with its own mental approach to the data itself. I lost interest in developing a migrator very soon; after all, there are thousands of similar tools, thousands of times better than mine (e.g. the awesome pgloader).

However, I recently revamped the project after discovering a python library capable of reading the mysql replication protocol. In a few weeks I cleaned out all the sqlalchemy stuff, rebuilt the metadata extraction using the information_schema, and finally I had a usable tool to replicate the data across the two systems.

I've also changed the license from GPL to the 2 clause BSD.

The tool requires testing. I'm absolutely sure it's full of bugs and issues, but it seems to work quite nicely.

Some key aspects:

  • It's developed in Python 2.7. Bear with me, I'll build a port to Python 3.x when/if the project gets to an initial release.
  • I use tabs (4 space tabs). Bear with me again. I tried to use spaces and I almost threw my laptop out of the window.
  • setup.py is not working. I'll fix this as soon as I do a release.
  • Yes, the sql part uses the "hungarian notation" and the keywords are uppercase, with strange indentation on the statements.
  • The DDL are not yet replicated. I'm thinking about a clever approach to the problem.

That's it. If you want to test it please do and try to break the tool :)

The tool is on github here: https://github.com/the4thdoctor/pg_chameleon/

Friday, 2 September 2016

News from the outer ring

After the summer break the Brighton PostgreSQL meetup restarts with the monthly technical talks.

This time is my round again. I'll speak on how to scale the backup and recovery on large postgres installations.


Actually this is the talk I've submitted to the European pgconf.
I made the talk in a storytelling form in order to avoid boring the audience to death. The talk should be quite entertaining, with explanations of the issues solved by the DBA over the years.

As google is removing the hangouts on air, I'm using youtube live and OBS to stream this event. It's the first time I've tried this and I cannot guarantee it will work.

I'll record the presentation just in case the stream breaks.

Event details

PostgreSQL - backup and recovery with large databases

 

Friday 9th September 19.00 London Time Zone

Location: Brandwatch - 1st Floor Sovereign House, Church St, Brighton, East Sussex BN1 1UJ, Brighton

Description:
Dealing with large databases is always a challenge.
The backups and the HA procedures must evolve as the database installation grows over time.
The talk will cover the problems solved by the DBA in four years of working with large databases, whose size increased from a 1.7 TB single cluster up to 40 TB in a multi-shard environment.
The talk will cover both the disaster recovery with pg_dump and the high availability with log shipping/streaming replication.
The presentation is based on a real story. The names are changed in order to protect the innocents.

RSVP here

Live stream (hopefully) here

Monday, 6 June 2016

Another evening with PostgreSQL

Thanks to the University of Ferrara's Department of Engineering, I'll attend another evening with PostgreSQL.

This time the event will be a seminar, with all the talks given by me.

The first talk will cover our beloved elephant's history.

The second talk will look inside the engine, with particular attention to MVCC and the memory manager.

The final talk will cover backup and recovery with pg_dump and the streaming replication.

The talks will be in Italian and, hopefully, I'll be able to stream them via google hangout.

Here is the event page, in Italian.

The schedule is

15:00 Don’t panic! - Introduzione a PostgreSQL
15:40 Coffee break
15:50 The Ravenous Bugblatter Beast of Traal - PostgreSQL: uno sguardo dentro al motore
16:50 Coffee break
17:00 Mostly harmless - Backup & recovery in PostgreSQL
17:50 Closing remarks - Q&A


There is also a Facebook event to RSVP.



The event is free of charge.
 
See you in Ferrara!