Monday 20 November 2017

pg_chameleon v2.0.0.alpha2 is out

Has been a while since I wrote a post on this blog.
I've been quite busy on coding pgchameleon's version 2.
I finally managed to release an alpha1 the 11th of November but this release had several issues which prevented the users to have real tests.
However, after a week of debugging I've released pg_chameleon v2.0.0.alpha2 which is now quite usable.


For a complete command line reference and an usage example click here.

The documentation now have a dedicated a page with the configuration file details explained.

Please note this pre-release and despite the changelog shows many fixes, there is still a lot of work to do.
Do not use it in production.

So, why pgchameleon v2 and why the version 1 can't be fixed?

Well, pg_chameleon started as an experiment and the version 1 was built without any clue on the architecture and what should be the final result.

This caused the version 1 to have several issues making very difficult to improve it without dismantling the existing logic.

Building up from fresh start required less time to reorganise the original in the correct way.
The code now is more modular, but is still not very pythonic.
My bad, I'm not a developer after all.

The major changes from the version 1 are the following.

  • Python 3 only development
  • Supports the replica from multiple MySQL schemas withing a single MySQL instance. The source's and target schema names can be different.
  • Conservative approach to the replica. Tables which generate errors are automatically excluded from the replica.
  • Daemonised init_replica process.
  • Daemonised replica process with two separated subprocess, one for the read and one for the replay.
  • Soft replica initialisation. The tables are locked when needed and stored with their log coordinates. The replica daemons will put the database in a consistent status gradually.
  • Rollbar integration for a simpler error detection.

The version 2 improves the display of the show_status command when a source name is specified.

There is also an error log in the replica schema which saves the exception's data during the replay phase.

The error log can be queried with the new command show_errors.

To install the version 2 please run.

python3 -m venv venv
source venv/bin/activate
pip install pip --upgrade 
pip install pg_chameleon==2.0.0a2


Release notes: https://github.com/the4thdoctor/pg_chameleon/releases/tag/v2.0.0.alpha2
Documentation: http://www.pgchameleon.org/documents_v2/

Please report any issue on github.



Wednesday 23 August 2017

Happy birthday pg_chameleon

Today is one year since I started working seriously on pg_chameleon.
With this commit I changed the project's license to the 2 clause BSD and the project's scope, evolving the project into a MySQL to PostgreSQL replica system.

Initially this change was just a try. I needed to synchronise the data between MySQL and PostgreSQL and at that time the only option I had it was to use the MySQL foreign data wrapper, eventually to copy the data locally every now and then. However, because the previous implementation relied on a MySQL replica this approach approach wasn't really feasible.

If you are curious about the background story and how we scaled the analytics database  in Transferwise you can read it here.

I developed pg_chameleon in my spare time. I like to think about it like  my little commute project.

The first test on large datasets happened during the amazing days of the pgconf eu 2016. I remember how the process were incredibly slow, taking the unacceptable amount of time.  Four days to copy a 600GB database. I found the bottlenecks during the nights between the conference days building a faster implementation.

I also had to cope with the sql dialect conversion. The solution is still in progress.

Initially I decided to use an existing library but after few failures I realised that sqlparse didn't fit my needs.
So I took the occasion to learn how to use the regular expressions and I doubled my problems at the same time.

In May I presented the project at the Estonian PostgreSQL User Group and the video is available here.

Currently the project is at the version 1.6 which improves the replay speed and comes with better status view with the replay lag along the read lag.

The upcoming release 1.7 will add an optional threaded mode for the replica, where the read and replay processes will run independently.

This version will also see the support for the type override during the init schema and the ddl replay. This change will make simpler to use pg_chameleon as a migration tool (e.g. conversion of tinyint(1) into a boolean).

However the current replay implementation can result in a broken in case of not compatible data pushed into the data fiels (e.g. insert a value >1 in tinyint(1) will throw a type error on postgres if the data is boolean). I'm working on a solution.

I've also started the development of the version 2 but I've not yet kicked off seriously the coding yet. The reason why is that I'm still learning a lot of things thanks to the feedback I'm getting via github.
I will start the version 2 soon and hopefully I will release the first alpha by the beginning of the next year.

However I'm very happy to see pg_chameleon gaining popularity.

If my work even will help just one person to move from MySQL to PostgreSQL, I feel satisfied.

So, happy birthday to pg_chameleon, my little pet project.

Monday 19 June 2017

pg_chameleon 1.3 out

Bugfix emergency release v1.3.1

I discovered a regression when running the init_replica caused by a wrong handling of missing time in master coordinates. Sorry about that.

 


After another round of bug fixes I've released the version 1.3 of my pet project pg_chameleon.

The package is available on pypi as usual.

The changelog is available here.

If you have any question/issue to discuss I created a community on gitter.

Please join!

I've also added a RELASE_NOTES file to explain the changes. As this release changes the relay storage and you plan to upgrade please have look to the file.

The initial implementation for the relay data was to have two log tables t_log_replica_1 and t_log_replica_2 with the replica process accessing one table at a time.

This approach allows autovacuum to take care of the unused partition meanwhile the other is written.

The method worked fine with only one replica worker. However as the flip flop between the tables is calculated indepentently for each source this could cause unwanted bloat on the log tables if several sources are replicating all togheter.

In this scenario autovacuum will struggle to truncate the empty space in the table's end.

The pg_chameleon version 1.3 implements the log tables per source. Each source have a dedicated couple of tables still inherited from
the root partition t_log_replica.


The schema upgrade happen automatically at the first run.
Because this one involves a data copy it could take more time than the usual. If the process seems frozen do not stop it otherwise you may lose your replica setup .

Those are the upgrade steps you should follow.
  • Stop all the replica sources. The show_status command must show all the rows in stopped status
  • Take a backup of the schema sch_chameleon with pg_dump
  • Upgrade pg_chameleon with pip install pg_chameleon --upgrade
  • Run chameleon.py upgrade_schema --config <your_config> --debug
  • When the upgrade is finished start the replica process as usual
If something goes wrong in the upgrade process you shall restore the sch_chameleon’s backup, then you should downgrade the installation to pg_chameleon 1.2 and start the replica as usual.

Saturday 17 June 2017

Amazing pgug Estonia

Last 30th of May I had the honor to talk at the second Estonia PostgreSQL User Group.

My talk was about my pet project pg_chameleon and it went very well.

The audience was about 40 people highly interested in the topic and the questions were  all around.

I'm impressed by the Estonian PostgreSQL User Group.
Their organizational skills are amazing, the members interest is very high and they are fantastic friends.

If you are curious about the talk there is the recording on EEPUG's youtube channel.



Saturday 13 May 2017

pg_chameleon v1.1 out

Last week I announced the stable release of pg_chameleon as I felt quite confident the tool worked properly.

However the murphy law is always ready to teach us we live in an interesting universe. A couple of days after the release I noticed on a server which data were modified seldom a strange replica issue. For some reason at specific moments of the day the inserts replayed on the postgres tables failed with the primary key violation. It took me a while to reproduce the bug as this is an interesting corner case. This sequence of statemens on MySQL cause the replica break down on the version 1.0.

DROP TABLE IF EXISTS test;
CREATE TABLE test(
  id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value1 VARCHAR(45) NOT NULL,
  PRIMARY KEY  (id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TEMPORARY TABLE tmp_test(
  id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  value1 VARCHAR(45) NOT NULL,
  PRIMARY KEY  (id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
insert into tmp_test (value1) values('blah'),('blah');
insert into test (value1) values('blah');
DROP TEMPORARY TABLE if exists tmp_test ;
This is caused by the batch not marked as closed when the query is received. The batch was closed only when the query were parsed. The statements CREATE TEMPORARY TABLE are not parsed (by design) and therefore the batch were not closed correctly. The subsequent row images generated by the inserts were read at each loop causing a replay issue because of the primary key violation.

I also took the occasion to complete the docstrings on the last library sql_util.py. Now the classes are is fully documented.

The release 1.1 is already available on pypi.

I've also created  a google group for help and general discussions.

I'd really love to hear users's stories.

Sunday 16 April 2017

pg_chameleon 1.0 RC1 available now


In 2013 I started my pet project pg_chameleon to discharge my frustration only to find the entire thing even more frustrating.

It's fascinating how time and circumstances can fix the personal perspective revamping the passion for development that never left me.

This philosophical introduction is to tell I still can't believe I've relased the RC1 of my pet project pg_chameleon.

The release is already available on pypi and includes several bug fixes and performance improvement. In particular the speed of pulling the row changes from MySQL is now 10 times faster because I'm skipping the implicit BEGIN issued by MySQL when generating a row image.

The display of sources and status is now nicer with the tabulate library.

I've also changed the logging file handler to use the time rotate which automatically rotate the replica logs on a daily basis. The log retention is configurable using the parameter log_days_keep.

It's also present a new command detach_replica which creates a standalone postgresql database with the sequences correctly set and the foreign keys created as invalid. The detach process tries to validate the foreign keys in a second moment though.

Pypi: https://pypi.python.org/pypi/pg_chameleon/
Documentation: http://pythonhosted.org/pg_chameleon/
Github: https://github.com/the4thdoctor/pg_chameleon

From this release I won't add new features but I'll focus only on bug fixes.
 

Changelog from v1.0 beta 2.

  • add support for primay key defined as column constraint
  • fix regression if null constraint is omitted in a primary key column
  • add foreign key generation to detach replica. keys are added invalid and a validation attempt is performed.
  • add support for truncate table
  • add parameter out_dir to set the output destination for the csv files during init_replica
  • add set tables_only to table_limit when streaming the mysql replica
  • force a close batch on rotate event if binlog changes without any row events collected
  • fixed replica regression with python 3.x and empty binary data
  • added event_update in hexlify strategy
  • add tabulate for nice display for sources/status
  • logs are rotated on a daily basis
  • removed parameter log_append
  • add parameter log_days_keep to specify how many days keep the logs
  • feature freeze

Monday 20 March 2017

pg_chameleon beta 1 out

After fixing few bugs I decided to release the first pg_chameleon beta.

The package details with the changelog are available here http://www.pgdba.co.uk/p/mysql-to-postgresql-replica.html

These are the notable changes.

  • Switch to psycopg2 2.7.x - Thanks to Daniele's work now the requirement installs without need for python or postgresql headers, making the install in virtualenv much simpler
  • Install system wide or in virtualenv follow the same rules. The configuration files and the package files are installed in the python's site-packages/pg_chameleon. 
  • Configuration directory created at first run. If not present the $HOME/.pg_chameleon/ directory is created when chameleon.py is executed the first time. The script creates  the directory and the required config dir with the config-example.yaml file. 
  • The write_batch function is now using the copy_expert in order to speedup the batch load. The fallback to inserts is still present.
 The release is already available on pypi https://pypi.python.org/pypi/pg_chameleon/

The documentation is available here http://pythonhosted.org/pg_chameleon/

Please report any issue on the project's github page https://github.com/the4thdoctor/pg_chameleon

However if you like to get in touch you can ping me on twitter @4thdoctor_scarf or if you prefer to chat, on irc there is a dedicated channel irc.freenode.net #pgchameleon (you can find me logged in at night).

Thursday 2 March 2017

pg_chameleon alpha4 available on pypi

Over the last week end I finally worked trough python's setuptools and relased the  alpha4 for pg_chameleon 1.0.
This is the final alpha version the version 1.0.

The system now can be installed in the virtualenv just typing
 pip install pg_chameleon

It's important pip is up to date so if there's any error upgrade pip with

pip install pip --upgrade

The first install creates in the user's home the directory .pg_chameleon with 4 subdirectories.

  • config is where the configuration files are stored. Use config-example.yaml as template for the other configuration files. If you have already a config file from a previous test you can copy there and the system should work. You will need to update the paths of logs and pid directories. The previous values with relative paths will no longer work. You can use  either use an absolute path or provide the home alias ~. Check the config-example.yaml for the latter configuration.
  • pid is where the replica pid is created. it can be changed in the configuration file.
  • logs is where the replication logs are created. it can be changed in the configuration file
  • sql stores the sql service files and upgrade files, you can ignore it

It's also possible to install pg_chameleon system wide (if you have root privileges). In that case the user's directory is not created. An example structure is put in /usr/local/etc/pg_chameleon.

You should copy the directory in the user's home dir (with the . prefix).
Using directly the general directory is not a good idea for security reasons.

The wrapper script which becomes chameleon.py.

Usage and examples in the documents.

pypi: https://pypi.python.org/pypi/pg_chameleon/1.0a4
github: https://github.com/the4thdoctor/pg_chameleon
documentation: http://pg-chameleon.readthedocs.io/en/v1.0-alpha.4/

Please test!

Thank you.

Friday 3 February 2017

Slides from the fosdem pgday 2017

Thanks everybody for listening. And massive thanks to the PostgreSQL Europe staff for organising an amazing conference.

The slides from my talk are available for download here.

The open source replica system is on github https://github.com/the4thdoctor/pg_chameleon and hopefully Transferwise will also release the customised obfuscator soon.

 


Sunday 29 January 2017

See you in Brussels

The next week I will present at the Fosdem PGDay 2017.
This is the first time I travel to Brussels, the first time I visit the Fosdem and the presentation is almost brand new. I gave the same presentation at the last Brighton PostgreSQL meetup and  this helped me to tailor the slides in a better way.

During the talk I will explain about the first project I was assigned since I joined Transferwise in July 2016, where I helped the analyst to move a struggling MySQL database, used for the business intelligence, into a faster PostgreSQL 9.5 database.

I can't wait to meet the PostgreSQL friends and visit the Fosdem.

So, see you in Brussels!