Thursday 20 December 2012

Time and relative arrays in space

Scale what?

Few years ago when I was self employed as PostgreSQL consultant and I was
appointed a big performance tuning task by a great search engine (no, not
THAT search engine).

The application was similar to the Google’s desktop, a grid of boxes with
RSS feeds. Unfortunately the implementation of the two dimensional grid had
bad performance issues.

Any position change with a drag and drop, a delete or an add, even in low
activity periods, required the consequent database update to run for at least
20 seconds. During the peak time the system was unusable as the php hit the
execution limit timeout and this caused any transaction to abort.

The simple design, a table with two integer columns mapping the X-Y co-
ordinates with a unique constraint to avoid duplicates, wasn’t scaling and the
problem was caused by the unique constraint itself.

As the PostgreSQL 8.4 didn’t supported the deferrable unique constraints
the position change required an update for each row in order to keep the sequence
without holes, and no way to run a bulk update because the unique constraint
was violated during the update.

This approach worked well with the stage, once pushed to the live system
the map table became several millions rows big and the updates, with all the
overhead introduced by the indexes itself simply did not scale.

Scale me this!

After one week of analysis and test I’d realized that the solution was to rewrite
completely the application logic changing the conventional table into an aggre-
gated table.

The original table had four fields. userid, boxid,xposition,yposition.

I moved the boxid, from the map table field into a two dimensional text
array binding and adding the userid column as primary key.

In that way I could access the array by userid and get the boxid simply
passing the x-y coordinate. The task to build the box grid became incredibly
simple, with the array unnest.

With this structure any box move,add or delete with drag and drop was
performed over the array values and required only one table row update.

The previous implementation required for the move at least three updates.
For the box delete one database delete with a new sequencing for all the re-
maining rows. For the add a re sequencing for all the rows and an insert.
After the new implementation gone live the improvement was massive.
In peak activiy the longest update was 0.7 s.


After this experience, recently I used the same approach to implement a grid of
FIFO queues simulating a physical structure into the abstract data structure.

Obviously the arrays are not the solution for any scale problem, but if care-
fully designed can be used to aggregate data easing massively the database

As there’s no implementation of foreign keys over the array’s elements is not
possible, in theory, to have a referential integrity on those data structures.
I heard a proposal to have this feature added to PostgreSQL.

IMHO this is absolutely useless and if you need a foreign on an array, then probably
your database design is wrong.

If you are wondering where is it the web product described in this post, then it’s
a shame it was decommissioned after one year I did my job.

No comments:

Post a Comment