Wednesday 25 June 2014

Chapter 7 - Part 4, VACUUM FULL and CLUSTER


The CLUSTER command rebuilds a completely new table with the tuples in the same order as the clustered index. The clustered index can be set using the command ALTER TABLE table_name CLUSTER ON index_name and is used as the sort key in the next CLUSTER. In order to get a clearer picture, let's cluster the table created in 7.1 on its primary key.
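As a minimal sketch of the two steps just described, the statements below assume a hypothetical table t_example with a primary key index called pk_t_example. Both names are placeholders and are not the objects created in 7.1.

```sql
-- Hypothetical names: t_example and pk_t_example are placeholders.

-- Record pk_t_example as the clustered index of the table.
-- This only stores the choice; no data is moved yet.
ALTER TABLE t_example CLUSTER ON pk_t_example;

-- Rebuild the table using the clustered index recorded above.
CLUSTER t_example;

-- Alternative single step: set the clustered index and rebuild at once.
CLUSTER t_example USING pk_t_example;
```

The first form is handy when the rebuild is scheduled for later, because a plain CLUSTER t_example picks up whichever index was recorded last.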

In this case CLUSTER performs a full index scan on the clustered index, avoiding the table sort. The tuples are then stored into a new file node. When the build is complete, the relation's file node is swapped in the system catalogue and the indices are rebuilt. When the command completes, the old file node is removed. The entire process requires an ACCESS EXCLUSIVE lock on the table, which prevents both reads and writes. Storage is also a critical point, because the disk space required is the old relation plus the new one with its indices, plus any sort that spills to disk.
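The file node swap and the space requirement can be observed from the catalogue. The following is a sketch under the same assumption of a hypothetical table t_example and index pk_t_example; the size figures are only a rough guide to the extra space the rebuild may need.

```sql
-- Hypothetical names: t_example and pk_t_example are placeholders.

-- File nodes before the rebuild.
SELECT relname, relfilenode
  FROM pg_class
 WHERE relname IN ('t_example', 'pk_t_example');

-- Current size of the table and its indices. The rebuild needs
-- roughly this much free space again while the old files still exist.
SELECT pg_size_pretty(pg_relation_size('t_example'))       AS table_size,
       pg_size_pretty(pg_indexes_size('t_example'))        AS indexes_size,
       pg_size_pretty(pg_total_relation_size('t_example')) AS total_size;

-- Rebuild the table; reads and writes are blocked until it completes.
CLUSTER t_example USING pk_t_example;

-- File nodes after the rebuild; the values are expected to differ
-- from the first query because new files were swapped in.
SELECT relname, relfilenode
  FROM pg_class
 WHERE relname IN ('t_example', 'pk_t_example');
```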

Looking at the source code in src/backend/commands/cluster.c, the file's header clearly states: "CLUSTER a table on an index. This is now also used for VACUUM FULL." The only difference between VACUUM FULL and CLUSTER is the validity of the clustered index's OID. If it's valid, the data output is sorted on the clustered index. How the data is sorted is determined by the planner, which chooses the cheaper option between an index scan and a sequential scan followed by a sort. Otherwise, if the index's OID is invalid, the tuples are read using a plain sequential scan.
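Seen from the SQL prompt, the two code paths differ only in whether an index is supplied. A short sketch, again assuming the hypothetical t_example and pk_t_example; the VERBOSE option is used so that the server reports what it is doing, and the exact wording of that output depends on the PostgreSQL version.

```sql
-- Hypothetical names: t_example and pk_t_example are placeholders.

-- No index is involved: the tuples are copied with a plain
-- sequential scan, in whatever physical order they are found.
VACUUM FULL VERBOSE t_example;

-- An index is supplied: the output is ordered on pk_t_example,
-- read either through the index or through a scan plus sort.
CLUSTER VERBOSE t_example USING pk_t_example;
```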

VACUUM FULL and CLUSTER have beneficial effects on the storage, as the reclaimed space is returned to the operating system. Also, completely regenerating the relation's files and rebuilding the indices makes page access more efficient, and CLUSTER rebuilds the table in the clustered index's order. This minimises the random disk seeks when accessing the data via the clustered index.

The disadvantage of using these commands is the complete stop of the affected table's activity. Also, CLUSTER and VACUUM FULL do not address the XID wraparound risk, which the conventional VACUUM does.
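The table's distance from wraparound can be checked directly, before and after any of these commands. This is a sketch assuming the hypothetical table t_example; how the reported age changes after each command may depend on the PostgreSQL version in use.

```sql
-- Hypothetical name: t_example is a placeholder.

-- Age, in transactions, of the oldest unfrozen XID in the table.
-- The closer this gets to autovacuum_freeze_max_age, the sooner
-- a wraparound-prevention vacuum will be forced on the table.
SELECT relname,
       age(relfrozenxid) AS xid_age
  FROM pg_class
 WHERE relname = 't_example';
```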

As a rule of thumb, in order to minimise the database's downtime, CLUSTER and VACUUM FULL should be used only for extraordinary maintenance and only if the disk space is critical. For the day to day maintenance it's best to rely on VACUUM and, occasionally, REINDEX, as seen in 7.1 and 7.3.
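For comparison, the routine alternative could look like the sketch below, once more on the hypothetical table t_example. Note that REINDEX still blocks writes on the table while it runs, so it is best scheduled in a quiet period.

```sql
-- Hypothetical name: t_example is a placeholder.

-- Day to day: mark dead tuples as reusable without blocking
-- normal reads and writes on the table.
VACUUM t_example;

-- Occasionally: rebuild all the indices of the table.
REINDEX TABLE t_example;
```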
