Tuesday 25 December 2012

The Christmas Shitload Post


Happy Christmas! Believe it or not, we are almost done. We’re almost out
of winter!

This post is my present for a brilliant new PostgreSQL year.
You know, Christmas usually means goodness, happiness, joy and
kindness.
And if that’s what you’re waiting for, well, you will be sorely disappointed.
I decided to close the year writing down my feelings about VACUUM FULL, one
of PostgreSQL’s key features, and about an interesting thing
I discovered in the utility script shipped with the Debian package for controlling
the clusters, pg_ctlcluster.

And believe me when I tell you I feel like the Angry Birds when the bad piggies
stole their eggs.


VACUUM two-face

VACUUM is one of the most important functions shipped with PostgreSQL. Its
main goal is to free space by cleaning up the dead rows generated during normal
database activity.

A conventional VACUUM simply marks the dead rows as reusable
and freezes old tuples, keeping the s***t from hitting the fan by preventing the XID
wraparound failure; it does not shrink the data file.
If you have a large table with many dead rows, after a VACUUM you will
only see the table stop growing.
If you need to reclaim space on disk, VACUUM FULL is what you
need, as it shrinks the data file, locking the affected relation in ACCESS
EXCLUSIVE mode and preventing both reads and writes.
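
To see the two behaviours side by side, here is a minimal sketch; t_bloated is a made-up table name.

-- check the table's on-disk size before and after each run
SELECT pg_size_pretty(pg_relation_size('t_bloated'));
-- plain VACUUM: dead rows become reusable but the file does not shrink
VACUUM t_bloated;
-- VACUUM FULL: the table is rewritten into a new file node and the file
-- shrinks, but the relation stays locked against reads and writes
VACUUM FULL t_bloated;
SELECT pg_size_pretty(pg_relation_size('t_bloated'));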

During a VACUUM FULL on a massively bloated table I noticed a strange
behaviour.

The free space on the table’s tablespace decreased and then suddenly stopped
decreasing, while the free space on pg_default began to drop until the filesystem
ran out of space, with the subsequent failure of the task.

I started investigating the problem by increasing the client verbosity with
SET client_min_messages='debug';

I’ll explain what I found with a simple test using a
copy of the pg_attribute table.

To watch what’s happening I switched on the debug message level for the
client and turned on the trace_sort parameter during a single-column index
creation on the attname field.
This is the entire procedure output.


db_test=# CREATE TABLE t_vacuum AS
SELECT
    *
FROM
    pg_attribute;
SELECT 2155
-- repeat this many times in order to
-- fill the table with a good amount of data
db_test=# INSERT INTO t_vacuum
db_test-# SELECT * FROM t_vacuum
db_test-# ;
INSERT 0 2155
db_test=# SET trace_sort='on';
db_test=# SET client_min_messages='debug';
db_test=# CREATE INDEX idx_attname ON t_vacuum USING btree (attname);
DEBUG:  building index "idx_attname" on table "t_vacuum"
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  switching to external sort with 59 tapes: CPU 0.04s/0.48u sec elapsed 0.53 sec
LOG:  performsort starting: CPU 0.04s/0.72u sec elapsed 0.77 sec
LOG:  finished writing run 1 to tape 0: CPU 0.08s/2.56u sec elapsed 2.65 sec
LOG:  finished writing final run 2 to tape 1: CPU 0.08s/2.56u sec elapsed 2.65 sec
LOG:  performsort done (except 2-way final merge): CPU 0.09s/2.61u sec elapsed 2.72 sec
LOG:  external sort ended, 836 disk blocks used: CPU 0.14s/2.84u sec elapsed 3.16 sec
CREATE INDEX


As my maintenance_work_mem is small (the workMem = 16384 shown in the log),
the index sort switches immediately to an external sort in order to build the
index, then generates the btree from the external sort file.
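
Incidentally, the on-disk sort can be avoided entirely when enough RAM is available, as CREATE INDEX sorts using maintenance_work_mem. A minimal sketch, assuming the memory really is free (the 256MB value is just an example):

-- raise the sort memory for this session only; with enough memory the
-- index build stays in RAM instead of spilling to an external sort
SET maintenance_work_mem = '256MB';
CREATE INDEX idx_attname ON t_vacuum USING btree (attname);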
Now let’s run a VACUUM FULL on the same table.


db_test=# VACUUM FULL t_vacuum;
DEBUG:  building index "pg_toast_16411_index" on table "pg_toast_16411"
LOG:  begin index sort: unique = t, workMem = 16384, randomAccess = f
LOG:  begin index sort: unique = f, workMem = 1024, randomAccess = f
LOG:  internal sort ended, 17 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 17 KB used: CPU 0.00s/0.00u sec elapsed 0.08 sec
DEBUG:  vacuuming "public.t_vacuum"
DEBUG:  "t_vacuum": found 0 removable, 275840 nonremovable row versions in 4756 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 0.17s/0.99u sec elapsed 1.87 sec.
DEBUG:  building index "idx_attname" on table "t_vacuum"
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  switching to external sort with 59 tapes: CPU 0.07s/0.51u sec elapsed 0.58 sec
LOG:  finished writing final run 2 to tape 1: CPU 0.14s/2.58u sec elapsed 2.73 sec
LOG:  performsort done (except 2-way final merge): CPU 0.14s/2.64u sec elapsed 2.80 sec
LOG:  external sort ended, 836 disk blocks used: CPU 0.21s/2.85u sec elapsed 3.27 sec
DEBUG:  drop auto-cascades to type pg_temp_16402
DEBUG:  drop auto-cascades to type pg_temp_16402[]
DEBUG:  drop auto-cascades to toast table pg_toast.pg_toast_16411
DEBUG:  drop auto-cascades to index pg_toast.pg_toast_16411_index
DEBUG:  drop auto-cascades to type pg_toast.pg_toast_16411
VACUUM


The interesting part begins at row 9 of the listing: the database actually performs the
VACUUM, shrinking the data file and cleaning the dead rows; then at row
13 we find the same index creation output as before.
This explains well why the VACUUM FULL failed.

VACUUM FULL acts exactly like CLUSTER: during the process the
table is read sequentially and the output is written to a new file node, and this
happens in the same tablespace the original table belongs to.

The index build is performed in each index’s own tablespace, except for the
sort on disk, which is governed by the temp_tablespaces parameter. If this
is not set, the sort is performed on pg_default.

In conclusion, if you want to run a VACUUM FULL you will need space for the new
file node in the table’s own tablespace. The indexes likewise require space in their
own tablespaces, plus space for the sort on disk if you don't have enough memory.

So, if space is short on pg_default, a SET temp_tablespaces pointing to a
more spacious tablespace will prevent pg_default from running out of space.
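
For example, a minimal sketch: the tablespace name and the path are made up, and the directory must already exist, empty and owned by the postgres system user.

-- a tablespace on a volume with plenty of free space
-- (ts_scratch and its location are hypothetical)
CREATE TABLESPACE ts_scratch LOCATION '/mnt/bigdisk/pg_scratch';

-- route this session's on-disk sorts there, then rebuild the table
SET temp_tablespaces = 'ts_scratch';
VACUUM FULL t_vacuum;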


Aggressive shutdown

Working with hot standby on Debian I had a bad experience when I set up
a new slave server. After the data copy, with the archive_command in place, I
actually started the slave, forgetting to set the hot_standby parameter to on. After
a while the slave reached the consistent state and I issued the pg_ctlcluster
command in order to stop the cluster, with --force to have a fast shutdown
instead of a smart one.
The script doesn’t have an option to switch between PostgreSQL’s shutdown
modes but, when run without options, it performs a smart shutdown.
With the --force option the script first tries a fast shutdown; if the
database doesn’t shut down in a reasonable time the script switches to an imme-
diate shutdown, and if this is not enough to stop the instance, finally a kill -9
is performed on the postgres process.

WARNING INCOMING AIRSTRIKE!!!!!

If you are too delicate I suggest you stop reading; now begins the real
shitload.

I understand the hard work done by the global team to create a first-class
product. I left Oracle for PostgreSQL and I don’t regret my choice, because
PostgreSQL is a damn good database, probably too good.

But can anyone explain to me why, in the name of any one of the thousands
of fake gods invented by the human race, no one put this VACUUM FULL
behaviour in the online manual, leaving a HUGE documentation gap on one of
PostgreSQL’s KEY FEATURES?

Regarding the Debian script, I understand the --force option of pg_ctlcluster is not
meant for normal conditions, but does the author understand what
it means to perform a smart shutdown under normal conditions?

A user that forgets to log out can leave the script running forever.
On the other side, if --force is passed to the script and the database takes too long to
shut down, an immediate shutdown is performed without warnings.

And this will wipe out any bloody unlogged table from the cluster at
the next startup.
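
To make the point concrete, a small sketch; t_cache is a made-up table name.

-- unlogged tables skip the WAL: writes are fast, but the table is
-- truncated during crash recovery, which is exactly what an immediate
-- shutdown or a kill -9 forces at the next startup
CREATE UNLOGGED TABLE t_cache AS SELECT * FROM pg_attribute;
-- after the forced stop and restart, t_cache comes back empty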

Then, as the cherry on top, if the immediate shutdown does not work, a kill -9 is
performed on the postmaster, leaving the shared memory as garbage.

And it works the same way if you use the Debian init script.


HEAVY MACHINE GUN! (a Metal Slug quote)
So, my last advice is: do not use the wrapper pg_ctlcluster to stop the
clusters created with pg_createcluster.
It’s better to shut down the cluster using pg_ctl directly:

export PGDATA=<PATH TO DATA>
/usr/lib/postgresql/<MAJOR VERSION>/bin/pg_ctl stop -m <smart|fast|immediate>



HAPPY CHRISTMAS!







2 comments:

  1. It certainly seems to me, Fed, that you're suggesting the pg_ctlcluster script is not the best way to manage instances; after reading this, and knowing the damage that kill -9 can cause, I am inclined to agree.

    Kill -9 is never a good thing on Postgres, although I have to admit there have been times where I have had no choice but to -9 a runaway Postgres process; but in doing this, I kill -TERM all other processes first, minimizing the damage to shared mem.

    I now write/rewrite my own init scripts, even for RHEL. One cannot assume that they act the way you want them to, even with the scripts provided with the source, and in the instance where an 'immediate' fails, I leave the process running and report out to allow a suitably armed DBA to fix the issue manually.

    On VACUUM FULL: I very rarely use this facility, simply as a hangover from earlier versions where VACUUM FULL was not like CLUSTER, and used to lock the tables for what seemed like forever (inefficiencies well known to most proficient DBAs). Setting more aggressive AUTOVACUUM settings, and using REINDEX where appropriate, has always seemed to be enough for me.

    In an instance where disk space becomes an issue, that would be over to the sysadmins to correct, although if I as the DBA hadn't planned for enough space in the first instance, then I need to be shot for it!
    Adding more spindles, and new tablespaces on said spindles, is usually the "get out" clause for a lack of space, as well as backing up and then removing old/redundant data.

    Obviously there is always the situation that hasn't been planned for, where an emergency VACUUM FULL is needed, but usually by that time there is insufficient space on the server anyway. In that instance, you're up the creek already. Temporary NFS mounts, anyone?

    In short, good planning (although not always possible) keeps the VACUUM FULL at bay!

    Merry Christmas, and a Happy New Year.

  2. Indeed, good planning can keep the VACUUM FULL at bay :)

    Unfortunately sometimes the space needed is simply too much, and it must be reclaimed with this command.

    What upset me was the lack of documentation about the behaviour, not the behaviour itself.

    Merry Christmas and Happy New Year ;)
