Sunday 30 March 2014

Chapter 4 - part 3 - The shutdown sequence

The shutdown sequence

For those coming from Oracle, PostgreSQL's shutdown sequence will sound familiar, with a few, but very important, exceptions. A PostgreSQL process enters the shutdown sequence when it receives a specific OS signal.
This can happen using the OS kill command or via pg_ctl.

As seen in section 3.1.2, the latter accepts the -m switch to specify the shutdown mode. If not specified it defaults to the smart mode, which sends the signal SIGTERM to the running PostgreSQL process. This way the database will not accept new connections and will wait for the existing connections to exit before shutting down.

The most useful PostgreSQL shutdown mode is the fast mode. In this case the signal SIGINT is sent to the postgres process; the database disconnects all the sessions, rolling back the open transactions, and then enters the shutdown sequence.

Both modes, smart and fast, make the shutdown clean, leaving the cluster in a consistent state when the postgres process exits.
As soon as the postgres process enters the shutdown sequence it issues a last checkpoint, consolidating the dirty blocks in the shared buffer to the data files.

As soon as the checkpoint is complete and the last consistent position is logged in the pg_control file, the main database process stops the auxiliary processes, like the walwriter or the checkpointer, and terminates, removing the pid file.
The last checkpoint can slow down the entire sequence because it is usually spread over time to avoid any spike in disk IO activity. If the shared buffer is big and contains many dirty blocks, the checkpoint can run for a very long time. In addition, if another checkpoint is already running at shutdown time, the database will wait for its completion before starting the final checkpoint. Meanwhile, new connections are forbidden, whatever shutdown mode you decided to use.
In order to have an idea of what's happening on the running cluster, it is good practice to change the GUC log_checkpoints from off to on.
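As a minimal sketch, assuming a Debian packaged 9.3 install (the file locations differ on other setups), the change is a single line followed by a reload:

```shell
# enable checkpoint logging; paths are examples from a Debian
# packaged 9.3 install and must match your installation
echo "log_checkpoints = on" >> /etc/postgresql/9.3/main/postgresql.conf
/usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main reload
```

The parameter does not require a restart; a reload is enough.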
For more information about the database processes take a look at section 4.4.

If the cluster is taking too long to stop, as a last resort it is possible to use the immediate mode. This sends a SIGQUIT to the running process, causing the immediate exit of the main process and all the sessions.
The shutdown in this case is not clean and the subsequent start will perform a crash recovery, starting from the last consistent state read from the pg_control file. The process is usually harmless, with one important exception.
If the cluster has unlogged tables, those relations are recreated from scratch when the recovery happens and all the data in those tables is lost.
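The three modes can be sketched with pg_ctl as follows; the binary and data directory paths are examples and must match your installation.

```shell
PGBIN=/usr/lib/postgresql/9.3/bin
PGDATA=/var/lib/postgresql/tempdata

# smart (default): SIGTERM, waits for the sessions to disconnect
$PGBIN/pg_ctl -D $PGDATA -m smart stop

# fast: SIGINT, rolls back the open transactions and disconnects
$PGBIN/pg_ctl -D $PGDATA -m fast stop

# immediate: SIGQUIT, aborts without the final checkpoint
$PGBIN/pg_ctl -D $PGDATA -m immediate stop
```

Only the first two leave the cluster in a consistent state; the immediate stop forces a crash recovery at the next start.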

This is the main reason I suggest avoiding the pg_ctlcluster shipped with Debian to stop the cluster. The program doesn't offer any control over the shutdown mode and can result in disastrous data loss. For more details take a look at subsection 3.3.4.

A last word about the SIGKILL signal. It can happen that the cluster refuses to stop even using the immediate mode. In this case, as a very last resort, SIGKILL, alias kill -9, can be used. The online manual is very clear on this point: as SIGKILL cannot be trapped and evicts the process from RAM together with its shared memory, its use results in the resources used by the killed process not being freed.

This will very likely affect the start of a fresh instance. Please refer to your sysadmin to find out the best way to perform a memory and semaphore cleanup before starting PostgreSQL after a SIGKILL.
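As a sketch, on Linux the SysV leftovers can be inspected with ipcs and removed with ipcrm; the ids below are examples, so check the owner column before removing anything.

```shell
# list the SysV shared memory segments and semaphore arrays,
# looking for entries owned by the postgres user
ipcs -m
ipcs -s

# remove a leftover segment or semaphore array by id (example ids)
# ipcrm -m 32768
# ipcrm -s 98304
```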

Wednesday 26 March 2014

Chapter 4 - part 2 - the startup sequence

The startup sequence

When the server process is started it allocates the shared memory segment. In the versions before 9.3 this was a potential point of failure: if shared_buffers was bigger than the kernel's maximum allowed shared memory segment, the startup would fail with this sort of error

FATAL: could not create shared memory segment: Cannot allocate memory

DETAIL: Failed system call was shmget(key=X, size=XXXXXX, XXXXX).

HINT: This error usually means that PostgreSQL's request for a shared memory
segment exceeded available memory or swap space, or exceeded your kernel's
SHMALL parameter. You can either reduce the request size or reconfigure the
kernel with larger SHMALL. To reduce the request size (currently XXXXX bytes),
reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or
max_connections.

In this case the kernel parameters need adjustment. From my Oracle experience, I take no chances and use the values suggested for an Oracle installation, which incidentally work perfectly for PostgreSQL.
kernel.shmall = 2621440
kernel.shmmax = 18253611008
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 658576
Put those values into the file /etc/sysctl.conf and, as root, run sysctl -p.
The latest major version of PostgreSQL switched from SysV to POSIX shared memory and mmap. This completely solved the shared memory tuning issue on Linux.
Once the memory is allocated, the postmaster reads the pg_control file to check whether the instance requires recovery.
pg_control contains references to the last checkpoint and the last known status of the instance.
If the instance is in a dirty state, because of a crash or an unclean shutdown, the startup replays the blocks from the WAL segments in the pg_xlog directory, starting from the last checkpoint position read from pg_control.
Any corruption in the WAL files or, even worse, in the pg_control file results in an instance that cannot start.
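The last known status can be read with pg_controldata; the paths below are from the packaged 9.3 install used in these examples.

```shell
# print the pg_control contents; the "Database cluster state" line
# reports values like "shut down" (clean) or "in production"
# (a crash recovery will run at the next start if the postmaster
# is not actually running)
/usr/lib/postgresql/9.3/bin/pg_controldata /var/lib/postgresql/tempdata
```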

After the recovery is complete, or if the cluster is in a clean state, the startup completes, opening the database in production state.

Sunday 23 March 2014

Chapter 4 part 1 - Initialising the data directory

Initialising the data directory

A PostgreSQL instance is composed of a shared memory segment and a data area managed by the postgres processes. The data area is initialised by initdb, which requires an empty and writable directory to succeed. The binary's location depends on the installation method; take a look at chapters 2 and 3 for further information. initdb accepts various parameters. If not supplied, the program will try to get the information from the environment variables.

Its basic usage is

initdb [OPTION]... [DATADIRECTORY]

If the variable PGDATA is set then the DATADIRECTORY can be omitted.
postgres@tardis:~/tempdata$ export PGDATA=`pwd`
postgres@tardis:~/tempdata$ /usr/lib/postgresql/9.3/bin/initdb 
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_GB.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/tempdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
creating configuration files ... ok
creating template1 database in /var/lib/postgresql/tempdata/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    /usr/lib/postgresql/9.3/bin/postgres -D /var/lib/postgresql/tempdata
    /usr/lib/postgresql/9.3/bin/postgres -D /var/lib/postgresql/tempdata
    /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/tempdata -l logfile start
After initialising the data area, initdb shows the two methods to start the database instance. The first one is useful for debugging and development purposes, as it starts the database directly from the command line without daemonising the process.
postgres@tardis:~/tempdata$ /usr/lib/postgresql/9.3/bin/postgres -D 
LOG:  database system was shut down at 2014-03-23 18:52:07 UTC
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
To stop it simply press ctrl+c.
The second method is more interesting, as the PostgreSQL process is managed as a daemon via pg_ctl.
The option -l logfile enables the logfile redirection. Without this option the server's output will appear on the standard output.
without -l

postgres@tardis:~/tempdata$ /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/tempdata start
server starting
postgres@tardis:~/tempdata$ LOG:  database system was shut down at 2014-03-23 19:00:36 UTC
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

with -l 

postgres@tardis:~/tempdata$ /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/tempdata -l logfile start
server starting

postgres@tardis:~/tempdata$ tail logfile 
LOG:  database system was shut down at 2014-03-23 19:01:19 UTC
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
As seen in subsection 3.1.2, pg_ctl accepts the command to perform on the cluster.
In order to stop the running instance simply run
postgres@tardis:~$ /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/tempdata -l logfile stop
waiting for server to shut down.... done
server stopped

Friday 14 March 2014

Chapter 3 - part 2, the wrappers and debian's utilities

Wrappers and contributed modules

In this section we'll take a brief look at the contributed binaries and the SQL wrappers.

create/drop binaries

These binaries, createdb, createlang, createuser and dropdb, droplang, dropuser, are wrappers for the corresponding SQL commands. Each binary can create or drop a database, a user or a procedural language. The command line parameters for the connection part are mostly the same as psql's.
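A minimal sketch of their usage; db_test and user_test are example names and the commands assume a local cluster accepting connections.

```shell
# each wrapper maps to the corresponding SQL statement
createuser --login user_test        # CREATE ROLE user_test LOGIN;
createdb --owner=user_test db_test  # CREATE DATABASE db_test OWNER user_test;

# and the reverse
dropdb db_test                      # DROP DATABASE db_test;
dropuser user_test                  # DROP ROLE user_test;
```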


clusterdb

Performs a database wide cluster on previously clustered tables. The word cluster can be confusing, as the PostgreSQL implementation is very peculiar. Check chapter 6 for further reading.


reindexdb

Performs a database wide reindex. In chapter 6 we'll look deeply at index maintenance and how it can affect performance.


vacuumdb

Performs a database wide vacuum and/or statistics gathering for the query optimiser. VACUUM is a very important maintenance task; we'll take a deep look at it in chapter 8.


vacuumlo

Despite the name, this binary doesn't perform any vacuum; its purpose is to remove orphaned large objects from pg_largeobject, the system table where the database stores binary objects. It is usually used when the size of a binary object is bigger than 1 GB. The theoretical limit for a large object is now 4 TB; before version 9.3 the limit was 2 GB.

Debian's specific utilities

The Debian packaged install ships with some other, unofficial, utilities, mostly written in Perl.


pg_createcluster

Creates a new PostgreSQL cluster, naming the configuration directory in /etc/postgresql after the major version and the cluster's name. It's possible to specify the data directory and the initdb options.


pg_dropcluster

Removes a PostgreSQL cluster previously created with pg_createcluster. The cluster must be stopped before the drop.


pg_lsclusters

Lists the clusters created with pg_createcluster.


pg_ctlcluster

Controls the cluster in a way almost similar to pg_ctl. Using this wrapper for the shutdown is not a good idea, as there's no way to tell the script which shutdown mode to use. For further reading on the shutdown sequence check section 4.2.2. By default pg_ctlcluster performs a smart shutdown. With the --force option the script first tries a fast shutdown; if the database doesn't shut down in a reasonable time the script then tries an immediate shutdown, and if the instance is still up it finally sends a kill -9 to the postgres process.

Tuesday 4 March 2014

Chapter 3 -part 1 - Install structure

Install structure

In this chapter we'll look at the PostgreSQL installation. Whatever installation method you used, the PostgreSQL binaries are the same. The packaged version comes with a few extra utilities, which we'll look at later as they present some caveats to be aware of.
The source installation puts all the binaries into the bin sub directory of the target location specified by the --prefix parameter.
The packaged install puts the binaries into a folder organised per major PostgreSQL version,
e.g. /usr/lib/postgresql/9.3/bin/

The core binaries

We'll look first at the core binaries like postgres or psql.
A separate section is dedicated to wrappers and the contributed modules.


postgres

This is the database itself. It's possible to start it directly or through the pg_ctl utility.
The latter is the best way to control the instance, except in case of an XID wraparound failure; in that case the only way to run the instance is executing postgres in single user mode.
For historical reasons there's usually also a postmaster symbolic link to postgres.


pg_ctl

As mentioned before, this is the utility for managing the PostgreSQL instance.
It can start, stop and reload the postgres process. It's also capable of sending kill signals to the running instance.
pg_ctl accepts various options on the command line. The most important are -D or --pgdata= to specify the database storage area and -m to specify the shutdown mode. Check section 4.2.2 for details.
pg_ctl also requires the action to perform on the instance.
The supported actions are

  • init[db] initialises a directory as a PostgreSQL data area
  • start starts a PostgreSQL instance
  • stop shuts down a PostgreSQL instance
  • reload reloads the configuration files
  • status checks the PostgreSQL instance's running status
  • promote promotes a standby server
  • kill sends a custom signal to the running instance
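The actions above can be sketched as follows; the paths are examples from the packaged install used in this chapter.

```shell
PGBIN=/usr/lib/postgresql/9.3/bin
export PGDATA=/var/lib/postgresql/tempdata

$PGBIN/pg_ctl initdb             # initialise the data area
$PGBIN/pg_ctl start -l logfile   # daemonise, redirecting the output
$PGBIN/pg_ctl status             # check the running status
$PGBIN/pg_ctl reload             # re-read the configuration files
$PGBIN/pg_ctl stop -m fast       # shut down in fast mode
```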


initdb

This is the binary which initialises the PostgreSQL data area. initdb requires an empty directory to initialise. Various options can be specified on the command line, like the character encoding or the collation order.


psql

This is the PostgreSQL command line client. Despite its very essential look, it is one of the most flexible tools available to interact with the server. As part of the core distribution it is always present.


pg_dump

This is the binary dedicated to the backup. It generates consistent backups in various formats, the default being plain text, and supports parallel dump, implemented using the snapshot exports.
The switch to set is -F followed by a letter to indicate the wanted output format.

  • p saves the SQL statements to reconstruct the schema and/or data in plain text with no compression.
  • c is the custom PostgreSQL format. It supports parallel restore, compression and object search.
  • d saves the dump in a directory. With this format it is possible to dump in parallel.
  • t saves the dump in the standard tar format.

Please don't be confused by pg_dumpall. It looks more like a wrapper for pg_dump than a dedicated program.
As pg_dumpall doesn't support all the pg_dump features, it is still very useful to save the cluster wide objects, like the users, with the switch --globals-only.
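The formats above can be sketched like this; db_test and the output names are examples.

```shell
pg_dump -Fp db_test > db_test.sql        # p, plain text on stdout
pg_dump -Fc -f db_test.dmp db_test       # c, custom format
pg_dump -Fd -j 4 -f db_test_dir db_test  # d, directory format, 4 parallel jobs
pg_dump -Ft -f db_test.tar db_test       # t, tar format

# cluster wide objects, like the users, with pg_dumpall
pg_dumpall --globals-only > globals.sql
```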


pg_restore

As the name suggests, this is used to restore the database's dumps. It can read the backups generated by pg_dump in all formats. The restore target can be a PostgreSQL connection or a file. If the backup's format is directory or custom, pg_restore can run the data load and the index/key creations in multiple jobs.
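A sketch of the two restore targets; db_test.dmp is assumed to be a custom format dump.

```shell
# restore into a PostgreSQL connection, four parallel jobs
pg_restore -j 4 -d db_test db_test.dmp

# or emit the SQL statements to a file instead
pg_restore -f db_test.sql db_test.dmp
```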


pg_controldata

This program queries the pg_control file, where the instance's vital information is stored. pg_control is one of the cluster's most important files; with a corrupted pg_control the instance cannot start.


pg_resetxlog

If the WAL files get corrupted the instance cannot perform a crash recovery. pg_resetxlog can solve the problem and make the instance startable, but keep in mind this must be the last resort, after trying any other possible solution.
The reset removes the WAL files and creates a new pg_control. The XID counter is also reset.
The instance becomes startable at the cost of losing any reference between the transactions and the data files. All physical data integrity is lost and any attempt to run DML queries results in data corruption.
The online manual is absolutely clear on this point:
after running pg_resetxlog the database must be started without user access, the entire content must be dumped, the data directory must be dropped and recreated from scratch using initdb, and then the dump file can be restored using psql or pg_restore.
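The procedure can be sketched step by step as follows; the paths are examples and every step is destructive, so treat this strictly as a last resort.

```shell
PGBIN=/usr/lib/postgresql/9.3/bin
PGDATA=/var/lib/postgresql/tempdata

$PGBIN/pg_resetxlog $PGDATA                 # reset WAL and pg_control
$PGBIN/pg_ctl -D $PGDATA start              # start, keeping the users out
$PGBIN/pg_dumpall > full_dump.sql           # dump the entire content
$PGBIN/pg_ctl -D $PGDATA -m fast stop
rm -rf $PGDATA && $PGBIN/initdb -D $PGDATA  # recreate from scratch
$PGBIN/pg_ctl -D $PGDATA start
psql -f full_dump.sql postgres              # restore the dump
```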