Saturday 26 April 2014

Chapter 5 part 1 - The connection

The connection

When a client tries to connect to a PostgreSQL cluster, the process goes through a few stages which can result in rejection or connection. The first stage is the host based authentication, where the cluster scans the pg_hba.conf file searching for a match against the connection's parameters, like the host, the username etc. This file is usually stored in the data area alongside the postgresql.conf file and is read from top to bottom. If there's a match the corresponding method is used; otherwise, if there's no match, the connection is refused.
The pg_hba.conf file is structured as shown in table 5.1.

Table 5.1: pg_hba.conf, possible values per column

Type        Database   User   Address                   Method
local       name       name                             trust
host        name       name   ip address/network mask   reject
hostssl     *          *      host name                 md5
hostnossl   *          *                                password

The type column specifies whether the connection is local, happening via unix socket, or host, hostssl or hostnossl, in which case tcp/ip is used. The host type matches either an SSL or a plain connection, hostssl only an SSL connection and hostnossl only a plain connection.
The Database and User columns are used to match specific databases or users from the incoming connection. It's possible to use a wildcard to match everything.
The address column is used only if the type uses tcp/ip and can be an ip address with a network mask or a hostname. Both ipv4 and ipv6 are supported.
The last column is the authentication method. PostgreSQL supports many methods, from the simple password challenge to the sophisticated radius or kerberos.
For now we'll take a look at the most common.

  • trust allows the connection without any password request. It's quite useful if the password is lost but represents a threat in production.
  • peer allows the connection if the OS user matches the database user. Useful to authenticate to the database on local boxes. Initdb sets this as the default method for the local connections.
  • password allows the connection by matching the username and password with the pg_shadow system table. Beware, as this method sends the password in clear over the network.
  • md5, same as password, but offers an md5 encryption for the password. As md5 is deterministic, a pseudo random subroutine is used during the password challenge to avoid the same string being sent over the network.
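Putting the methods together, a minimal pg_hba.conf could look like the following sketch; the subnet is of course just an example:

```
# TYPE  DATABASE        USER            ADDRESS                 METHOD

# local unix socket connections, OS user must match the db user
local   all             all                                     peer
# tcp connections from the local machine, no password asked
host    all             all             127.0.0.1/32            trust
# tcp connections from one subnet, md5 password challenge
host    all             all             192.168.1.0/24          md5
```

Remember the file is read from top to bottom: a broader rule placed first shadows any more specific rule below it.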
When the connection request matches the pg_hba.conf entries and the authentication method is cleared, the connection becomes established. The postgres main process forks a new backend process which attaches to the shared buffer.
As the fork process is expensive, the connection is a potential bottleneck. A great amount of opening connections can degenerate into zombies, resulting in the worst case in a denial of service.
A solution could be to keep all the needed connections constantly established. Even if it seems reasonable, this approach has a couple of unpleasant side effects.
Each connection slot in the GUC max_connections consumes about 400 bytes of shared memory; each established connection requires the allocation of the work_mem and the OS management of the extra backend process.

For example a 512 MB shared_buffer and a 100 MB work_mem, with 500 established connections, consume about 49 GB. Even reducing the work_mem to 10 MB the required memory is still about 5 GB; in addition, as seen in 4.5.2, this parameter affects the sorts and subsequently the performance, so its value requires extra care.
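The worst case arithmetic above can be sketched with a quick back of the envelope function; it's a deliberate simplification, as it ignores the per process overhead and the multiple work_mem allocations a complex query can perform:

```python
def worst_case_memory_gb(shared_buffers_mb, work_mem_mb, connections):
    """Rough worst case memory footprint: the shared buffer plus
    one work_mem allocation per established connection."""
    return (shared_buffers_mb + work_mem_mb * connections) / 1024.0

# 512 MB shared_buffers, 100 MB work_mem, 500 connections -> about 49 GB
print(worst_case_memory_gb(512, 100, 500))
# reducing work_mem to 10 MB -> still about 5 GB
print(worst_case_memory_gb(512, 10, 500))
```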
In this case a connection pooler like pgpool or the lightweight pgbouncer is a good solution. In particular the latter offers a configuration very simple to use, with different pooling levels.
The GUC parameter listen_addresses in a freshly initialised data area is set to localhost. This way the cluster accepts only tcp connections from the local machine. In order to have the cluster listening on the network this parameter must be changed to the correct address to listen on, or to * for 'all'. The parameter accepts multiple values separated by commas.
Changing the parameters max_connections and listen_addresses requires the cluster shutdown and startup as described in 4.2 and 4.3.
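For example, a postgresql.conf fragment opening the cluster to the network could look like this; the address and the connection limit are purely illustrative:

```
# listen on the loopback and on one specific interface;
# use '*' to listen on all the available addresses
listen_addresses = 'localhost,192.168.1.23'
max_connections = 100
```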

Saturday 19 April 2014

Chapter 4 - part 6 - The data area

As seen before, the data storage area is initialised by initdb. Its structure hasn't changed much since the old fashioned 7.4. In this section we'll take a look at the various subdirectories and how their usage can affect the performance.


base

As the name suggests, the base directory contains the database files. Each database has a dedicated subdirectory, named after the internal database's object id. A freshly initialised data directory shows only three subdirectories in the base folder. Those correspond to the two template databases, template0 and template1, plus the postgres database. Take a look at section 4.7 for more information.
The numerical directories contain various files, also with numerical names, which are actually the database's relations, tables and indices.
The relation's file name is initially set from the relation's object id. Any file altering operation, like VACUUM FULL or REINDEX, will generate a new file with a different name. To find out the real relation's file name, the relfilenode field inside the pg_class system table must be queried.
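For example, a query like the following, run inside the owning database, shows the file name on disk for a given relation; the table name is of course just a placeholder:

```sql
-- relfilenode is the actual file name inside base/<database oid>/
SELECT relname, relfilenode
FROM pg_class
WHERE relname = 't_example';
```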


global

The global directory contains all the cluster wide relations. In addition there's the very critical control file mentioned in 3.1.7. This small file is exactly one database block in size, usually 8192 bytes, and contains critical information for the cluster. With a corrupted control file the instance cannot start. The control file is usually written when a checkpoint occurs.


pg_xlog

This is the most important and critical directory, for the performance and for the reliability. The directory contains the transaction logs, named wal files. Each file is usually 16 MB and contains all the data blocks changed during the database activity. The blocks are written first to this non volatile area to ensure the cluster's recovery in case of crash. The data blocks are then written later to the corresponding data files. If the cluster's shutdown is not clean then the wal files are replayed during the startup process from the last known consistent location read from the control file.
In order to ensure good performance this directory should stay on a dedicated device.


pg_clog

This directory contains the committed transactions' status in small 8k files, except for the serializable transactions. The files are managed by the cluster and their amount is related to the two GUC parameters autovacuum_freeze_max_age and vacuum_freeze_table_age. Increasing the values of the two parameters, the pg_clog must store the commit status up to the ``event horizon'' of the oldest frozen transaction id. More information about vacuum and the maintenance is in chapter 8.


pg_serial

Same as pg_clog, this directory stores the information about the transactions committed in the serializable transaction isolation level.


pg_multixact

Stores the information about the multi transaction status, generally used for the row share locks.


pg_notify

Stores the information about the LISTEN/NOTIFY operations.


pg_snapshots

This directory is used to store the exported snapshots. Since version 9.2 PostgreSQL offers the transaction snapshot export, where one session can open a transaction and export a consistent snapshot. This way different sessions can access the snapshot and all read the same consistent data snapshot. This feature is used, for example, by pg_dump for the parallel export.
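The mechanism looks like this minimal sketch; the snapshot id shown in the second session is just an example of the format returned by the first one:

```sql
-- session 1: open a transaction and export its snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();

-- session 2: import the id returned above to see the same data
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '000003A1-1';
```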


pg_stat

This directory contains the permanent files for the statistics subsystem.


pg_stat_tmp

This directory contains the temporary files for the statistics subsystem. As this directory is constantly written, it is very likely to become an IO bottleneck. Setting the GUC parameter stats_temp_directory to a ramdisk can improve the database performance.
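For example, on Linux one could mount a small tmpfs and point the parameter there; the mount point and the size are just examples:

```
# /etc/fstab: a 64 MB ramdisk for the statistics temporary files
tmpfs  /var/lib/pgsql_stats_tmp  tmpfs  size=64M,uid=postgres,gid=postgres  0 0

# postgresql.conf
stats_temp_directory = '/var/lib/pgsql_stats_tmp'
```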


pg_subtrans

Stores the subtransactions' status data.


pg_twophase

Stores the two phase commit data. The two phase commit allows a transaction to become independent from the session. This way even a different session can commit or roll back the transaction later.


pg_tblspc

The directory contains the symbolic links to the tablespace locations. A tablespace is a logical name pointing to a physical location. As of PostgreSQL 9.2 the location is read directly from the symbolic link. This makes it possible to change the tablespace's position by simply stopping the cluster, moving the data files to the new location, creating the new symlink and starting the cluster. More information about the tablespace management is in chapter 7.

Wednesday 16 April 2014

2048: PostgreSQL Edition

Do you know Gabriele Cirulli's addictive game 2048?

I've found this website where anyone can customise the labels and the messages, so I created a 2048 PostgreSQL edition where the major releases, from 7.3 up to 9.3, are used instead of numbers.

If you are a PostgreSQL DBA worthy of the name you shouldn't have any problems with the versions' progression ;)

Sunday 13 April 2014

Chapter 4 - part 5 - The memory

The memory

The PostgreSQL's memory structure is not as complex as other databases'. In this section we'll take a look at the various parts.

The shared buffer

The shared buffer, as the name suggests, is the segment of shared memory used by PostgreSQL to manage the data pages. Its size is set using the GUC parameter shared_buffers and is allocated during the startup process. Any change requires an instance restart.
The segment is formatted in blocks of the same size as the data file's blocks, usually 8192 bytes. Each backend connected to the cluster is attached to this segment. Because its size is usually a fraction of the cluster's size, a simple but very efficient mechanism keeps the blocks in memory using a combination of LRU and MRU.
Since version 8.3 a protection mechanism is present to avoid the massive block eviction when intensive IO operations, like vacuum or big sequential reads, happen.
Each database operation, read or write, is performed moving the blocks via the shared buffer. This ensures an effective caching process, and the memory routines guarantee consistent reads and writes at any time.
PostgreSQL, in order to protect the shared buffer from potential corruption, by default resets all the connections if any unclean disconnection happens.
This behaviour can be disabled in the configuration file, but it exposes the shared buffer to data corruption if the unclean disconnections are not correctly managed.

The work memory

This memory segment is allocated per user and its default value is set using the GUC parameter work_mem. The value can be altered for the session on the fly; when changed in the global configuration file it becomes effective for the next transaction after the instance reload. This segment is used mainly for expensive operations like sorts or hashes.
If the operation's memory usage exceeds the work_mem value then PostgreSQL switches to a disk based sort/hash.
Increasing the work_mem value generally results in better performance for sort/hash operations.
Because it is a per user memory segment, the potential amount of memory required in a running instance is max_connections * work_mem. It's very important to set this value to a reasonable size in order to avoid any risk of out of memory errors or unwanted swap.
In complex queries it is likely to have many sort or hash operations in parallel, and each one consumes its own amount of work_mem for the session.
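For example, the segment can be enlarged on the fly for a single expensive query; the value, the table and the column are purely illustrative:

```sql
-- effective immediately, only for the current session
SET work_mem = '256MB';
-- a large sort now has a better chance to happen in memory
SELECT * FROM t_big ORDER BY ts_created;
```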

The maintenance work memory

The maintenance work memory is set using the GUC parameter maintenance_work_mem and follows the same rules as work_mem. This memory segment is allocated per user and is used for the maintenance operations like VACUUM or REINDEX. As this kind of operation usually happens on one relation at a time, this parameter can be safely set to a bigger value than work_mem.

The temporary memory

The temporary memory is set using the GUC parameter temp_buffers. This is the amount of memory per user for the temporary table creation, before the disk is used. Same as for the work memory and the maintenance work memory, it's possible to change the value for the current session, but only before any temporary table creation; after that the parameter cannot be changed anymore.
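For example, the value can be raised for the session as long as no temporary table exists yet; the size and the table are just illustrative:

```sql
-- must run before the first temporary table of the session
SET temp_buffers = '64MB';
CREATE TEMPORARY TABLE t_work (id integer, payload text);
```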

Saturday 5 April 2014

Chapter 4 - part 4 - The processes

Since the early version 7.4, when the only process running the cluster was the old beloved postmaster, PostgreSQL has been enriched with new dedicated processes, becoming more complex but also more efficient. With a running cluster there are at least six postgres processes; the one without a colon in the process name is the main database process, started as seen in section 4.2.

postgres: checkpointer process

As the name suggests this process takes care of the cluster's checkpoint. The checkpoint is an important event in the database activity: when a checkpoint starts all the dirty pages in memory are written to the data files. The checkpoint is triggered by the time and by the number of the cluster's WAL switches. To adjust the checkpoint's frequency the GUC parameters checkpoint_timeout and checkpoint_segments are used. A third parameter, checkpoint_completion_target, is used to spread the checkpoint over a percentage of the checkpoint_timeout, in order to avoid a massive disk IO spike.
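The three parameters live in postgresql.conf; a possible combination, spreading each checkpoint over half of the timeout, could be the following, with values purely illustrative and to be tuned on the actual workload:

```
checkpoint_timeout = 10min
checkpoint_segments = 32
checkpoint_completion_target = 0.5
```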

postgres: writer process

To ease down the checkpoint activity, the background writer scans the shared buffer for dirty pages to write down to disk. The process is designed to have a minimal impact on the database activity. It's possible to tune the rounds' length and delay using the GUC parameters bgwriter_delay, the time between two rounds, and bgwriter_lru_maxpages, the number of buffers written before the writer's sleep.

postgres: wal writer process

This background process has been introduced recently to have a more efficient wal writing. The process works in rounds where it writes down the wal buffers to the wal files. The GUC parameter wal_writer_delay sets the milliseconds to sleep between the rounds.

postgres: autovacuum launcher process

This process is present if the GUC parameter autovacuum is set to on. Its scope is to launch the autovacuum backends when needed. Anyway the autovacuum can run even if the parameter is turned off, when there's a risk of the XID wraparound failure.

postgres: stats collector process

The process gathers the database's usage statistics and stores the information in the location indicated by the GUC stats_temp_directory, by default pg_stat_tmp. Those statistics are useful to understand how the database is performing, from the physical and the logical point of view.

postgres: postgres postgres [local] idle

This kind of process is the database backend, one for each established connection. The values after the colon show useful information like the connected database, the username, the host and the executing query. The same information is stored in the pg_stat_activity view.
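The backends' activity can be inspected with a query like the following; note that from version 9.2 the column holding the statement is named query, previously current_query:

```sql
-- one row per backend, same data shown in the process titles
SELECT datname, usename, client_addr, query
FROM pg_stat_activity;
```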