Saturday, 26 April 2014

Chapter 5 part 1 - The connection

The connection

When a client tries to connect to a PostgreSQL cluster the process follows a few stages which can result in rejection or connection. The first stage is the host based authentication, where the cluster scans the pg_hba.conf file searching for a match with the connection's parameters, like the host, the username and so on. This file is usually stored in the data area alongside the postgresql.conf file and is read from top to bottom. If there's a match the corresponding method is used; if there's no match the connection is refused.
The pg_hba.conf file is structured as shown in table 5.1

Table 5.1: pg_hba.conf fields and possible values
Type        Database    User    Address                    Method
local       name        name    ipaddress/network mask     trust
host        *           *       host name                  reject
hostssl                                                    md5
hostnossl                                                  password
                                                           gss
                                                           sspi
                                                           krb5
                                                           ident
                                                           peer
                                                           pam
                                                           ldap
                                                           radius
                                                           cert

The type column specifies whether the connection is local, happening via unix socket, or one of host, hostssl, hostnossl, in which case TCP/IP is used. The host type matches either an SSL or a plain connection, hostssl only an SSL connection and hostnossl only a plain connection.
The Database and User columns are used to match specific databases or users from the incoming connection. It's possible to use the wildcard * to match everything.
The address column is used only when the type is TCP/IP based and can be an IP address with network mask or a hostname. Both IPv4 and IPv6 are supported.
The last column is the authentication method. PostgreSQL supports many methods, from the simple password challenge to sophisticated ones like radius or kerberos.
For now we'll take a look at the most common ones; an example configuration follows the list.

  • trust allows the connection without any password request. It's quite useful if the password is lost but represents a threat on production.
  • peer allows the connection if the OS user matches the database user. It's useful to authenticate to the database on local boxes. Initdb sets this as the default method for the local connections.
  • password allows the connection by matching the user and password against the pg_shadow system table. Beware, as this method sends the password in clear over the network.
  • md5, same as password, checks the credentials against pg_shadow but offers md5 encryption for the passwords. As md5 is deterministic, a pseudo random salt is used during the password challenge to avoid the same string being sent over the network.
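For example, a minimal pg_hba.conf could look like the following sketch; the database and user names are purely hypothetical and the addresses must be adapted to the actual network.

  # TYPE  DATABASE        USER            ADDRESS                 METHOD

  # local connections via unix socket, authenticated with the OS user name
  local   all             all                                     peer
  # local tcp connections, md5 password challenge
  host    all             all             127.0.0.1/32            md5
  # the hypothetical db_sales database, reachable by the hypothetical
  # user app_user from the 192.168.1.0/24 network, md5 password challenge
  host    db_sales        app_user        192.168.1.0/24          md5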
When the connection request matches the pg_hba.conf and the authentication method succeeds, the connection becomes established. The postgres main process forks a new backend process which attaches to the shared buffer.
As forking a process is expensive, the connection is a potential bottleneck. A great amount of opening connections can degenerate into zombies, resulting in the worst case in a denial of service.
A solution could be to keep all the needed connections constantly established. Even if it seems reasonable, this approach has a couple of unpleasant side effects.
Each connection slot configured in the GUC max_connections consumes about 400 bytes of shared memory; each established connection requires the allocation of work_mem and the OS management of an extra backend process.

For example a 512 MB shared_buffers and a 100 MB work_mem, with 500 established connections, consume about 49 GB. Even reducing the work_mem to 10 MB the required memory is still 5 GB; in addition, as seen in 4.5.2, this parameter affects the sorts and consequently the performance, so its value requires extra care.
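The configured limits and the currently established connections can be quickly checked from psql; this is just a minimal sketch using the standard catalogue views.

  -- configured connection limit and per sort memory
  SHOW max_connections;
  SHOW work_mem;
  -- currently established connections
  SELECT count(*) FROM pg_stat_activity;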
In this case a connection pooler like pgpool http://www.pgpool.net/ or the lightweight pgbouncer http://pgfoundry.org/projects/pgbouncer is a good solution. In particular the latter offers a very simple configuration, with different pooling levels.
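For example, a minimal pgbouncer configuration could look like the following sketch; the database name, the paths and the pool sizes are purely hypothetical and must be adapted to the actual setup.

  [databases]
  ; clients connecting to db_sales are routed to the local cluster
  db_sales = host=127.0.0.1 port=5432 dbname=db_sales

  [pgbouncer]
  listen_addr = 127.0.0.1
  listen_port = 6432
  auth_type = md5
  auth_file = /etc/pgbouncer/userlist.txt
  ; session is the safest pooling level, transaction is more aggressive
  pool_mode = session
  max_client_conn = 500
  default_pool_size = 20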
The GUC parameter listen_addresses in a freshly initialised data area is set to localhost. This way the cluster accepts only TCP connections from the local machine. In order to have the cluster listening on the network this parameter must be changed to the correct address(es) to listen on, or to * for 'all'. The parameter accepts multiple values separated by commas.
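For example, in postgresql.conf; the network address below is of course hypothetical.

  # listen on the loopback and on one network interface
  listen_addresses = 'localhost, 192.168.1.10'
  # or listen on all the available interfaces
  #listen_addresses = '*'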
Changing the parameters max_connections and listen_addresses requires the cluster's shutdown and startup as described in 4.2 and 4.3.

Saturday, 19 April 2014

Chapter 4 - part 6 - The data area

As seen before, the data storage area is initialised by initdb. Its structure hasn't changed much since the old fashioned 7.4. In this section we'll take a look at the various subdirectories and how their usage can affect the performance.


base

As the name suggests, the base directory contains the database files. Each database has a dedicated subdirectory, named after the internal database's object id. A freshly initialised data directory shows only three subdirectories in the base folder. Those correspond to the two template databases, template0 and template1, plus the postgres database. Take a look at section 4.7 for more information.
The numerical directories contain various files, also numerically named, which are actually the database's relations, tables and indices.
The relation's file name is set initially from the relation's object id. Any file altering operation, like VACUUM FULL or REINDEX, will generate a new file with a different name. To find out the relation's actual file name the relfilenode inside the pg_class system table must be queried.
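For example, assuming a hypothetical table called t_orders, the database's subdirectory and the relation's file can be found with queries like these.

  -- database object id, matching the subdirectory in base
  SELECT oid, datname FROM pg_database WHERE datname = current_database();
  -- relation's file name inside the database's subdirectory
  SELECT relname, relfilenode FROM pg_class WHERE relname = 't_orders';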


global

The global directory contains all the cluster wide relations. In addition there's the very critical control file mentioned in 3.1.7. This small file is exactly one database block in size, usually 8192 bytes, and contains critical information for the cluster. With a corrupted control file the instance cannot start. The control file is usually written when a checkpoint occurs.


pg_xlog

This is the most important and critical directory, for the performance and for the reliability. The directory contains the transaction logs, the WAL files. Each file is usually 16 MB and contains all the data blocks changed during the database activity. The blocks are written first to this non volatile area to ensure the cluster's recovery in case of crash. The data blocks are then written later to the corresponding data files. If the cluster's shutdown is not clean then the wal files are replayed during the startup process from the last known consistent location read from the control file.
In order to ensure good performance this location should stay on a dedicated device.


pg_clog

This directory contains the committed transactions' status in small 8k files, except for the serializable transactions. The files are managed by the cluster and their amount is related to the two GUC parameters autovacuum_freeze_max_age and vacuum_freeze_table_age. Increasing the values for the two parameters requires the pg_clog to store the commit status back to the ``event horizon'' of the oldest frozen transaction id. More information about vacuum and the maintenance is in chapter 8.
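A quick way to see how far this ``event horizon'' currently is for each database is to check the age of the frozen transaction id, for example:

  -- age, in transactions, of the oldest frozen transaction id per database
  SELECT datname, age(datfrozenxid) FROM pg_database;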


pg_serial

Similar to pg_clog, this directory stores the information about the committed transactions in the serializable transaction isolation level.


pg_multixact

Stores the information about the multi transaction status, generally used for the row share locks.


pg_notify

Stores the information about the LISTEN/NOTIFY operations.


pg_snapshots

This directory is used to store the exported snapshots. From version 9.2 PostgreSQL offers the transaction snapshot export, where one session can open a transaction and export a consistent snapshot. This way different sessions can import the snapshot and read all together the same consistent data. This feature is used, for example, by pg_dump for the parallel export.
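A minimal sketch of the feature, run from two different psql sessions; the snapshot identifier and the table name are just illustrative.

  -- session 1: open a transaction and export its snapshot
  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
  SELECT pg_export_snapshot();
  -- returns an identifier like 000004F2-1 to pass to the other sessions

  -- session 2: import the snapshot and read the same consistent data
  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
  SET TRANSACTION SNAPSHOT '000004F2-1';
  SELECT count(*) FROM t_orders;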


pg_stat

This directory contains the permanent files for the statistics subsystem.


pg_stat_tmp

This directory contains the temporary files for the statistics subsystem. As this directory is constantly written, it's very likely to become an IO bottleneck. Pointing the GUC parameter stats_temp_directory to a ramdisk can improve the database performance.
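For example, assuming a tmpfs mounted on a hypothetical path, in postgresql.conf:

  # statistics temporary files on a ramdisk (hypothetical mount point)
  stats_temp_directory = '/var/run/postgresql/pg_stat_tmp'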


pg_subtrans

Stores the subtransactions status data.


pg_twophase

Stores the two phase commit data. The two phase commit allows a transaction to become independent from the session which opened it. This way even a different session can commit or roll back the transaction later.
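A minimal sketch, using a hypothetical transaction identifier and table; note that the GUC parameter max_prepared_transactions must be greater than zero.

  -- session 1: prepare the transaction instead of committing it
  BEGIN;
  UPDATE t_orders SET amount = amount * 2 WHERE order_id = 10;
  PREPARE TRANSACTION 'tx_orders_fix';

  -- any session, even after the first one disconnected
  COMMIT PREPARED 'tx_orders_fix';
  -- or ROLLBACK PREPARED 'tx_orders_fix';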


pg_tblspc

The directory contains the symbolic links to the tablespace locations. A tablespace is a logical name pointing to a physical location. As of PostgreSQL 9.2 the location is read directly from the symbolic link. This makes it possible to change the tablespace's position simply by stopping the cluster, moving the data files to the new location, recreating the symlink and starting the cluster. More information about the tablespace management is in chapter 7.
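For example, a tablespace can be created and its physical location inspected with queries like these; the path and the names are hypothetical and the target directory must already exist and be owned by the postgres OS user.

  -- create a tablespace on a dedicated device (hypothetical path)
  CREATE TABLESPACE ts_fast LOCATION '/mnt/ssd/pg_tblspc_fast';
  -- list the tablespaces with their physical location (9.2 and later)
  SELECT spcname, pg_tablespace_location(oid) FROM pg_tablespace;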