Wednesday 25 June 2014

Chapter 7 - Part 4, VACUUM FULL and CLUSTER

VACUUM FULL and CLUSTER

The CLUSTER command rebuilds a completely new table with the tuples in the same order as the clustered index. The clustered index can be set using the command ALTER TABLE table_name CLUSTER ON index_name and is used as the sort key in the next CLUSTER. In order to get a clearer picture, let's cluster the table created in 7.1 on its primary key.
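A minimal sketch of the sequence, assuming a hypothetical table t_data from 7.1 with its primary key index t_data_pkey; the original example may use different names.

 -- t_data and t_data_pkey are hypothetical names
 postgres=# ALTER TABLE t_data CLUSTER ON t_data_pkey;
 ALTER TABLE
 postgres=# CLUSTER t_data;
 CLUSTER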




This CLUSTER run performs a full index scan on the clustered index, avoiding a table sort. The tuples are then stored into a new file node. When the build is complete the relation's file node is swapped in the system catalogue and the indices are rebuilt. When the command completes the old file node is removed. The entire process requires an exclusive lock on the table, preventing both reads and writes. Storage is also a critical point because the disk space required is that of the old relation plus the new one with its indices, plus any on-disk sort.


Looking at the source code in src/backend/commands/cluster.c, the file's header clearly states CLUSTER a table on an index. This is now also used for VACUUM FULL. The only difference between VACUUM FULL and CLUSTER is the validity of the clustered index's OID. If it's valid then the output data is sorted on the clustered index; how the data is sorted is determined by the planner, which chooses the cheaper option between an index scan and a sequential scan followed by a sort. Otherwise, if the index's OID is invalid, the tuples are read with a plain sequential scan.


VACUUM FULL and CLUSTER have beneficial effects on the storage as the space is returned to the operating system. Also, completely regenerating the relation's files, along with the reindex, makes the page access more efficient, and CLUSTER rebuilds the table in the clustered index's order. This minimises the random disk seeks when accessing the data via the clustered index.

The disadvantage of using those commands is the complete stop of the affected table's activity. Also, CLUSTER and VACUUM FULL do not address the XID wraparound risk the way the conventional VACUUM does.


As a rule of thumb, in order to minimise the database's downtime, CLUSTER and VACUUM FULL should be used only for extraordinary maintenance and only if the disk space is critical. For the day to day maintenance it's best to rely on VACUUM and, occasionally, the reindex as seen in 7.1 and 7.3.

Sunday 22 June 2014

Chapter 7 - Part 3, reindex

reindex

A B-tree index entry carries the indexed value with the pointer to the corresponding heap page where the full tuple is stored. The pages are organised in the form of a balanced tree, referring to each other using the special space seen in 6.1. As long as the heap tuple does not change page, the index entry doesn't need an update. The HOT strategy tries to achieve this goal by keeping the heap tuples in the same page. When the heap tuple does change page, the index entry needs to be updated as well. With an index page fillfactor of 70 for the non-leaf pages and of 90 for the leaf pages, the index tuple's page changes are somewhat limited.

VACUUM is not so effective with the indices: even when it clears down the dead tuple's space, this space is reusable only if compatible with the B-tree position. Only the completely empty pages can be recycled and this requires at least two VACUUM runs, because when an index page becomes empty it is marked as deleted, stamped with the next XID and becomes invisible. The empty page is not immediately recycled because there might be running scans which need to access the page, still visible in their snapshot. The next VACUUM run will ensure the page is no longer required by the running transactions and then mark it as reusable.

For those reasons the indices are affected by data bloat more than the tables. Alongside a bigger disk space allocation, the bloat results in generally bad index performance. The reindex is then the only solution to restore an index to good shape.

Unlike VACUUM, REINDEX has a substantial impact on the database activity. To ensure the data is read consistently, REINDEX sets a table lock which prevents the table's writes. Reads are also blocked for the SELECT queries using the index being rebuilt.

A B-tree index build requires a data sort to build the balanced tree. PostgreSQL comes with a handy GUC parameter to track the sort, trace_sort. The messages emitted by the trace have DEBUG severity, so client_min_messages also needs adjustment in order to get the information from the trace. Let's then reindex the table's primary key seen in 7.1.
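A minimal sketch of the session setup and the reindex, again assuming the hypothetical t_data table and its primary key t_data_pkey; the sort trace will appear among the session's DEBUG/LOG output.

 -- hypothetical table and index names
 postgres=# SET trace_sort='on';
 SET
 postgres=# SET client_min_messages='debug';
 SET
 postgres=# REINDEX INDEX t_data_pkey;
 REINDEX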



The reindex performs a data sort which does not fit in the maintenance_work_mem. PostgreSQL then starts a slower disk sort to build up the index. The first LOG entry with begin index sort: shows the maintenance_work_mem available for the index sort. If the available memory is exhausted after the table scan then an external sort on disk will happen; otherwise a faster sort in memory will build the index. Increasing the maintenance_work_mem can therefore improve the reindex. Unfortunately, determining the value at which the in-memory sort happens is not simple and can only be guessed from the index size. The previous reindex with 1 GB of maintenance_work_mem runs 40% faster.



The reindex creates a completely new filenode for the index and, when the build is complete, the pg_class entry is updated with the new relfilenode's value and the old filenode is deleted. The entire sequence can be emulated by creating a new index with a different name using the CREATE INDEX statement. After the index is ready, dropping the old one and renaming the new index to the old name will result in a brand new index without blocking the reads.
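A minimal sketch of the swap, assuming a hypothetical non-unique index idx_t_data_value on the t_data table.

 -- hypothetical table and index names
 postgres=# CREATE INDEX idx_t_data_value_new ON t_data (t_value);
 CREATE INDEX
 postgres=# DROP INDEX idx_t_data_value;
 DROP INDEX
 postgres=# ALTER INDEX idx_t_data_value_new RENAME TO idx_t_data_value;
 ALTER INDEX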

Since version 8.2 PostgreSQL supports CREATE INDEX CONCURRENTLY, which blocks neither reads nor writes. Using this method, the index creation starts by adding a new invalid index in the system catalogue, then a first table scan builds the entries without caring for the changing data. When the first build is complete a second table scan fixes the inconsistent tuple references and finally the index's status is set to valid, making it available. This approach, combined with the swap strategy, can greatly limit the impact of the index maintenance.

The concurrent index build has indeed some caveats and limitations.
  • Any problem with the table scan will make the command fail, leaving behind an invalid index which is ignored for the reads but adds overhead for the inserts and updates.
  • When building a unique index concurrently, the uniqueness is enforced as soon as the second table scan starts. Some transactions could then start reporting uniqueness violations before the index becomes available, and if the build fails during the second scan the invalid index continues to enforce the uniqueness.
  • Regular index builds can run in parallel on the same table. Concurrent index builds cannot.
  • Concurrent index builds cannot run within a transaction block.
With the primary keys and unique constraints it is also possible to use the index swap strategy, with a little extra trick. Since version 9.1 PostgreSQL supports the ALTER TABLE table_name ADD table_constraint using_index statement. Combining this statement in a single command with a DROP CONSTRAINT, it's possible to swap the constraint's index without losing the uniqueness enforcement.
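A minimal sketch of the swap for a primary key, assuming the hypothetical t_data table keyed on i_id; the new unique index is attached to the constraint in the same command that drops the old one.

 -- hypothetical names; the new index is renamed after the constraint by the ADD CONSTRAINT
 postgres=# CREATE UNIQUE INDEX idx_t_data_pkey_new ON t_data (i_id);
 CREATE INDEX
 postgres=# ALTER TABLE t_data
        DROP CONSTRAINT t_data_pkey,
        ADD CONSTRAINT t_data_pkey PRIMARY KEY USING INDEX idx_t_data_pkey_new;
 ALTER TABLE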



The example uses a regular index build and therefore blocks the writes. It's also possible to build the new index concurrently.
While this method works well for unreferenced primary or unique keys, any foreign key referencing the unique field will cause the DROP CONSTRAINT to fail.



In this case the safest way to proceed is the conventional REINDEX.

Tuesday 17 June 2014

Chapter 7 - Part 2, analyze

analyze

The PostgreSQL query optimiser is based on cost estimates. When building the execution plans the planner consults the internal statistics and assigns to each plan node an estimated cost. The plan with the smallest total estimated cost is then executed. Having up to date and accurate statistics helps the database to keep up the performance.
The ANALYZE command is used to gather the usage statistics. When launched it reads the data, builds up the statistics and stores them into the pg_statistic system table. The command accepts the optional clause VERBOSE to increase verbosity, plus an optional target table and column list. If ANALYZE is launched with no parameters it processes all the tables in the database. Launching ANALYZE with the table name only will process all the table's columns.

When working on large tables ANALYZE runs a random sample read on a portion of the table. The GUC parameter default_statistics_target determines the amount of entries read by the sample. The default limit is 100. Increasing the value will cause the planner to get better estimates, in particular for columns with irregularly distributed data. This accuracy costs more time for the statistics gathering and more space, because it requires a bigger storage in the pg_statistic table.

To show how the default_statistics_target can affect the estimates, let's run an ANALYZE VERBOSE with the default setting on the table created in 7.1.
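The commands are sketched below for the hypothetical t_data table; the VERBOSE output with the scanned pages and the row estimate is not reproduced here.

 -- 100 is the default value
 postgres=# SET default_statistics_target=100;
 SET
 postgres=# ANALYZE VERBOSE t_data;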




Even if the table has 10 million rows, the analyze estimates only 2,909,979 rows, about 30% of the total effective storage.
Changing the default_statistics_target to its maximum value of 10000, ANALYZE will get better estimates.
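Again only the commands are sketched, on the same hypothetical table.

 postgres=# SET default_statistics_target=10000;
 SET
 postgres=# ANALYZE VERBOSE t_data;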






This time the table estimate is correctly determined at 10 million live rows.
The pg_statistic table stores the gathered data for internal database usage only. To help users and administrators, the view pg_stats provides a human readable visualisation of the gathered statistics.

As a general rule, before starting any performance tuning, it's important to check whether the database statistics are recent and accurate. The information is stored in the view pg_stat_all_tables 7.2.
For example this query gets the last execution of the manual and the auto vacuum, together with the analyze and auto analyze, for a given table.
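A possible form of the query, sketched here for the hypothetical t_data table.

 postgres=# 
 SELECT
        relname,
        last_vacuum,
        last_autovacuum,
        last_analyze,
        last_autoanalyze
 FROM
        pg_stat_all_tables
 WHERE
        relname='t_data';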



The statistics target can be set per column to fine tune the ANALYZE with the ALTER TABLE SET STATISTICS statement.
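For example, a sketch on the hypothetical t_data table and its t_value column.

 postgres=# ALTER TABLE t_data ALTER COLUMN t_value SET STATISTICS 1000;
 ALTER TABLE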




The SET command can be used, as seen before, to change the default statistics target for the current session. Otherwise it is possible to change the value cluster wide by changing the parameter in the postgresql.conf file.

Sunday 15 June 2014

Chapter 7 - Part 1 - VACUUM

Maintenance

The database maintenance is crucial for the efficiency of the data access, the integrity and the reliability. Any database sooner or later will need a proper maintenance plan.
When a new tuple version is generated by an update it can be put anywhere there's free space. Frequent updates can result in tuples moving across the data pages many times, leaving a trail of dead tuples behind them. Because the dead tuples are physically stored but no longer visible, this creates extra overhead causing the table to bloat. Indices make things more complicated because when a tuple changes page the index entry is updated to point to the new page and, because of the index's ordered structure, the bloating is more likely than for the table.

vacuum

VACUUM is a PostgreSQL specific command which reclaims the dead tuples' space. When called without specifying a target table, the command processes all the tables in the database. Running VACUUM regularly has some beneficial effects.
  • It reclaims back the dead tuple's disk space.
  • It updates the visibility map making the index scans run faster.
  • It freezes the tuples with an old XID, protecting from the XID wraparound data loss.
The optional ANALYZE clause also gathers the statistics on the processed table; more details in 7.2.

A standard VACUUM run frees the space used by the dead rows inside the data files but doesn't return the space to the operating system. VACUUM doesn't affect the common database activity but prevents any schema change on the processed table. Because the pages are rewritten, a VACUUM run substantially increases the I/O activity.

One or more empty pages at the table's end can be removed by VACUUM if an exclusive lock on the relation can be obtained immediately. When this happens the table is scanned backwards to find all the empty pages and then truncated to the first non-empty page. The index pages are scanned as well and the dead tuples are cleared there too. The VACUUM truncate scan works only on the heap data files. VACUUM's performance is influenced by the maintenance_work_mem only if the table has indices; otherwise VACUUM will run the cleanup sequentially without storing the tuples' references for the index cleanup.

To show the effect of the maintenance_work_mem let's build a simple table with 10 million rows.
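A minimal sketch of a possible build; the name t_data and the two column layout are assumptions, the original example may differ.

 -- hypothetical table name and layout
 postgres=# CREATE TABLE t_data
        (
                i_id    serial PRIMARY KEY,
                t_value text
        );
 CREATE TABLE
 postgres=# INSERT INTO t_data (t_value)
        SELECT  md5(i::text)
        FROM    generate_series(1,10000000) AS i;
 INSERT 0 10000000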



To have a static environment we'll disable the table's autovacuum. More info on autovacuum in 7.5. We'll also increase the session's verbosity to see what's happening during the VACUUM run.
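A sketch of the session setup, on the hypothetical t_data table.

 postgres=# ALTER TABLE t_data SET (autovacuum_enabled = false);
 ALTER TABLE
 postgres=# SET client_min_messages='debug';
 SET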




We are now executing a complete table rewrite, running an UPDATE without the WHERE condition. This will create 10 million dead rows.
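For example, on the hypothetical t_data table.

 postgres=# UPDATE t_data SET t_value=md5(t_value);
 UPDATE 10000000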






Before running the VACUUM we'll change the maintenance_work_mem to a small value, enabling the timing to check the query duration.
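A sketch of the setup and the run; the actual verbose output and timings are not reproduced here.

 postgres=# SET maintenance_work_mem='2MB';
 SET
 postgres=# \timing
 Timing is on.
 postgres=# VACUUM VERBOSE t_data;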




During the VACUUM the maintenance_work_mem is used to store an array of TIDs referencing the dead tuples for the index cleanup. If the maintenance_work_mem is small and the dead rows are many, the memory fills up often. When this happens the table scan pauses and the index is scanned searching for the tuples stored in the array. When the index scan is complete the array of TIDs is emptied and the table scan resumes. Increasing the maintenance_work_mem to 2 GB, the index scan is executed in one single run, resulting in a VACUUM 32 seconds faster.








A table without indices does not use the maintenance_work_mem. For example, if we run the VACUUM after dropping the table's primary key, the execution is faster even with the low maintenance_work_mem setting.
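A sketch of the sequence, again with the hypothetical names.

 postgres=# ALTER TABLE t_data DROP CONSTRAINT t_data_pkey;
 ALTER TABLE
 postgres=# UPDATE t_data SET t_value=md5(t_value);
 UPDATE 10000000
 postgres=# VACUUM VERBOSE t_data;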





The table seen in the example begins with a size of 806 MB. After the update the table doubles its size, and the size remains the same across the VACUUM runs and the updates. This happens because after the first insert the table had all the rows packed together; the update added the new row versions at the table's bottom, leaving the previous 10 million rows at the table's top as dead tuples. The VACUUM run cleared the space at the table's top but wasn't able to truncate because all the rows were packed at the table's bottom. Running a new UPDATE followed by a VACUUM would free the space at the table's bottom and a truncate scan would then succeed, but only if no tuple remains in the free space at the table's end. To check whether the vacuum is running effectively, the tables should show an initial growth followed by substantial size stability over time. This happens only if the new row versions are generated at the same rate as the old rows are cleared down.

The XID wraparound failure protection is performed automatically by VACUUM: when it finds a live tuple with a t_xmin older than the GUC parameter vacuum_freeze_min_age, it replaces the tuple's creation XID with the FrozenXID, preserving the tuple's visibility forever. Because VACUUM by default skips the pages without dead tuples, it will miss some ageing tuples. That's the reason for a second GUC parameter, vacuum_freeze_table_age, which triggers a VACUUM full table scan when the table's relfrozenxid age exceeds the value.

VACUUM accepts the FREEZE clause which forces a complete tuple freeze regardless of the age. That's equivalent to running the VACUUM with vacuum_freeze_min_age set to zero.
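For example, on the hypothetical t_data table the two following runs are equivalent.

 postgres=# VACUUM FREEZE t_data;
 VACUUM
 postgres=# SET vacuum_freeze_min_age=0;
 SET
 postgres=# VACUUM t_data;
 VACUUM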

There are a few GUC parameters controlling VACUUM.

vacuum_freeze_table_age

As seen before, this parameter triggers a whole table scan if the table's pg_class.relfrozenxid field is older than the parameter's value. The default is 150 million transactions. The accepted values are from zero to one billion. Anyway VACUUM will silently limit the effective value to 95% of autovacuum_freeze_max_age, in order to give the manual VACUUM more chance to run before an anti-wraparound autovacuum.

vacuum_freeze_min_age

Sets the age in transactions at which VACUUM replaces the tuple's transaction IDs with the FrozenXID. The default is 50 million transactions. Same as for vacuum_freeze_table_age, the accepted values are from zero to one billion. With this parameter VACUUM will silently limit the effective value to half the value of autovacuum_freeze_max_age, in order to increase the time between the forced autovacuums.

vacuum_multixact_freeze_table_age

From 9.3 VACUUM also takes care of the multixact IDs, used to store the row locks in the tuple's header. Because the multixact ID is implemented in the same way as the XID, the value suffers from the same wraparound failure. This parameter sets the age limit at which VACUUM performs a whole table scan. The table's pg_class.relminmxid is used to check the transaction's age. The default is 150 million multixacts. Also in this case VACUUM limits the effective value to 95% of autovacuum_multixact_freeze_max_age, so that a periodical manual VACUUM has a chance to run before an anti-wraparound autovacuum is launched for the table.

vacuum_multixact_freeze_min_age

Sets the age in multixacts at which VACUUM replaces the multixact IDs with a newer transaction ID or multixact ID while scanning the table. The default is 5 million multixacts. The accepted values are from zero to one billion, but VACUUM will silently limit the effective value to half the value of autovacuum_multixact_freeze_max_age, in order to increase the time between the forced autovacuums.

vacuum_defer_cleanup_age

This parameter has effect only on the master in hot standby configurations. When set to a positive value on the master it can reduce the risk of query conflicts on the standby. It has no effect on the standby itself though.

vacuum_cost_delay

This parameter, if set to a non-zero value, enables the cost based vacuum delay and sets the time, in milliseconds, that the VACUUM process sleeps when the cost limit has been exceeded. The default value is zero, which disables the cost based vacuum delay feature.

vacuum_cost_limit

This parameter sets the cost limit value at which the internal counter puts the VACUUM process to sleep. The default value is 200.

vacuum_cost_page_hit

The counter's cost is determined using arbitrary values for each vacuum operation. This parameter sets the estimated cost for vacuuming one buffer found in the shared buffer cache. It represents the cost to lock the buffer pool, look up the shared hash table and scan the content of the page. The default value is one.

vacuum_cost_page_miss

This parameter sets the estimated cost for vacuuming a buffer not present in the shared buffer. It represents the effort to lock the buffer pool, look up the shared hash table, read the desired block in from the disk and scan its content. The default value is 10.

vacuum_cost_page_dirty

This parameter sets the estimated cost charged when vacuum changes a previously clean block. It represents the extra I/O required to flush the dirty block out to disk again. The default value is 20.

Wednesday 11 June 2014

Chapter 6 - part 6, MVCC

MVCC

The multiversion concurrency control is the access method used by PostgreSQL to provide the transactional model as seen in 5.7.
At logical level this is completely transparent to the user and the new row versions become visible after the commit, according to the transaction isolation level.

At physical level, each new row version has the insert's XID stamped in the t_xmin field. PostgreSQL's internal semantics make visible only the committed rows stamped with an XID lower than the current transaction's XID, because they are considered in the past. The rows with an XID greater than the current transaction's XID are considered in the future and are therefore invisible.

Because the XID is a 32 bit quantity, it wraps around at about 4 billion. When this happens, theoretically all the tuples should suddenly disappear because they switch from the current XID's past to its future. This is the XID wraparound failure, a serious problem for the older PostgreSQL versions, whose only fix was to initialise a new data area every 4 billion transactions and dump/reload the databases.

PostgreSQL 7.2 introduced the modulo-2^32 arithmetic for evaluating the XID age, where a special XID, the FrozenXID, is assumed to be always in the past; any given XID has 2 billion transactions in its future and 2 billion transactions in its past.

When the stamped t_xmin becomes old enough, VACUUM can freeze the tuple, stamping the FrozenXID and preserving it from disappearance. The pg_class and the pg_database tables have a dedicated field to track the oldest tuple inside the relation and the database, respectively relfrozenxid and datfrozenxid, where the oldest non-frozen XID's value is stored. The builtin function age() shows how many transactions lie between the current XID and the value stored in the system catalogue.

For example this is a query to get all the databases with the datfrozenxid and the age.
 postgres=# 
        SELECT 
                datname,
                age(datfrozenxid),
                datfrozenxid 
        FROM 
                pg_database;
    datname    | age  | datfrozenxid 
---------------+------+--------------
 template1     | 4211 |          679
 template0     | 4211 |          679
 postgres      | 4211 |          679
 db_test       | 4211 |          679
The datfrozenxid value is meaningful only through the age function, which shows the ``distance'' between the current XID and the datfrozenxid. PostgreSQL assigns a new XID only to the write transactions and only when the tuples are actually changed, in the so called ``lazy XID assignment''.

When a tuple's XID becomes older than 2 billion transactions, the tuple simply disappears, jumping from the current XID's past to its future. Before version 8.0 there was no prevention for this problem except the periodic cluster wide VACUUM. The later versions introduced a passive protection mechanism emitting messages in the activity log when the age of datfrozenxid is within ten million transactions of the wraparound point.
WARNING:  database "test_db" must be vacuumed within 152405486 transactions
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in 
"test_db".
Another active protection is the autovacuum daemon, which takes care of the affected tables and starts a VACUUM to freeze the tuples even if autovacuum is turned off. However, if something goes wrong and the datfrozenxid reaches one million transactions from the wraparound point, the cluster shuts down and keeps shutting down for each new transaction. When this happens the cluster can only be started in the single-user backend to execute the VACUUM.

To limit the effect of the data bloat, unavoidable with this implementation, PostgreSQL has a feature called HOT, which stands for Heap Only Tuples. The RDBMS tries to keep the updated tuples inside the same page, also avoiding any index reference update, if present. This is possible only if there's free space available. By default PostgreSQL, when inserting the tuples, fills up the pages completely; however it is possible to reserve a page portion for the updates with the fillfactor storage parameter. This is the percentage of the page to fill with the inserts, leaving the rest available for the updates. The default value for the heap pages is 100, complete packing. For the indices it is 70 for the non-leaf pages and 90 for the leaf pages, leaving some space available for the unavoidable updates. A smaller fillfactor will result, at insert time, in a bigger table but with a smaller growth rate when updated.
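A minimal sketch of the storage parameter, on a hypothetical table t_example; the new setting applies to the pages written after the change.

 -- t_example is a hypothetical table
 postgres=# ALTER TABLE t_example SET (fillfactor = 70);
 ALTER TABLE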

Finally, if MVCC is not carefully considered at design time, it can result in data bloat and generally poor performance. In chapter 7 we'll see how to keep the cluster in efficient condition, or at least how to try.

Monday 9 June 2014

Chapter 6 part 5 - Tablespaces

Tablespaces

PostgreSQL implements the tablespaces by creating symbolic links, pointing to the tablespace's location, in the pg_tblspc directory. The links are named after the tablespace's OID. The tablespaces are available only on systems supporting the symbolic links.
Up to PostgreSQL 9.1 the tablespace location was stored in the field spclocation of the pg_tablespace system table. The information was used only to dump the tablespace's definition during a pg_dump. The field was removed in version 9.2 and the function pg_tablespace_location(tablespace_oid), which returns the tablespace's absolute path from the tablespace OID, was introduced.

For example this query will return the tablespace's location seen in 5.6
 postgres=# 
 SELECT 
        pg_tablespace_location(oid),
        spcname 
FROM 
        pg_tablespace;
       pg_tablespace_location       |  spcname   
------------------------------------+------------
                                    | pg_default
                                    | pg_global
 /var/lib/postgresql/pg_tbs/ts_test | ts_test
(3 rows)
The function returns the empty string for the system tablespaces, pg_default and pg_global, because those have an immutable location, relative to the data directory. The data area's absolute path can be retrieved using the function current_setting.
 postgres=# SELECT current_setting('data_directory');
       current_setting        
------------------------------
 /var/lib/postgresql/9.3/main
(1 row)
Combining this with the CASE construct it is then possible to build up a useful query to look up the tablespace locations.
 postgres=# 
SELECT 
        CASE
                WHEN 
                                pg_tablespace_location(oid)=''
                        AND     spcname='pg_default'
                THEN
                        current_setting('data_directory')||'/base/'
                WHEN 
                                pg_tablespace_location(oid)=''
                        AND     spcname='pg_global'
                THEN
                        current_setting('data_directory')||'/global/'
        ELSE
                pg_tablespace_location(oid)
        END
        AS      spclocation,
                
        spcname 
FROM 
        pg_tablespace;
             spclocation              |  spcname   
--------------------------------------+------------
 /var/lib/postgresql/9.3/main/base/   | pg_default
 /var/lib/postgresql/9.3/main/global/ | pg_global
 /var/lib/postgresql/pg_tbs/ts_test   | ts_test
(3 rows)
Before version 8.4 the tablespace location pointed directly to the referenced directory, causing a potential location clash. The newer versions introduced the usage of a container directory inside the tablespace location, named after the major version and the system catalogue release.

postgres@tardis:~$ ls -l /var/lib/postgresql/pg_tbs/ts_test
total 0
drwx------ 2 postgres postgres 6 Jun  9 13:01 PG_9.3_201306121
The tablespace's directory container is structured this way
PG_{MAJOR_VERSION}_{CATALOGUE_VERSION_NUMBER}

The major version is the PostgreSQL version truncated to the second digit and the catalogue's version number is the same shown in the pg_controldata output, a formatted date.
postgres@tardis:~$ export PGDATA=/var/lib/postgresql/9.3/main
postgres@tardis:~$ /usr/lib/postgresql/9.3/bin/pg_controldata 
pg_control version number:            937
Catalog version number:               201306121
Database system identifier:           5992975355079285751
Database cluster state:               in production
pg_control last modified:             Mon 09 Jun 2014 13:05:14 UTC
.
.
.
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
The container directory has the same structure as base, seen in 4.6, with only the subdirectories for the databases having relations on the tablespace.

To get all the databases with objects on a given tablespace there is the function pg_tablespace_databases(tablespace_oid), which returns the set of the database OIDs with objects on the specified tablespace. In order to have a better picture, a join with the pg_database system table is a good idea.
For example this query combines the CASE construct with the pg_tablespace_databases function.
 db_test=# 
 SELECT
        datname,
        spcname,
        CASE
                WHEN 
                                pg_tablespace_location(tbsoid)=''
                        AND     spcname='pg_default'
                THEN
                        current_setting('data_directory')||'/base/'
                WHEN 
                                pg_tablespace_location(tbsoid)=''
                        AND     spcname='pg_global'
                THEN
                        current_setting('data_directory')||'/global/'
        ELSE
                pg_tablespace_location(tbsoid)
        END
        AS      spclocation
FROM
        pg_database dat,
        (
                SELECT
                        oid as tbsoid,
                        pg_tablespace_databases(oid) as datoid,
                        spcname 
                FROM 
                        pg_tablespace where spcname='ts_test'
        ) tbs
WHERE
        dat.oid=tbs.datoid
;
 datname | spcname |            spclocation             
---------+---------+------------------------------------
 db_test | ts_test | /var/lib/postgresql/pg_tbs/ts_test
(1 row)
Moving a tablespace to another physical location is not complicated; the cluster of course has to be shut down, see 4.3 for more information.

After the cluster is stopped the container directory can be copied to the new location; the directory's access permissions must be the same as the origin, read and write for the OS user running the postgresql process only, otherwise the cluster will refuse to start.

When the copy is complete and the symbolic link in $PGDATA/pg_tblspc is fixed to point to the new location, the cluster can be started as shown in 4.2.

Friday 6 June 2014

Chapter 6 part 4 - TOAST

TOAST

The oversize attribute storage technique is the PostgreSQL implementation for the data overflowing the page size.

The user data shown in figure 6.3 is a stream of composite data. Actually the data itself is logically described by the composite model stored in the system catalogue. The attributes in the model can be grouped in two categories: the fixed length and the variable length data types (varlena).

For example, a four byte integer is a fixed length type and a text is a variable length one. For PostgreSQL's internal routines the data at physical level all appears the same, as a generic datum. When the datum is loaded into the shared buffer it becomes meaningful and is managed according to its kind.

The attribute kind is stored in the first two bits6.1 of the varlena length word. When both bits are zero then the attribute is a fixed length data type and the remaining bits give the datum size in bytes including the length word.

If the first bit is set then the value has only a single-byte header and the remaining bits describe the total datum size in bytes including the length byte. If the remaining bits are all zero, then the value is a pointer to out of line data stored in a separate TOAST table, whose structure is shown in figure 6.4.

If the first bit is zero but the second bit is set then the datum is compressed and must be decompressed before use. The compression uses an LZ family algorithm.

The external TOAST table is a normal table with three fields: the chunk_id, the OID of the toasted data; the chunk_seq, an integer ordering the chunks within the value; and the chunk_data, a bytea field containing the overflowing data.

The chunk size is normally 2k and is controlled at compile time by the symbol TOAST_MAX_CHUNK_SIZE. The TOAST code is triggered by the value TOAST_TUPLE_THRESHOLD, also 2k by default. When the tuple's size is bigger, the TOAST routines are triggered.
The TOAST_TUPLE_TARGET, normally 2k as well, governs the compression's behaviour. PostgreSQL will compress the datum to achieve a final size smaller than TOAST_TUPLE_TARGET. If it cannot, the out of line storage is used.



Figure 6.4: Toast table structure
TOAST offers four different storage strategies. The strategy can be changed per column using the ALTER TABLE SET STORAGE statement, as sketched after the list.

  • PLAIN prevents either compression or out-of-line storage; it's the only strategy possible for the fixed length data types.
  • EXTENDED allows both compression and out-of-line storage. It is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.
  • EXTERNAL allows out-of-line storage but not compression.
  • MAIN allows compression but not out-of-line storage. Actually the out-of-line storage is still performed as a last resort.
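A minimal sketch of a per column change, assuming a hypothetical table t_example with a text column t_value.

 -- hypothetical table and column names
 postgres=# ALTER TABLE t_example ALTER COLUMN t_value SET STORAGE EXTERNAL;
 ALTER TABLE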
The out of line storage, alongside the reduced performance, has the advantage of leaving the stored data, if untouched by the update, out of the new row version's generation. That's possible because the varlena is a mere pointer to the chunks, and a new row version will affect only the pointer, leaving the TOAST data unchanged.

The TOAST tables are stored, like all the other relations, in the pg_class table; the associated table can be found using a self join on the field reltoastrelid.
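A possible lookup, sketched here for the hypothetical t_example table.

 postgres=# 
 SELECT
        tab.relname,
        tst.relname AS toast_relname
 FROM
        pg_class tab
        JOIN pg_class tst ON tab.reltoastrelid = tst.oid
 WHERE
        tab.relname='t_example';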


6.1 On the big-endian architecture those are the high-order bits; on the little-endian those are the low-order bits

Tuesday 3 June 2014

Chapter 6 part 3 - Tuples

Tuples

As seen in 5.7, when updating the rows PostgreSQL generates new row versions, stamping the old ones as dead.
Each tuple comes with a fixed header of system columns, usually 23 bytes, as shown in figure 6.3.


Figure 6.3: Tuple structure



The two fields t_xmin and t_xmax are stamped with the operation's transaction id respectively at the tuple's insert and delete.

The field t_cid is a ``virtual'' field used for both cmin and cmax. This is possible because the command id has meaning only inside a transaction.

The field t_xvac is used by VACUUM when moving the rows; according to the source code's comments in src/include/access/htup_details.h this is used only by the old style VACUUM FULL.

The t_ctid is the tuple's location id, a pair of values pointing to the page number and the tuple's index. When a new tuple is created t_ctid points to the actual row; when the tuple is updated this value changes to the location of the newer tuple's version.

A tuple is then recognised as the last version when xmax is invalid or t_ctid points to itself. If xmax is also valid then the tuple is the last locked or deleted version of the tuple's version chain.

The two infomask fields are used to store various flags, like the presence of the tuple OID or whether the tuple has NULL values.

The last field, t_hoff, marks the offset to the composite data, the actual tuple's data. The data usually starts right after the fixed header, unless the table has NULLable fields or is created WITH OIDS; in that case the NULL bitmap and the OID are placed just after the fixed header. The bitmap begins just after the fixed header and occupies enough bytes to have one bit per data column. The OID comes after and is 4 bytes long.

Sunday 1 June 2014

Chapter 6 part 2 - Pages

Pages

The datafiles are organised as arrays of fixed length elements called pages. The default size is 8k. Pages of a table's datafile are called heap pages, to distinguish them from the index pages, which differ from the former only by the special space present at the page's end. Figure 6.1 shows an index page structure. The special space is a small area at the page's end used to track the index structure. For example a B-tree index stores in the special space the pointers to the leaf pages.




Figure 6.1: Index page
The page starts with a 24 bytes long header, followed by the item pointers, each usually 4 bytes long. The actual items, the tuples, are stored at the page's bottom.

The item pointers form an array of pairs, offset and length, pointing to the actual tuples at the page's bottom. The tuples are stored starting from the page's bottom and going backwards, filling up the available free space.

The header contains the page's generic space management information, as shown in figure 6.2.




Figure 6.2: Page header
  • pd_lsn identifies the xlog record for last page's change. The buffer manager uses the LSN for WAL enforcement. A dirty buffer is not dumped to the disk until the xlog has been flushed at least as far as the page's LSN.
  • pd_checksum stores the page's checksum if enabled, otherwise this field remains unused
  • pd_flags used to store the page's flags
  • pd_lower offset to the start of the free space
  • pd_upper offset to the end of the free space
  • pd_special offset to the start of the special space
  • pd_pagesize_version page size and page version packed together in a single field.
  • pd_prune_xid is a hint field to help determine whether pruning is useful. It's used only on the heap pages.
The pd_checksum field substitutes the pd_tli field present in the page header up to PostgreSQL 9.2, which was used to track the xlog records across the timeline id. The page's checksum can be enabled only when the data area is initialised with initdb and cannot be disabled later.


The offset fields, pd_lower, pd_upper and the optional pd_special, are 2 bytes long; this means PostgreSQL can only support pages up to 32KB.


The page version was introduced with PostgreSQL 7.3. For the prior releases the page version is arbitrarily considered 0; PostgreSQL 7.3 and 7.4 used page version 1; PostgreSQL 8.0 used version 2; PostgreSQL 8.1 and 8.2 used version 3; from PostgreSQL 8.3 the page version number is 4.