PostgreSQL Dba: Concurrently numb

Here we go again. I'm not dead, not yet.
I've just moved home and started a new job and I'll be honest, before this job I've just played with toys.

With this post I want to start a talk over indexes, how good they can be and how a wrong action or assumption can buy you a one way ticket to disaster.
So, let's start with the concurrent clause in the CREATE INDEX statement and a short description of the PostgreSQL's general purpose indexes, the b-trees.

Index physical structure

PostgreSQL implements the b-tree index access using the Lehman and Yao's high-concurrency B-tree management algorithm, to find out an extensive description you can check out the README file in the source directorysrc/backend/access/nbtree/README
The chapter on database physical storage give you a good explain how the data files are organized and how is organized the block structure but doesn't tell too much about the index blocks.
An index block is quite similar to the heap block, with a 24 bytes long page header with references to the corresponding WAL entry and timeline, plus the pointers to the begin of the free space, to the end of the free space and to the beginning of the special space.
After the header resides the item pointers sectionto the tuples stored in the end of the page, usually 4 bytes on 32 bit and 8 bytes on windows and 64 bit architectures.
The last section, the special space, is usualy 16 bytes long and stores the pointers to the related pages inside the tree. Another difference with data files is the presence of the metapage, the first page which stores the pointers to the root page, or pages. With this implementation is possible to have the root page split.

Table and index space overhead

Any item in PostgreSQL have special attributes identified with negative att num in the pg_attribute system table.
According with this structure every item have an extra overhead of 27 bytes even if the table doesn't have columns.
Another difference with the heap pages is the fill factor, fixed to 0.7 for non leaf pages and 0.9, but adjustable, for leaf pages.

The visibility map and index only scans

Up to the version 9.1 the index is used only to track the ordered data pointer to the corresponding heap block.
As the index carries the values in an ordered ans structured tree is theoretically possible to read only the index to get the data if the SELECT contains only for the indexed columns.
Another point to my mantra SELECT * FROM table; IS THE ROOT OF EVIL.
The problem with the index only scan is related to the tuple visibility.
The index inside the transaction can be temporarily out of date for deleted tuples this can produce inconsistent results.
The visibility map is then used to track the visible tuples to keep the index read consistent.
The problem with this approach, before the version 9.2, is that this fork is not crash safe and, in the case of instance crash, could result in an out of date map.
In the version 9.2 visibility map changes is logged on the WAL making it the map crash safe and enabling the index only scan.

Bloating indexes

As any PostgreSQL expert should know UPDATE stands for INSERT.
Everytime an update is performed a new row version is created with xmin set to the current transaction XID and the previous version have the xmax set to the same XID.
This permit the magic of MVCC without the use of the rollback segment as Oracle does.
The problem with this approach is that often updated tables can bloat faster and the last updated tuples will move across the datafiles every time a new block is created to store the newer versions.
When this happens if there's an index referencing the tuple, this must be updated as well with a new row version that points to the new heap block.
If this in a table, that is a sequential not ordered collections of blocks, add the block in the table's physical end, the same for the index require the index to be transversed to find where it is the updated row and then the new block creation must follow the b-tree structure in a very expensive process.
Even worse, if the updated value is in a composite index and is not the first column the index entry point is unknown and the read for the update require a great amout of cpu cycles and disk seek.
Then the cherry on the cake. The vacuum process cannot mark an index block reusable if there inside mixed live and dead tuples.
I think you got the point.
Indexes can bloat with values scattered over multiple blocks where dead tuples slow down the search process. As the optimizer doesn't know about this, an index scan over a bloated index can have horrible performances.
And I didn't mentioned the waste of space...

This is a job for REINDEX

Of course PostgreSQL comes with good maintenance statements like VACUUM or CLUSTER, for the tables, and REINDEX for the indexes.
Reindex builds a new index, updates the system catalogue to point to the new one and drop the old one. Looks simple? Not exactly.

To do this a write exclusive lock must be acquired on the table and this prevents any write action for the time needed to reindex.
The read is not affected as the REINDEX needs only a consistent snapshot to build the new relation.
The index build requires the data to be sorted and if the amount doesn't fit in the work_mem value this will be performed on disk slowing down the entire process.
The exclusive lock have one big advantage, if there's some process blocking the REINDEX to acquire the lock the procedure will not start.
This doesn't happen with the concurrent rebuild (REINDEX thanks Rassilon does not have the CONCURRENT clause) and this can give you an extra free ticket to the previously mentioned disaster.

Concurrently what?

The concurrent build create a new INVALID index immediately visible, then performs a second table scan to update the index with the values changed during the creation, then acquires an exclusive lock over the table to validate the index. The point of failure is the last step. As no check is performed before the index build is started if there's any blocking process, for example pg_dump, the last step will wait for the lock, meanwhile the INVALID index is attached and updated adding overhead to the table's operations.
If you cancel the query the index is not dropped automatically. It stays in place in INVALID state since is dropped.
But the drop requires an exclusive lock on the pg_class table and if the blocking process is still around...
Now, what happens if you create an unique index and during the second table scan the unique condition is violated by transactions not visible when the first scan is started? The index build is aborted, same as before, INVALID index attached to the table.
You see. Do you like aisle or window?