Wednesday 11 June 2014

Chapter 6 - part 6, MVCC

MVCC

Multiversion concurrency control (MVCC) is the access method PostgreSQL uses to provide the transactional model seen in 5.7.
At the logical level this is completely transparent to the user: new row versions become visible after the commit, according to the transaction isolation level.

At the physical level, each new row version has the inserting transaction's XID stamped into its t_xmin field. PostgreSQL's internal semantics make visible only the committed rows stamped with an XID lower than the current transaction's XID, because they are considered in the past. Rows with an XID greater than the current transaction's XID are considered in the future and are therefore invisible.

Because the XID is a 32-bit quantity, it wraps at about 4 billion. When this happens, in theory all the tuples should suddenly disappear, because they switch from the XID's past to its future. This is the XID wraparound failure, a serious problem for the older PostgreSQL versions, where the only fix was to initialise a new data area every 4 billion transactions and then dump and reload the databases.
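The failure described above can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, commit status and reserved XIDs are ignored, and the comparison is the plain integer one described in the text.

```python
# Illustrative sketch only: hypothetical names, not PostgreSQL source code.
XID_MODULUS = 2 ** 32  # the XID is a 32-bit counter


def next_xid(xid: int) -> int:
    """Advance the counter, wrapping to 0 after 2**32 - 1.

    Assumption: reserved XIDs and commit status are ignored here.
    """
    return (xid + 1) % XID_MODULUS


def naive_in_past(t_xmin: int, current_xid: int) -> bool:
    """Plain integer comparison: a lower XID counts as 'in the past'."""
    return t_xmin < current_xid


old_row_xmin = 4_294_967_000
current_xid = XID_MODULUS - 1          # last value before the wrap
visible_before = naive_in_past(old_row_xmin, current_xid)

current_xid = next_xid(current_xid)    # the counter wraps to 0
visible_after = naive_in_past(old_row_xmin, current_xid)

print(visible_before, visible_after)   # True False
```

The row is untouched, yet a single increment of the counter moves it from the past to the future.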

PostgreSQL 7.2 introduced modulo-2^32 arithmetic for evaluating the XID age, so that any given XID has 2 billion transactions in its future and 2 billion transactions in its past. It also introduced a special XID, the FrozenXID, which is always considered to be in the past.
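A minimal Python sketch of this circular comparison follows. It mimics a signed 32-bit subtraction, which is how the comparison is usually described. The helper names are hypothetical. The value 2 for the FrozenXID and 3 for the first normal XID are stated here as assumptions about the PostgreSQL sources.

```python
# Illustrative sketch only: hypothetical helpers, not PostgreSQL source code.
FROZEN_XID = 2         # assumption: value of the special frozen XID
FIRST_NORMAL_XID = 3   # assumption: XIDs below this value are reserved


def to_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 2 ** 32 if value >= 2 ** 31 else value


def xid_precedes(a: int, b: int) -> bool:
    """Return True when XID a is in the past of XID b."""
    if a < FIRST_NORMAL_XID or b < FIRST_NORMAL_XID:
        # Special XIDs are compared as plain integers, so the
        # frozen XID is older than every normal XID.
        return a < b
    # Normal XIDs are compared on a circle of 2**32 values.
    return to_int32(a - b) < 0
```

With this rule xid_precedes(4294967000, 5) is true: after the counter has wrapped, the row stamped just before the wrap is still in the past of XID 5. The frozen XID precedes any normal XID regardless of where the counter is.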

When the stamped t_xmin becomes old enough, VACUUM can freeze the tuple by stamping it with the FrozenXID, which protects it from disappearing. The pg_class and pg_database tables each have a dedicated field that tracks the oldest tuple inside the relation and the database, respectively relfrozenxid and datfrozenxid, where the value of the oldest non-frozen XID is stored. The built-in function age() shows how many transactions lie between the current XID and the value stored in the system catalogue.

For example, this query returns all the databases with their datfrozenxid and its age.
 postgres=# 
        SELECT 
                datname,
                age(datfrozenxid),
                datfrozenxid 
        FROM 
                pg_database;
    datname    | age  | datfrozenxid 
---------------+------+--------------
 template1     | 4211 |          679
 template0     | 4211 |          679
 postgres      | 4211 |          679
 db_test       | 4211 |          679
The datfrozenxid value is meaningful only through the age function, which shows the ``distance'' between the current XID and the datfrozenxid. PostgreSQL assigns a new XID only to write transactions, and only when they actually modify tuples; this is the so-called ``lazy XID assignment''.
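The ``distance'' can be sketched in Python as a subtraction modulo 2^32. The function name xid_age is hypothetical and the sketch covers normal XIDs only. The current XID of 4890 is not shown in the listing; it is inferred here from 679 + 4211.

```python
# Illustrative sketch only: a hypothetical stand-in for age() on normal XIDs.
def to_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 2 ** 32 if value >= 2 ** 31 else value


def xid_age(current_xid: int, frozen_xid: int) -> int:
    """Number of transactions between frozen_xid and current_xid."""
    return to_int32(current_xid - frozen_xid)


# Values from the listing: datfrozenxid 679; current XID inferred as 4890.
print(xid_age(4890, 679))           # 4211

# The distance stays small and positive across the wrap of the counter.
print(xid_age(10, 4_294_967_290))   # 16
```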

When a tuple's XID becomes older than 2 billion transactions, the tuple simply disappears, jumping from the current XID's past to its future. Before version 8.0 there was no protection against this problem except a periodic cluster-wide VACUUM. Later versions introduced a passive protection mechanism that emits messages in the activity log when the age of datfrozenxid is ten million transactions from the wraparound point.
WARNING:  database "test_db" must be vacuumed within 152405486 transactions
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in "test_db".
Another protection, this time an active one, is the autovacuum daemon, which takes care of the affected tables and starts a VACUUM to freeze the tuples even if autovacuum is turned off. However, if something goes wrong and the datfrozenxid reaches one million transactions from the wraparound point, the cluster shuts down and refuses to start any new transaction. When this happens, the cluster can only be started as a single-user backend to execute the VACUUM.
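The two safety margins can be summarised in a small Python sketch. The thresholds are the ones quoted in the text; the real limits may differ between PostgreSQL versions. The function name, the returned labels and the choice of 2^31 as the wraparound point are assumptions made for illustration.

```python
# Illustrative sketch only: thresholds as quoted in the text,
# names and labels are hypothetical.
WRAPAROUND_AGE = 2 ** 31     # assumption: age at which wraparound occurs
WARN_MARGIN = 10_000_000     # warnings start this close to wraparound
STOP_MARGIN = 1_000_000      # new transactions are refused this close


def wraparound_state(datfrozenxid_age: int) -> str:
    """Classify a database by the age of its datfrozenxid."""
    remaining = WRAPAROUND_AGE - datfrozenxid_age
    if remaining <= STOP_MARGIN:
        return "refuse new transactions"
    if remaining <= WARN_MARGIN:
        return "warn"
    return "ok"
```

For the databases listed earlier, with an age of 4211, the function returns "ok". An age within five million transactions of the wraparound point returns "warn".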

To limit the effect of data bloat, unavoidable with this implementation, PostgreSQL has a feature called HOT, which stands for Heap Only Tuples. The RDBMS tries to keep the updated tuples inside the same page, which also avoids updating any index reference, if present. This is possible only if the page has free space available and the update does not change any indexed column. By default PostgreSQL fills up the pages completely when inserting tuples; however, it is possible to reserve a portion of each page for updates with the fillfactor storage parameter. This is the percentage of the page that inserts are allowed to fill. The default value for heap pages is 100, complete packing. For indices it is 70 for the non-leaf pages and 90 for the leaf pages, leaving some space available for the unavoidable updates. A smaller fillfactor results in a bigger table at insert time, but in a lower growth rate when the table is updated.
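The trade-off can be made concrete with some rough arithmetic in Python. This sketch assumes the default 8192-byte page and ignores the page header and the per-tuple overhead. The 100-byte row size and the function names are hypothetical.

```python
# Illustrative sketch only: rough arithmetic, ignoring page and tuple overhead.
PAGE_SIZE = 8192  # assumption: default block size in bytes


def reserved_bytes(fillfactor: int) -> int:
    """Bytes per page left free for updates at the given fillfactor."""
    return PAGE_SIZE * (100 - fillfactor) // 100


def rows_per_page(row_bytes: int, fillfactor: int) -> int:
    """Rows that inserts can place in one page at the given fillfactor."""
    return (PAGE_SIZE * fillfactor // 100) // row_bytes


print(reserved_bytes(100), rows_per_page(100, 100))  # 0 81
print(reserved_bytes(70), rows_per_page(100, 70))    # 2457 57
```

With a fillfactor of 70 the same rows need roughly 40% more pages at insert time, but every page keeps about 2.4 kB free where updated row versions can be stored.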

Finally, if MVCC is not carefully considered at design time, the result can be data bloat and generally poor performance. In chapter 7 we'll see how this happens and how to keep the cluster in efficient condition, or at least how to try.
