Sunday, 2 September 2012

Don't let your improbability grow to infinity

After three posts on self-celebration, best practices and scary things, it's time to start talking about the DBA stuff. As you should have already discovered, this blog is not for beginners. I think there are already too many howtos for basic things like installation and table creation, scattered all around the internet.
My intention is to explore the less known topics and try to give a clear solution for any possible problem.
This post covers database housekeeping, the things to keep under your pillow to let the database run efficiently and let you sleep without worries.

All the section titles are from Douglas Adams' The Hitchhiker's Guide to the Galaxy, no copyright infringement intended. I use them because I love the book and I never forget my towel. And by the way, THE ANSWER IS 42.


Time is an illusion. Lunchtime doubly so.

PostgreSQL comes with a powerful feature called MVCC, which stands for Multi Version Concurrency Control. The database tracks, for each transaction, which data is visible and which is not, simply by assigning a transaction id, the XID, when a new tuple is inserted and when the same tuple is deleted. Each tuple carries two system columns, xmin and xmax, used to determine the visibility in a simple way: anything greater than the current transaction id is in the future and therefore invisible. This works absolutely fine, except for one big caveat. The XID is a 4 byte integer and wraps around every 4 billion transactions. In the early database versions this behaviour forced the DBA to initdb a new data directory and reload the data from a pg_dump every 4 billion transactions; the risk was the sudden disappearance of all data at XID wraparound.
To avoid this, a new comparison method was adopted in later versions. The modulo-2^31 arithmetic guarantees that, for each XID, about 2 billion XIDs are greater, and therefore in the future, and about 2 billion are lesser, and therefore in the past. A special XID, the FrozenXID, which is in the past for any normal XID, was introduced to put tuples with old transaction ids in a safe condition.
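A quick way to see how far each database is from the wraparound point is to look at the age of its datfrozenxid; a minimal sketch on the pg_database catalogue:

-- age() tells how many transactions have elapsed since datfrozenxid;
-- the closer this gets to ~2 billion, the closer the database is to wraparound
SELECT
        datname,
        age(datfrozenxid) AS xid_age
FROM
        pg_database
ORDER BY
        xid_age DESC
;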

When VACUUM finds a tuple whose xmin age is greater than the configuration parameter vacuum_freeze_min_age, it freezes the xmin value, keeping it safe from the wraparound danger.
When one of the databases gets dangerously close to the wraparound failure the server starts emitting messages in the log; when it gets too close the system shuts down and refuses to start any new transaction. The safety margin is 1 million transactions and applies to each database in the cluster.
So, to avoid any problem, the periodic vacuum must include all databases, even template1 and postgres.
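As a sketch, the housekeeping boils down to connecting to each database, template1 and postgres included, and running a plain vacuum; the vacuumdb command line wrapper with its --all switch automates the same loop.

-- run this while connected to each database of the cluster,
-- template1 and postgres included
VACUUM;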



mostly harmless

Indexes are a wonderful thing: they order data and give direct access to the data blocks without the need to read the entire table. Most people believe that creating an index is always a good thing, even if the index is never used. That's wrong: useless indexes add overhead to the table activity and their presence can confuse the query planner.

An index block read requires the so called random access on disk which, by default, costs 4 times the corresponding sequential access. An index block, up to version 9.1, does not carry the data but only the pointer to the heap block where the data is stored.
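The ratio comes from the planner cost settings; by default random_page_cost is 4 and seq_page_cost is 1, as a quick check on pg_settings shows:

-- planner cost estimates for a sequential and a random page fetch
SELECT
        name,
        setting
FROM
        pg_settings
WHERE
        name IN ('seq_page_cost', 'random_page_cost')
;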

Any index read consists of at least two disk reads, one for the index block and another for the data block. Small tables do not use indexes because a sequential scan is quicker to get the data into the shared buffer. The same goes for queries with no where condition, as the entire table must be loaded in memory to satisfy the backend's request.

Because of MVCC, any update to a tuple is actually a new insert with a new xmin. This does not affect the index if the new tuple version resides in the same block but, if the new tuple is stored in a different block, the index needs to be updated and this generates dead tuples in the index blocks themselves.
Index blocks holding a mix of dead and live tuples cannot be recycled by vacuum and cause index bloat, wasting space on disk and affecting the index performance.

The only way to fix this is an index rebuild with reindex, usually not a cheap operation.
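As a sketch, the rebuild is a single statement, on one index or on a whole table; the names here are purely hypothetical, and keep in mind that reindex blocks writes on the table while it runs.

-- rebuild a single bloated index (hypothetical name)
REINDEX INDEX idx_some_bloated_index;
-- or rebuild every index on a table (hypothetical name)
REINDEX TABLE t_some_table;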
Useless indexes can also mislead the query planner, which makes wrong performance assumptions based on them.
Being a balanced data structure, assuming we're talking about the btree index type, the scan requires an entry point determined by the where condition; if the condition is only partially covered by the index the entry point cannot be determined on the root block, the scan costs more than expected and the query becomes slower than expected.

Sometimes the execution plans of slow queries show one or more index scans.
This can happen if the index is used with a wrong entry point in the where condition.
A new index matching the where condition can solve the problem but, in that case, keep in mind the overhead it adds.
To check the index usage the system view pg_stat_all_indexes shows the index name, the schema name and three counters:


  • idx_scan - the number of index scans initiated on the index
  • idx_tup_read - the number of index entries returned by scans on the index
  • idx_tup_fetch - the number of live table rows fetched by simple index scans using the index
Zero or very low values in the idx_scan column should alert you about the index usage and make you ask why that index was created in the first place.
A big difference between idx_tup_read and idx_tup_fetch shows that the index is actually used but little data is returned, probably because the where condition is too stringent.
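A minimal query to spot the suspects, filtering out the system catalogues, could be something like this; the interpretation, as always, depends on the workload:

-- indexes ordered by how rarely they are scanned
SELECT
        schemaname,
        relname,
        indexrelname,
        idx_scan,
        idx_tup_read,
        idx_tup_fetch
FROM
        pg_stat_all_indexes
WHERE
        schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY
        idx_scan ASC,
        idx_tup_read DESC
;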
Solving these problems requires deep knowledge of the application itself and probably a meeting with the developers to find a better solution.


The Ravenous Bugblatter Beast of Traal

To avoid problems with the query planner, and before doing any sort of performance tuning analysis, the first question to ask is: are the database statistics up to date?
The database statistics are one of the most important things to look after. Out of date statistics can result in unpredictable behaviour of the query planner, with a subsequent performance downgrade.
To explain this I'll show you a little example.
An absolutely minimal table to store transactions can have a transaction id, the product code and the transaction timestamp.

CREATE TABLE 
        t_transact
        (
           i_id_trn bigserial,
           v_prod_code character varying (100),
           d_trn_datetime timestamp with time zone DEFAULT current_timestamp,
           CONSTRAINT pk_id_trn PRIMARY KEY (i_id_trn)
        )
;
To speed up reads on the d_trn_datetime field, let's create a new index.


CREATE INDEX idx_trn_datetime
        ON t_transact
        USING btree
        (d_trn_datetime )
;

Now let's insert ten records in this table.


INSERT INTO 
    t_transact
    (v_prod_code)

SELECT
     v_prod_code
FROM
     (
         SELECT
             generate_series(1,10),
             'tst_prod' as v_prod_code
     ) sl_prod

;
and check, with EXPLAIN ANALYZE, the execution plan of the following query


SELECT 
        i_id_trn,
        v_prod_code,
        d_trn_datetime 
FROM 
        t_transact 
WHERE 
        d_trn_datetime='2012-09-02 15:42:20.447888+01'::timestamp with time zone 
;

"Bitmap Heap Scan on t_transact  (cost=4.27..9.61 rows=2 width=234) (actual time=0.070..0.134 rows=10 loops=1)"
"  Recheck Cond: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"  ->  Bitmap Index Scan on idx_trn_datetime  (cost=0.00..4.27 rows=2 width=0) (actual time=0.043..0.043 rows=10 loops=1)"
"        Index Cond: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"Total runtime: 0.333 ms"

The planner chose a bitmap scan on the index to retrieve the requested rows, assuming the index to be cheaper than a table scan.
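Refreshing the statistics on the example table is a single statement:

-- collect fresh planner statistics for the example table
ANALYZE t_transact;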
After running the analyze on t_transact this is the new execution plan


"Seq Scan on t_transact  (cost=0.00..1.12 rows=10 width=25) (actual time=0.021..0.082 rows=10 loops=1)"
"  Filter: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"Total runtime: 0.198 ms"

Obviously a table with 10 rows fits in one single 8k block, and a sequential scan is the fastest way to get the data. Without statistics the database made the wrong assumption that the index was a better choice, resulting in a query roughly 40% slower because two disk accesses were required.
In the same way, if the table grows big and the database doesn't know it, sticking to a sequential scan will result in slower queries.


DON'T PANIC!

Reviewing the previous sections, here are some housekeeping suggestions
  • run a database-wide vacuum on every database in the cluster, template1 and postgres included, well before the 2 billion transaction limit
  • check the index usage and reindex the bloated indexes if the tables are updated often
  • check that the statistics are up to date to avoid planner confusion


Saturday, 25 August 2012

Danger Master! Danger!


to configure the PITR you should save the $PGDATA with tar or in alternative you can use pg_dumpall
PostgreSQL trainer describing the point in time recovery

An oid type can carry on 2Gb
PostgreSQL guru explaining the binary data type

Maybe you can use pg_resetxlog to solve your problem
A clueless user trying to solve a major version incompatibility data directory issue


Danger, man at work

This is a small excerpt of what I've seen browsing the technical forums and community websites.
My first reaction to all this nonsense was anger, because these guys have a good reputation in the PostgreSQL community and their ignorance can cause data loss, damage and, ultimately, harm to PostgreSQL.
Obviously it's impossible to cover everything, so I'm writing this post as a warning: do not trust people on reputation alone. You should investigate any suggested solution and absolutely DO NOT TRY IT ON PRODUCTION, NEVER!

Is it an airplane? Is it a UFO? It's a dumb a**

The point in time recovery is a wonderful feature, introduced in version 8.0, which enables the capability to restore the entire database cluster to the exact date and time you want.
In this post I'm not describing how to configure it, as the documentation is well written. I'll show how it works, to make clear the danger in the trainer's quote.

The PITR is a physical backup where you actually copy the data files while the database is running. You can do it safely because the write ahead logs generated during the copy are archived and all the changes can be applied to the inconsistent data files later, when you need to restore the cluster.
This is possible because the blocks stored in the archived WAL carry the information about where the block is located on disk. In the restore process the physical data block in the data file is overwritten by the WAL data block during the startup process, when the database performs the instance recovery.
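Just to sketch the mechanics, and assuming WAL archiving is already configured, the hot copy of the data files is bracketed by two function calls; the label and the copy method are arbitrary:

-- mark the start of the base backup; the label is arbitrary
SELECT pg_start_backup('base_backup_example');
-- now copy the whole data directory with tar, rsync or similar
-- then tell the cluster the copy is done, forcing the current WAL segment to be archived
SELECT pg_stop_backup();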
pg_dumpall is a logical backup wrapper for the general purpose pg_dump. If you look at pg_dump's source code you will see a collection of SELECT commands within a strict SERIALIZABLE transaction isolation level.

The result is an output file with the SQL commands to run against a different cluster to recreate the databases, the tables and the data.

The restore with this tool requires an empty cluster, generated with initdb, up and running. pg_dumpall does not perform an instance recovery, it needs a fully running instance, which means all the archived WAL files are useless, as they are usable only during a recovery.

So, if you follow the procedure suggested by the trainer and choose pg_dumpall, you will only delude yourself into thinking you have PITR, because you are using a logical tool to restore a physical backup, and you will lose any database change between your pg_dumpall and the moment the live database server was lost.


OID is not a TARDIS

PostgreSQL has the bytea built in data type for binary objects. As it is a TOASTable type with variable length, the maximum allowed size is 1Gb. Binary data stored this way lives in line like any other table column and you can read and write the content using conventional SQL statements.
If the binary data doesn't fit in 1 Gb, PostgreSQL allows another strategy to store up to 2 Gb as a large object.
The built in functions lo_import and lo_export can read and write files from the disk and store the contents in a system table called pg_largeobject.
This table has three columns:

  • loid TYPE oid
  • pageno TYPE integer
  • data TYPE bytea

When the function lo_import is called, PostgreSQL stores the binary data in pg_largeobject, splitting it into chunks stored as rows that share the same loid value with an increasing pageno. The function returns the OID value to reference the stored large object.
The function lo_export does the reverse: it reads the binary chunks for the given OID and writes the binary stream to a file on disk.
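A minimal sketch, with purely illustrative server side paths and OID:

-- read a file from the server's filesystem and store it as a large object,
-- returning the OID that references it
SELECT lo_import('/tmp/product_image.png');
-- write the large object identified by that OID back to a file on the server
SELECT lo_export(16401, '/tmp/product_image_copy.png');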
So the 2 Gb limit belongs to the large object itself; the oid is just the reference to it. The second quote shows how little attention and basic logical deduction our PostgreSQL guru applied.


I'll burn my house to light a candle

For the three seconds after reading this quote I remained speechless: the suggestion is probably the most dangerous, and by far the most brutish, thing you can do to your data directory.
I discovered this idiocy browsing the forum, probably too late to avoid the disaster as the post was months old.
As any PostgreSQL DBA should know, the WAL files are used to save the data block changes on non volatile memory before the write actually reaches the data file, so that, if any crash happens, the data blocks can be applied to the data files by reading from the WAL.
The process is the same used for PITR; the only difference is that no archived WAL is used, only the files in pg_xlog are used to replay the blocks on the data files.
The global directory contains an 8k file called pg_control where the last checkpoint location is stored, telling the database which WAL segment to start the replay from during the instance recovery.
Any corruption in the pg_control file or in the WAL files results in an instance unable to start. If any, and I repeat any, other solution fails, it is still possible to start the instance using the program pg_resetxlog.
This program does one simple thing: it wipes out everything in pg_xlog and generates a new pg_control with freshly estimated values.
The instance can start again, but any consistency boundary between the data blocks and the WAL is lost and any attempt to run read write queries can result in data corruption.
The online manual is absolutely clear on this.
After running pg_resetxlog the database must be started without user access, the entire content must be dumped using pg_dumpall, the data directory must be dropped and recreated from scratch using initdb, and then the dump file restored using psql.
After the scary part let me explain the problem, a common one indeed, and the solution.
If you use PostgreSQL as packaged with the most common Linux distributions, it can happen that, when you do a distribution upgrade, PostgreSQL jumps a major release, for example from 8.4 to 9.1.
When this happens the new binaries refuse to start on the data directory initialised with the previous major version.
The solution is quite simple. You can compile a matching PostgreSQL version, start the instance and dump the contents to restore into the new major version's data directory or, if you have an enterprise-sized amount of data and you are in a hurry to get back online, you can use pg_upgrade, a powerful tool for doing an in place upgrade.
I hope these three examples have scared you enough to test and check the documentation before doing anything.
Cheers.