Sunday 2 September 2012

Don't let your improbability grow to infinite

After three posts on self celebrating, best practice and scary things now is the time to start talking about the DBA stuff.As you should already discovered this blog is not for beginners. I think there's too much howtos for basic things like installation and table creation, all scattered around the internet.
My intention is explore the unknown topics and try to give a clear solution for any possible problem.
This post will cover the database housekeeping, things to keep under your pillow to let the database run efficiently and let you sleep without worries.

All the section's titles are from Douglas Adams The Hitchhikers Guide To The Galaxy, no copyright infringement intended. I used it because I love it and I never forget my towel. And by the way, THE ANSWER IS 42


Time is an illusion. Lunchtime doubly so.

PostgreSQL comes with a powerful feature called MVCC which stands for Multi Version Concurrency Control. The database track, for each transaction, which data is visible and which not simply assigning a trasaction id, XID, when a new tuple is inserted and when the same tuple is deleted. Each tuple carries two system columns, xmin and xmax, used to determine the visibility in a simple way. Anything greater than the current transaction id is in the future and then invisible. This works absolutely good except one big caveat. The XID is a 4 byte integer and every 4 billions transactions wraps.In the early database versions this behaviour forced the DBA to initdb a new data directory and reload the data from a pg_dump every 4 billions transactions. The risk was the sudden disappearance of all data at XID wrap.
To avoid this a new comparison method was adopted in later versions. The modulo-2^31 method guarantee, for each integer, 2 billions are greater, and then in the future, and 2 billions are lesser, then in the past. A new special XID called frozenXID which is in the past for any real XID was introduced to store in a safe condition the tuples with old transactions id.

The VACUUM  when finds a tuple which XID age is greater than the configuration parameter vacuum_freeze_min_age freezes the xmin value keeping it safe from the wraparound danger.
When one of the databases becomes dangerously near to the wraparound failure the server start emitting messages in the log, when this become too near to the wraparound the system will shut down and refuse to start any new transaction. The limit is 1 million transactions and is valid for each database in the cluster.
So, to avoid any problem the periodic vacuum must include all databases, even template1 and postgres.



mostly harmless

Indexes are a wonderful thing, they order data and give the direct access to the data block without need to read the entire table.Most people believe an index creation is a good thing, even if the index is not used. That's wrong as useless indexes put overhead in the table activity and the query planner can be confused by their presence.

An index block read require the so called random access on disk which, by default, its cost is 4 times a corresponding sequential access. An index block, up to the version 9.1, does not carry the data but only the pointer to the heap block where the data is stored.

Any index read consists in at least two disk read as one is for the index block and another one for the data block. Smaller tables does not use indexes because is quicker a sequential scan to get the data in the shared buffer. Same for queries with no where condition as the entire table must be loaded in memory to satisfy the backend's request.

Any update to the tuple is, because the MVCC, a new insert with a new xmin and this does not affect the index if the new tuple resides in same block but, if the new tuple is stored in a new block the index require to be updated and this generate dead tuples in the index block itself.
Index blocks with dead and live tuples prevent vacuum to recycle the block and cause the index bloat, wasting space on disk and affecting the index performance.

The only solution to fix this is doing an index rebuild with reindex, usually not a cheap operation.
Useless indexes can affect the query planner itself as wrong performance assumption is done.
Being a balanced data structure, assuming we're talking about the btree index type, the scan require an entry point determined by the where condition, if this is covered by the index only partially the scan will cost more than expected and the query become slower than expected as the entry point cannot be determined on the root block.

Sometimes slow queries execution plans show one or more index scan.
This is can happen if the index is used with a wrong the entry point in the where condition.
To fix this a new index according the where condition can solve the problem but in that case keep in mind the problem caused by the overhead.
To check the index usage the system table pg_stat_all_indexes show the index name, the schema name and three columns


idx_scan

idx_tup_read

idx_tup_fetch
Respectively, the number of index scan initiated, the number of index block read for the index and the live tuple read with the index.
Zero or lower values in the idx_scan column should alert you about the index usage and let you ask why this index was created.
High difference between idx_tup_read and idx_tup_fetch will show the index is actually used but no data is returned, probably because the where condition is too much stringent.
To solve this problems does require deep knowledge about the application itself and probably a meeting with the developers to find a better solution.


The Ravenous Bugblatter Beast of Traal

To avoid problems in the query planner and before doing any sort of analysis in performance tuning the first question to ask is:are the database statistics up to date?
The database statistics are one of the most important things to keep after. Out of date statistics can result in unpredictable behaviour in the query planner with subsequent performance downgrade.
To explain this I'll show you a little example.
An absolute minimal table where the transactions are stored can have a transaction id, the product code and the transaction timestamp.

CREATE TABLE 
        t_transact
        (
           i_id_trn bigserial,
           v_prod_code character varying (100),
           d_trn_datetime timestamp with time zone DEFAULT current_timestamp,
           CONSTRAINT pk_id_trn PRIMARY KEY (i_id_trn)
        )
;
To speed up the read on the d_trn_datetime field let's create a new index.


CREATE INDEX idx_trn_datetime
        ON t_transact
        USING btree
        (d_trn_datetime )
;

Now let's insert ten records in this table.


INSERT INTO 
    t_transact
    (v_prod_code)

SELECT
     v_prod_code
FROM
     (
         SELECT
             generate_series(1,10),
             'tst_prod' as v_prod_code
     ) sl_prod

;
and check the execution plan used by the following query


SELECT 
        i_id_trn,
        v_prod_code,
        d_trn_datetime 
FROM t_transact WHERE d_trn_datetime='2012-09-02 15:42:20.447888+01'::timestamp with time zone ; "Bitmap Heap Scan on t_transact (cost=4.27..9.61 rows=2 width=234) (actual time=0.070..0.134 rows=10 loops=1)" " Recheck Cond: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)" " -> Bitmap Index Scan on idx_trn_datetime (cost=0.00..4.27 rows=2 width=0) (actual time=0.043..0.043 rows=10 loops=1)" " Index Cond: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)" "Total runtime: 0.333 ms"

The planner choose a bitmap scan on the index to retrieve the requested rows, assuming the index is cheaper than the table scan.
After running an analyze on the table t_transact this is the new execution plan


"Seq Scan on t_transact  (cost=0.00..1.12 rows=10 width=25) (actual time=0.021..0.082 rows=10 loops=1)"
"  Filter: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"Total runtime: 0.198 ms"

Obviously the table with 10 rows consist in one single 8k block and the sequential scan is the fastest way to get the data. Without statistics the database did the wrong assumption the index is a better choice resulting in a query 40% slower as two disk access are required.
In the same way, if the table grow big and the database doesn't know the sequential scan will result in slower queries.


DON'T PANIC!

Reviewing the previous sections here some house keeping suggestions
  • run vacuum full every 2 billion transactions on any database in the cluster
  • check the index usage and do reindex if tables are updated often
  • check the statistics are up to date to avoid planner confusion