
Thursday, 12 February 2015

PostgreSQL 7.4, the LRU list


In PostgreSQL 7.4 the free space in the shared buffer was managed using a simple least recently used (LRU) list. When a page is first requested from disk the buffer manager puts it into the first buffer of the LRU list. If the first buffer is used by another page the list is shifted by one position and the last page in the list, if any, is discarded. If a page in the list is requested again it is moved back to the first buffer, and so on. This simple approach worked quite well but had some unpleasant effects. After all the buffer is shared, and it is not uncommon to have more than one database on the same cluster, each with a different design and purpose. This interacted badly with the LRU list, causing unnecessary disk IO because the buffers were evicted without considering their usage frequency. Which leads us to the tale of the buffer.

The tale of the buffer

I read this interesting description years ago on Bruce Momjian's website; it explains very well how a minimal change can dramatically degrade the database performance.


Our tale starts with a PostgreSQL 7.4 with 1000 buffers allocated in the shared memory. The page size is the default, 8192 bytes, so we have in total 8192 * 1000 = 8192000 bytes. In the data area we have a fully packed table without indices consuming 1000 pages on disk.

If we run a simple select from the table, PostgreSQL will execute a sequential scan on the table in order to get the pages into memory. The sequential scan is a top to bottom operation, but the shared buffer stores the pages in reverse order. After the initial read, which will take some time because of the disk IO, the table fits completely into the shared buffer with the pages in reverse order. A second select will start another sequential scan. The buffer manager will search for the table's pages in the shared buffer and will find the first table's page in the last slot of the LRU list. This will move the page from the last slot to the first, all the other buffers will shift by one slot, and the second table's page will land in the last LRU slot. The buffer manager will then find the second table's page in memory and move it into the first LRU slot, causing the buffers to shift again. The final result is a faster read because the table is read entirely from memory.

Let's add a new row to our table. Because the table is fully packed this causes a new page to be added, bringing the page count to 1001. A select with the shared buffer empty will take about the same time as before, with one notable exception: when page 1001 is read from disk the first table's page is evicted from the shared buffer to make room. The second select will search for the first table's page and will have to read it from disk, evicting the second page from the shared buffer, which in turn is read from disk a moment later, causing the third page's eviction, and so on. One single extra page spoils the entire cache. This is a corner case of course, but it teaches us an important lesson about the nature of databases. They are discrete systems and much of their behaviour changes at thresholds; usually there is no soft transition between good and bad performance.

Thursday, 20 December 2012

Time and relative arrays in space


Scale what?

A few years ago, when I was self employed as a PostgreSQL consultant, I was appointed a big performance tuning task by a great search engine (no, not THAT search engine).

The application was similar to Google's desktop, a grid of boxes with RSS feeds. Unfortunately the implementation of the two dimensional grid had bad performance issues.

Any position change with a drag and drop, a delete or an add, even in low activity periods, required the consequent database update to run for at least 20 seconds. During peak time the system was unusable, as the PHP scripts hit the execution timeout limit and this caused every transaction to abort.

The simple design, a table with two integer columns mapping the X-Y coordinates plus a unique constraint to avoid duplicates, wasn't scaling, and the problem was caused by the unique constraint itself.

As PostgreSQL 8.4 didn't support deferrable unique constraints, a position change required an update for each row in order to keep the sequence without holes, and there was no way to run a bulk update because the unique constraint was violated during the update.
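
For reference, from PostgreSQL 9.0 onwards the constraint can be declared deferrable, which removes this limitation. A minimal sketch, with illustrative table and column names:

CREATE TABLE t_box_map
        (
           i_id_user integer NOT NULL,
           i_id_box integer NOT NULL,
           i_pos_x integer NOT NULL,
           i_pos_y integer NOT NULL,
           CONSTRAINT u_box_position
                UNIQUE (i_id_user, i_pos_x, i_pos_y)
                DEFERRABLE INITIALLY DEFERRED
        )
;

-- inside a transaction the constraint is checked only at commit,
-- so a bulk re-sequencing update no longer fails halfway through
BEGIN;
UPDATE t_box_map SET i_pos_x = i_pos_x + 1 WHERE i_id_user = 1;
COMMIT;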

This approach worked well on the staging environment; once pushed to the live system the map table grew to several million rows and the updates, with all the overhead introduced by the indexes themselves, simply did not scale.

Scale me this!

After one week of analysis and tests I realised that the solution was to completely rewrite the application logic, changing the conventional table into an aggregated table.

The original table had four fields: userid, boxid, xposition, yposition.

I moved the boxid from a column of the map table into a two dimensional text array, keeping the userid column as the primary key.

That way I could access the array by userid and get the boxid simply by passing the x-y coordinates. The task of building the box grid became incredibly simple with the array unnest, as in the sketch below.
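
A minimal sketch of the aggregated design; the names are illustrative, not the original schema:

CREATE TABLE t_user_grid
        (
           i_id_user integer NOT NULL,
           t_grid text[][],
           CONSTRAINT pk_id_user PRIMARY KEY (i_id_user)
        )
;

-- direct access by coordinates: the boxid at column 2, row 3 of user 1
SELECT t_grid[2][3] FROM t_user_grid WHERE i_id_user = 1;

-- rebuilding the whole grid for display; unnest flattens the
-- two dimensional array in row-major order
SELECT unnest(t_grid) AS boxid FROM t_user_grid WHERE i_id_user = 1;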

With this structure any box move, add or delete with drag and drop was performed over the array values and required only one table row update, as below.
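
For example, a box move becomes a single-row update of the array elements, again on the illustrative schema above:

-- both right-hand sides read the old row values, so this moves the
-- box from position [1][1] to [2][3] in one statement
UPDATE t_user_grid
        SET t_grid[2][3] = t_grid[1][1],
            t_grid[1][1] = NULL
        WHERE i_id_user = 1;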

The previous implementation required at least three updates for a move; for a box delete, one database delete plus a re-sequencing of all the remaining rows; for an add, a re-sequencing of all the rows plus an insert.
After the new implementation went live the improvement was massive.
At peak activity the longest update took 0.7 s.

Conclusion

After this experience, I recently used the same approach to implement a grid of FIFO queues, mapping a physical structure onto an abstract data structure.

Obviously arrays are not the solution to every scaling problem, but if carefully designed they can be used to aggregate data, easing the database updates massively.

As there’s no implementation of foreign keys over the array’s elements is not
possible, in theory, to have a referential integrity on those data structures.
I heard a proposal to have this feature added to PostgreSQL.

IMHO this is absolutely useless: if you need a foreign key on an array, then probably your database design is wrong.

If you are wondering where the web product described in this post is, it's a shame: it was decommissioned one year after I did my job.


Sunday, 2 September 2012

Don't let your improbability grow to infinity

After three posts on self celebration, best practice and scary things, now it's time to start talking about the DBA stuff. As you should have already discovered, this blog is not for beginners; I think there are too many howtos for basic things like installation and table creation, all scattered around the internet.
My intention is to explore the less known topics and try to give a clear solution for any possible problem.
This post will cover the database housekeeping: things to keep under your pillow to let the database run efficiently and let you sleep without worries.

All the section titles are from Douglas Adams' The Hitchhiker's Guide to the Galaxy; no copyright infringement intended. I used them because I love the book and I never forget my towel. And by the way, THE ANSWER IS 42.


Time is an illusion. Lunchtime doubly so.

PostgreSQL comes with a powerful feature called MVCC, which stands for Multi Version Concurrency Control. The database tracks, for each transaction, which data is visible and which is not, simply by assigning a transaction id, XID, when a new tuple is inserted and when the same tuple is deleted. Each tuple carries two system columns, xmin and xmax, used to determine the visibility in a simple way: anything greater than the current transaction id is in the future and therefore invisible. This works absolutely fine except for one big caveat. The XID is a 4 byte integer and wraps around every 4 billion transactions. In the early database versions this behaviour forced the DBA to initdb a new data directory and reload the data from a pg_dump every 4 billion transactions; the risk was the sudden disappearance of all data at XID wraparound.
To avoid this a new comparison method was adopted in later versions. The modulo-2^31 arithmetic guarantees that, for each XID, 2 billion transactions are greater, and therefore in the future, and 2 billion are lesser, and therefore in the past. A new special XID called FrozenXID, which is in the past for any real XID, was introduced to store safely the tuples with old transaction ids.
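
The two system columns are hidden from a plain SELECT * but can be asked for explicitly on any table; for example, on the system catalogue:

-- xmax is 0 for a live tuple never touched by a delete or an update
SELECT xmin, xmax, relname FROM pg_class LIMIT 5;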

When VACUUM finds a tuple whose XID age is greater than the configuration parameter vacuum_freeze_min_age, it freezes the xmin value, keeping it safe from the wraparound danger.
When one of the databases gets dangerously close to the wraparound failure the server starts emitting messages in the log; when it gets too close the system shuts down and refuses to start any new transaction. The safety margin is 1 million transactions and is enforced per database in the cluster.
So, to avoid any problem, the periodic vacuum must include all databases, even template1 and postgres. The distance from the wraparound can be checked with the query below.
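
A quick check on the age of each database's oldest unfrozen XID; values approaching 2 billion call for an immediate database-wide vacuum:

-- age() measures how far datfrozenxid is from the current XID
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;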



Mostly harmless

Indexes are a wonderful thing: they order data and give direct access to the data block without the need to read the entire table. Most people believe an index creation is always a good thing, even if the index is not used. That's wrong, as useless indexes add overhead to the table activity and the query planner can be confused by their presence.

An index block read requires the so called random access on disk which, by default, the planner costs at 4 times the corresponding sequential access (the random_page_cost parameter). An index block, up to version 9.1, does not carry the data but only the pointer to the heap block where the data is stored.

Any index read consists of at least two disk reads: one for the index block and another for the data block. Small tables do not use indexes because a sequential scan is quicker at getting the data into the shared buffer. The same goes for queries with no WHERE condition, as the entire table must be loaded in memory to satisfy the backend's request.

Because of MVCC, any update to a tuple is a new insert with a new xmin. This does not affect the index if the new tuple resides in the same block but, if the new tuple is stored in a new block, the index requires an update too, and this generates dead tuples in the index blocks themselves.
Index blocks with a mix of dead and live tuples prevent vacuum from recycling the block and cause index bloat, wasting space on disk and affecting the index performance.

The only solution to fix this is an index rebuild with REINDEX, usually not a cheap operation, as in the sketch below.
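
A minimal example with an illustrative index name; note that REINDEX locks the parent table against writes while it runs, so plan it in a maintenance window:

-- rebuild the index from scratch, discarding the dead entries
REINDEX INDEX idx_trn_datetime;
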
Useless indexes can also affect the query planner itself, as wrong performance assumptions are made.
Being a balanced data structure, assuming we're talking about the btree index type, the scan requires an entry point determined by the WHERE condition; if the condition covers the index only partially the scan will cost more than expected and the query becomes slower than expected, as the entry point cannot be determined on the root block.

Sometimes the execution plans of slow queries show one or more index scans.
This can happen if the index is used with a wrong entry point in the WHERE condition.
A new index matching the WHERE condition can solve the problem, but in that case keep in mind the overhead described above.
To check the index usage, the system view pg_stat_all_indexes shows the index name, the schema name and three counters


idx_scan

idx_tup_read

idx_tup_fetch
Respectively: the number of index scans initiated, the number of index entries returned by those scans, and the number of live table rows fetched using the index.
Zero or very low values in the idx_scan column should alert you about the index usage and make you ask why the index was created in the first place.
A high difference between idx_tup_read and idx_tup_fetch shows that the index is actually used but little live data is returned, probably because the WHERE condition is too stringent.
Solving these problems requires deep knowledge of the application itself, and probably a meeting with the developers to find a better solution. The unused indexes can be listed with a query like the one below.
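
A minimal sketch to spot the indexes never scanned since the last statistics reset, skipping the system schemas:

SELECT
        schemaname,
        relname,
        indexrelname,
        idx_scan,
        idx_tup_read,
        idx_tup_fetch
FROM
        pg_stat_all_indexes
WHERE
        idx_scan = 0
        AND schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY schemaname, relname
;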


The Ravenous Bugblatter Beast of Traal

To avoid problems with the query planner, before doing any sort of performance tuning analysis the first question to ask is: are the database statistics up to date?
The database statistics are one of the most important things to look after. Out of date statistics can result in unpredictable behaviour of the query planner with a subsequent performance downgrade.
To explain this I'll show you a little example.
An absolutely minimal table storing transactions can have a transaction id, the product code and the transaction timestamp.

CREATE TABLE 
        t_transact
        (
           i_id_trn bigserial,
           v_prod_code character varying (100),
           d_trn_datetime timestamp with time zone DEFAULT current_timestamp,
           CONSTRAINT pk_id_trn PRIMARY KEY (i_id_trn)
        )
;
To speed up the reads on the d_trn_datetime field let's create a new index.


CREATE INDEX idx_trn_datetime
        ON t_transact
        USING btree
        (d_trn_datetime )
;

Now let's insert ten records into this table.


INSERT INTO 
    t_transact
    (v_prod_code)

SELECT
     v_prod_code
FROM
     (
         SELECT
             generate_series(1,10),
             'tst_prod' as v_prod_code
     ) sl_prod

;
and check the execution plan used by the following query


SELECT
        i_id_trn,
        v_prod_code,
        d_trn_datetime
FROM
        t_transact
WHERE
        d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone
;

"Bitmap Heap Scan on t_transact  (cost=4.27..9.61 rows=2 width=234) (actual time=0.070..0.134 rows=10 loops=1)"
"  Recheck Cond: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"  ->  Bitmap Index Scan on idx_trn_datetime  (cost=0.00..4.27 rows=2 width=0) (actual time=0.043..0.043 rows=10 loops=1)"
"        Index Cond: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"Total runtime: 0.333 ms"

The planner chose a bitmap scan on the index to retrieve the requested rows, assuming the index is cheaper than the table scan.
Now let's collect fresh statistics on the table:
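
ANALYZE t_transact;

This is the new execution plan.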


"Seq Scan on t_transact  (cost=0.00..1.12 rows=10 width=25) (actual time=0.021..0.082 rows=10 loops=1)"
"  Filter: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"Total runtime: 0.198 ms"

Obviously a table with 10 rows consists of one single 8k block, and the sequential scan is the fastest way to get the data. Without statistics the database made the wrong assumption that the index was the better choice, resulting in a noticeably slower query (0.333 ms versus 0.198 ms), as two disk accesses were required instead of one.
In the same way, if the table grows big and the database doesn't know it, the sequential scan will result in slower queries.


DON'T PANIC!

Reviewing the previous sections, here are some housekeeping suggestions
  • run the periodic vacuum on every database in the cluster, template1 and postgres included, well before the 2 billion transaction limit
  • check the index usage and reindex when frequently updated tables show index bloat
  • keep the statistics up to date with regular analyzes to avoid planner confusion