Tuesday 25 December 2012

The Christmas Shitload Post

Happy Christmas! Believe it or not we are almost done. We’re almost out
of winter!

This post is my present for a brilliant new PostgreSQL year.
You know, Christmas usually means goodness, happiness, joy and peace.
And if you're expecting that, well, you will be sorely disappointed.
I decided to close the year writing down my feelings about VACUUM FULL, one
of PostgreSQL's key functionalities, and an interesting thing
I discovered in the utility script shipped with the Debian package for controlling
the clusters, pg_ctlcluster.

And believe me when I tell you I feel like the Angry Birds when the bad piggies
stole their eggs.

VACUUM two face

VACUUM is one of the most important functions shipped with PostgreSQL. Its
main goal is to free space by cleaning up the dead rows generated during normal
database activity.

The conventional VACUUM simply marks the dead rows as reusable
and freezes the old tuples, keeping the s***t from hitting the fan by preventing
the XID wraparound failure, but it does not shrink the data file.
If you have a large table with many dead rows, after a VACUUM you will
see the table stop growing, but its size on disk stays the same.
If you need to reclaim space on disk, VACUUM FULL is what you
need: it shrinks down the data file, locking the affected relation in
ACCESS EXCLUSIVE mode and preventing both reads and writes.

During a VACUUM FULL on a massively bloated table I noticed a strange
behaviour.

The free space on the table's tablespace decreased and then suddenly stopped
decreasing, while the free space on pg_default began to drop until the filesystem
ran out of space, with the subsequent failure of the task.

I started investigating the problem by increasing the client verbosity with
SET client_min_messages='debug';

I'll explain what I've found with a simple test using a
copy of the pg_attribute table.

To watch what's happening I've switched on the debug message level for the
client and turned on the trace_sort parameter during a single column index
creation on the attname field.
This is the entire procedure output.

db_test=# CREATE TABLE t_vacuum AS
db_test-# SELECT * FROM pg_attribute;
-- repeat this many times in order to
-- fill the table with a good amount of data
db_test=# INSERT INTO t_vacuum
db_test-# SELECT * FROM t_vacuum
db_test-# ;
INSERT 0 2155
db_test=# SET trace_sort='on';
db_test=# SET client_min_messages='debug';
db_test=# CREATE INDEX idx_attname ON t_vacuum USING btree (attname);
DEBUG:  building index "idx_attname" on table "t_vacuum"
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  switching to external sort with 59 tapes: CPU 0.04s/0.48u sec elapsed 0.53 sec
LOG:  performsort starting: CPU 0.04s/0.72u sec elapsed 0.77 sec
LOG:  finished writing run 1 to tape 0: CPU 0.08s/2.56u sec elapsed 2.65 sec
LOG:  finished writing final run 2 to tape 1: CPU 0.08s/2.56u sec elapsed 2.65 sec
LOG:  performsort done (except 2-way final merge): CPU 0.09s/2.61u sec elapsed 2.72 sec
LOG:  external sort ended, 836 disk blocks used: CPU 0.14s/2.84u sec elapsed 3.16 sec

As my work_mem is small, the index sort switches immediately to an external
sort in order to build the index, then generates the btree from the sorted runs.
Now let's run a VACUUM FULL on the same table.

db_test=# VACUUM FULL t_vacuum;
DEBUG:  building index "pg_toast_16411_index" on table "pg_toast_16411"
LOG:  begin index sort: unique = t, workMem = 16384, randomAccess = f
LOG:  begin index sort: unique = f, workMem = 1024, randomAccess = f
LOG:  internal sort ended, 17 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 17 KB used: CPU 0.00s/0.00u sec elapsed 0.08 sec
DEBUG:  vacuuming "public.t_vacuum"
DEBUG:  "t_vacuum": found 0 removable, 275840 nonremovable row versions in 4756 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 0.17s/0.99u sec elapsed 1.87 sec.
DEBUG:  building index "idx_attname" on table "t_vacuum"
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  switching to external sort with 59 tapes: CPU 0.07s/0.51u sec elapsed 0.58 sec
LOG:  finished writing final run 2 to tape 1: CPU 0.14s/2.58u sec elapsed 2.73 sec
LOG:  performsort done (except 2-way final merge): CPU 0.14s/2.64u sec elapsed 2.80 sec
LOG:  external sort ended, 836 disk blocks used: CPU 0.21s/2.85u sec elapsed 3.27 sec
DEBUG:  drop auto-cascades to type pg_temp_16402
DEBUG:  drop auto-cascades to type pg_temp_16402[]
DEBUG:  drop auto-cascades to toast table pg_toast.pg_toast_16411
DEBUG:  drop auto-cascades to index pg_toast.pg_toast_16411_index
DEBUG:  drop auto-cascades to type pg_toast.pg_toast_16411

The interesting part begins at row 9. The database actually performs the
VACUUM, shrinking down the data file and cleaning the dead rows; then at row
13 we find the same index creation output.
This explains well why the VACUUM FULL failed.

VACUUM FULL acts exactly like CLUSTER: during the process the
table is read sequentially and the data output is sent to a new file node, and this
happens on the same tablespace the original table belongs to.

The index build is performed on the same index tablespace, except for the
sort on disk, which is governed by the temp_tablespaces parameter. If this
is not set then the sort is performed on pg_default.

In conclusion, if you want to run a VACUUM FULL you will need space for the new
file node in the same tablespace as the table. The indexes require space in their
own tablespace, plus room for the sort on disk if you don't have enough memory.

So, if you are short of space on pg_default, setting temp_tablespaces to a
more spacious tablespace will prevent pg_default from running out of space.
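
As a hedged sketch (the tablespace name ts_big is hypothetical and must already exist), the workaround looks like this:

```sql
-- Hypothetical spacious tablespace; create it first if needed:
-- CREATE TABLESPACE ts_big LOCATION '/mnt/big_disk/pg_tbs';

-- Route on-disk sorts away from pg_default for this session:
SET temp_tablespaces = 'ts_big';

-- The index rebuild sorts triggered by VACUUM FULL now spill to ts_big:
VACUUM FULL t_vacuum;
```

Setting temp_tablespaces only affects where the sort files land; the new file node for the table itself still needs room in the table's own tablespace.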

Aggressive shutdown

Working with the hot standby on Debian I had a bad experience when I set up
a new slave server. After the data copy, with the archive command started, I
actually started the slave, forgetting to set the hot_standby parameter to on.
After a while the slave reached the consistent state and I issued the
pg_ctlcluster command in order to stop the cluster, with the --force option
to have a fast shutdown instead of a smart one.
The script doesn't have an option to switch between PostgreSQL's shutdown
modes but, if run without options, it performs a smart shutdown.
Adding the --force option, the script first tries a fast shutdown; if the
database doesn't shut down in a reasonable time the script switches to an
immediate shutdown; if this is not enough to stop the instance, finally a
kill -9 on the postgres process is performed.


If you are easily offended I suggest you stop reading: now begins the real rant.

I understand the hard effort put in by the global team to create a first class
product. I left Oracle for PostgreSQL and I don't regret my choice because
PostgreSQL is a damn good database, probably too good.

But can anyone explain to me why, in the name of any of the thousands of
fake gods invented by the human race, no one put the VACUUM FULL
behaviour in the online manual, leaving this HUGE gap in the documentation?

Regarding the Debian script, I understand the --force option of pg_ctlcluster is
not meant to be used in normal conditions, but does the author understand what
it means to perform a smart shutdown in normal conditions?

A user that forgets to log out can make the script run forever.
On the other side, if --force is passed to the script and the database takes too
long to shut down, an immediate shutdown is performed without warnings.

And this will wipe out any bloody unlogged table from the cluster on
the next startup.

Then, as the cherry on top, if the immediate shutdown does not work a kill -9 is
performed on the postmaster, leaving garbage in shared memory.

And this works the same way if you use the Debian init script.

HEAVY MACHINE GUN! (Metal Slug quote)
So, my last advice is: do not use the wrapper pg_ctlcluster for stopping the
clusters created with pg_createcluster.
It's better to shut down the cluster using the command:

/usr/lib/postgresql/<MAJOR VERSION>/bin/pg_ctl -D <DATA DIRECTORY> stop -m <THE MODE YOU WANT>


Thursday 20 December 2012

Time and relative arrays in space

Scale what?

A few years ago, when I was self employed as a PostgreSQL consultant, I was
appointed a big performance tuning task by a great search engine (no, not
THAT search engine).

The application was similar to Google's desktop, a grid of boxes with
RSS feeds. Unfortunately the implementation of the two dimensional grid had
bad performance issues.

Any position change with a drag and drop, a delete or an add, even in low
activity periods, required the consequent database update to run for at least
20 seconds. During peak time the system was unusable, as PHP hit its
execution time limit and this caused any transaction to abort.

The simple design, a table with two integer columns mapping the X-Y
coordinates with a unique constraint to avoid duplicates, wasn't scaling, and
the problem was caused by the unique constraint itself.

As PostgreSQL 8.4 didn't support deferrable unique constraints,
a position change required an update for each row in order to keep the sequence
without holes, and there was no way to run a bulk update because the unique
constraint was violated during the update.

This approach worked well on the staging system; once pushed to the live system
the map table grew several million rows big and the updates, with all the
overhead introduced by the indexes themselves, simply did not scale.

Scale me this!

After one week of analysis and testing I realized that the solution was to
rewrite the application logic completely, changing the conventional table into
an aggregated table.

The original table had four fields: userid, boxid, xposition, yposition.

I moved the boxid from the map table into a two dimensional text
array, keeping the userid column as the primary key.

That way I could access the array by userid and get the boxid simply by
passing the x-y coordinates. The task of building the box grid became
incredibly simple with the array unnest.

With this structure any box move, add or delete with drag and drop was
performed over the array values and required only one table row update.
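
A minimal sketch of the aggregated design, with hypothetical names (t_user_grid, boxgrid; the original schema is not shown in the post), could look like this:

```sql
CREATE TABLE t_user_grid
(
    userid  integer PRIMARY KEY,
    boxgrid text[][]   -- boxid stored at position [y][x]
);

INSERT INTO t_user_grid
VALUES (1, ARRAY[['box_a','box_b'],['box_c','box_d']]);

-- Direct access by coordinates:
SELECT boxgrid[1][2] FROM t_user_grid WHERE userid = 1;

-- A drag and drop move touches a single row:
UPDATE t_user_grid SET boxgrid[1][2] = 'box_e' WHERE userid = 1;

-- Rebuilding the whole grid for display is a simple unnest:
SELECT unnest(boxgrid) FROM t_user_grid WHERE userid = 1;
```
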

The previous implementation required at least three updates for a move.
For a box delete, one database delete plus a resequencing of all the
remaining rows. For an add, a resequencing of all the rows and an insert.
After the new implementation went live the improvement was massive:
in peak activity the longest update took 0.7 seconds.


After this experience, I recently used the same approach to implement a grid of
FIFO queues, simulating a physical structure in an abstract data structure.

Obviously arrays are not the solution to every scaling problem but, if
carefully designed, they can be used to aggregate data, massively easing the
database workload.
As there's no implementation of foreign keys over array elements, it is not
possible, in theory, to have referential integrity on those data structures.
I heard a proposal to have this feature added to PostgreSQL.

IMHO this is absolutely useless: if you need a foreign key on an array, then
probably your database design is wrong.

If you are wondering where the web product described in this post is, it's
a shame: it was decommissioned one year after I did my job.

Friday 23 November 2012

Concurrently numb

Here we go again. I'm not dead, not yet.
I've just moved home and started a new job and, I'll be honest, before this job I'd just played with toys.

With this post I want to start a talk about indexes, how good they can be and how a wrong action or assumption can buy you a one way ticket to disaster.
So, let's start with the CONCURRENTLY clause of the CREATE INDEX statement and a short description of PostgreSQL's general purpose indexes, the b-trees.

Index physical structure

PostgreSQL implements b-tree index access using Lehman and Yao's high-concurrency B-tree management algorithm; for an extensive description you can check out the README file in the source directory src/backend/access/nbtree/README
The chapter on database physical storage gives a good explanation of how the data files are organized and how the block structure is laid out, but doesn't say much about index blocks.
An index block is quite similar to a heap block, with a 24 byte page header holding references to the corresponding WAL entry and timeline, plus the pointers to the beginning of the free space, to the end of the free space and to the beginning of the special space.
After the header resides the item pointer section, pointing to the tuples stored at the end of the page; each pointer is usually 4 bytes on 32 bit and 8 bytes on Windows and 64 bit architectures.
The last section, the special space, is usually 16 bytes long and stores the pointers to the related pages inside the tree. Another difference from data files is the presence of the metapage, the first page, which stores the pointers to the root page, or pages; with this implementation it is possible to have the root page split.

Table and index space overhead

Any item in PostgreSQL has special attributes identified with negative attnum values in the pg_attribute system table.
Because of this structure every item has an extra overhead of 27 bytes, even if the table doesn't have any columns.
Another difference from heap pages is the fill factor, fixed at 70% for non-leaf pages and 90%, but adjustable, for leaf pages.

The visibility map and index only scans

Up to version 9.1 the index is used only to track the ordered data pointers to the corresponding heap blocks.
As the index carries the values in an ordered and structured tree, it is theoretically possible to read only the index to get the data if the SELECT contains only the indexed columns.
Another point for my mantra: SELECT * FROM table; IS THE ROOT OF EVIL.
The problem with the index only scan is related to tuple visibility.
Inside a transaction the index can be temporarily out of date for deleted tuples, and this can produce inconsistent results.
The visibility map is then used to track the visible tuples to keep the index read consistent.
The problem with this approach, before version 9.2, is that this fork is not crash safe and, in case of an instance crash, could result in an out of date map.
In version 9.2 visibility map changes are logged in the WAL, making the map crash safe and enabling the index only scan.
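
As a hedged sketch (the table and index names are made up for illustration), on 9.2 or later the planner can choose the index only scan when the query touches just the indexed column:

```sql
CREATE TABLE t_demo (id integer, payload text);
CREATE INDEX idx_demo_id ON t_demo USING btree (id);
-- load some data, then VACUUM so the visibility map is set:
INSERT INTO t_demo SELECT i, 'row ' || i FROM generate_series(1,100000) i;
VACUUM t_demo;
-- only the indexed column is selected, so the heap read may be skipped:
EXPLAIN SELECT id FROM t_demo WHERE id < 100;
```

Whether the plan actually shows an Index Only Scan depends on how much of the table the visibility map marks all-visible.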

Bloating indexes

As any PostgreSQL expert should know, UPDATE stands for INSERT.
Every time an update is performed a new row version is created with xmin set to the current transaction XID, and the previous version gets xmax set to the same XID.
This permits the magic of MVCC without the use of a rollback segment as Oracle does.
The problem with this approach is that often updated tables can bloat quickly, and the last updated tuples move across the data files every time a new block is created to store the newer versions.
When this happens, if there's an index referencing the tuple, it must be updated as well with a new row version pointing to the new heap block.
While a table, being a sequential unordered collection of blocks, simply adds the new block at its physical end, the index must be traversed to find where the updated row lives, and the new block creation must follow the b-tree structure in a very expensive process.
Even worse, if the updated value is in a composite index and is not the first column, the index entry point is unknown and the read for the update requires a great amount of CPU cycles and disk seeks.
Then the cherry on the cake: the vacuum process cannot mark an index block reusable if it contains a mix of live and dead tuples.
I think you got the point.
Indexes can bloat, with values scattered over multiple blocks where dead tuples slow down the search process. As the optimizer doesn't know about this, an index scan over a bloated index can have horrible performance.
And I didn't mention the waste of space...

This is a job for REINDEX

Of course PostgreSQL comes with good maintenance statements like VACUUM or CLUSTER for the tables, and REINDEX for the indexes.
REINDEX builds a new index, updates the system catalogue to point to the new one and drops the old one. Looks simple? Not exactly.

To do this an exclusive lock must be acquired on the table, and this prevents any write for the time needed to reindex.
Reads are not affected, as REINDEX needs only a consistent snapshot to build the new relation.
The index build requires the data to be sorted, and if the amount doesn't fit in the work_mem value the sort will be performed on disk, slowing down the entire process.
The exclusive lock has one big advantage: if some process is blocking the REINDEX from acquiring the lock, the procedure will not start.
This doesn't happen with the concurrent build (REINDEX, thanks Rassilon, does not have the CONCURRENTLY clause) and this can give you an extra free ticket to the previously mentioned disaster.

Concurrently what?

The concurrent build creates a new INVALID index which is immediately visible, then performs a second table scan to update the index with the values changed during the creation, and finally acquires an exclusive lock over the table to validate the index. The point of failure is the last step. As no check is performed before the index build starts, if there's any blocking process, for example pg_dump, the last step will wait for the lock; meanwhile the INVALID index is attached and updated, adding overhead to the table's operations.
If you cancel the query the index is not dropped automatically. It stays in place in the INVALID state until it is dropped.
But the drop requires an exclusive lock on the pg_class table, and if the blocking process is still around...
Now, what happens if you create a unique index and, during the second table scan, the unique condition is violated by transactions not visible when the first scan started? The index build is aborted and, same as before, an INVALID index is left attached to the table.
You see. Do you like aisle or window?
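
Leftover invalid indexes can be spotted with a catalogue query; pg_index.indisvalid is the flag a failed CREATE INDEX CONCURRENTLY leaves set to false:

```sql
-- List every index currently marked INVALID:
SELECT n.nspname AS schema_name,
       c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
```

Anything this returns is pure overhead: it is maintained on every write but never used for reads until validated or dropped.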

Wrap up

Let's review things.

  • Indexes bloat; sooner or later you'll need to cope with it
  • A periodical reindex reclaims space and improves speed
  • If a write operation is slow, take a look at the indexes
  • Reindex locks the writes but not the reads
  • It's possible to create a new index without the write lock, but beware the dragon

Sunday 2 September 2012

Don't let your improbability grow to infinite

After three posts on self celebration, best practice and scary things, now it is time to start talking about the DBA stuff. As you should have already discovered, this blog is not for beginners. I think there are too many howtos for basic things like installation and table creation, all scattered around the internet.
My intention is to explore the unknown topics and try to give a clear solution for any possible problem.
This post will cover database housekeeping, things to keep under your pillow to let the database run efficiently and let you sleep without worries.

All the section titles are from Douglas Adams' The Hitchhiker's Guide To The Galaxy, no copyright infringement intended. I use them because I love the book and I never forget my towel. And by the way, THE ANSWER IS 42.

Time is an illusion. Lunchtime doubly so.

PostgreSQL comes with a powerful feature called MVCC, which stands for Multi Version Concurrency Control. The database tracks, for each transaction, which data is visible and which is not, simply by assigning a transaction id, the XID, when a new tuple is inserted and when the same tuple is deleted. Each tuple carries two system columns, xmin and xmax, used to determine the visibility in a simple way: anything greater than the current transaction id is in the future and therefore invisible.
This works absolutely fine except for one big caveat. The XID is a 4 byte integer and wraps every 4 billion transactions. In the early database versions this behaviour forced the DBA to initdb a new data directory and reload the data from a pg_dump every 4 billion transactions. The risk was the sudden disappearance of all data at XID wrap.
To avoid this, a new comparison method was adopted in later versions. The modulo-2^31 method guarantees that, for each integer, 2 billion are greater, and therefore in the future, and 2 billion are lesser, therefore in the past. A special XID called frozenXID, which is in the past for any real XID, was introduced to keep tuples with old transaction ids in a safe condition.
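
The comparison can be sketched in plain SQL (illustrative only; the server does this internally in C with 32 bit modular arithmetic):

```sql
-- After a wrap, XID 3 must count as newer than XID 4000000000:
-- b is "after" a when ((b - a) modulo 2^32) is below 2^31.
SELECT ((3::bigint - 4000000000 + 4294967296) % 4294967296) < 2147483648
       AS xid_3_follows_4000000000;   -- returns true
```
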

When VACUUM finds a tuple whose XID age is greater than the configuration parameter vacuum_freeze_min_age, it freezes the xmin value, keeping it safe from the wraparound danger.
When one of the databases gets dangerously near to the wraparound failure the server starts emitting messages in the log; when it gets too near, the system will shut down and refuse to start any new transaction. The limit is 1 million transactions and it applies to each database in the cluster.
So, to avoid any problem, the periodic vacuum must include all databases, even template1 and postgres.
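
You can watch how close each database is with a catalogue query; pg_database.datfrozenxid and the age() function are the real pieces involved:

```sql
-- Transactions elapsed since each database-wide frozen XID;
-- the closer to 2 billion, the closer to trouble.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```
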

mostly harmless

Indexes are a wonderful thing: they order data and give direct access to the data block without the need to read the entire table. Most people believe an index creation is always a good thing. That's wrong, as useless indexes add overhead to the table activity and the query planner can be confused by their presence.

An index block read requires the so called random access on disk whose cost, by default, is 4 times that of a corresponding sequential access. An index block, up to version 9.1, does not carry the data but only the pointer to the heap block where the data is stored.

Any index read consists of at least two disk reads, as one is for the index block and another one for the data block. Small tables do not use indexes because a sequential scan is quicker to get the data into the shared buffer. The same goes for queries with no WHERE condition, as the entire table must be loaded into memory to satisfy the backend's request.

Any update to a tuple is, because of MVCC, a new insert with a new xmin. This does not affect the index if the new tuple resides in the same block but, if the new tuple is stored in a new block, the index requires an update and this generates dead tuples in the index blocks themselves.
Index blocks with mixed dead and live tuples prevent vacuum from recycling the block and cause index bloat, wasting space on disk and affecting the index performance.

The only solution to fix this is an index rebuild with REINDEX, usually not a cheap operation.
Useless indexes can also affect the query planner itself, as wrong performance assumptions are made.
Being a balanced data structure (assuming we're talking about the btree index type), the scan requires an entry point determined by the WHERE condition; if this is only partially covered by the index, the scan will cost more than expected and the query becomes slower than expected, as the entry point cannot be determined on the root block.

Sometimes slow query execution plans show one or more index scans.
This can happen if the index is used with a wrong entry point in the WHERE condition.
A new index matching the WHERE condition can solve the problem, but in that case keep in mind the problems caused by the overhead.
To check the index usage, the pg_stat_all_indexes system view shows the index name, the schema name and three columns: idx_scan, idx_tup_read and idx_tup_fetch.

Respectively, they are the number of index scans initiated, the number of index entries read and the number of live table rows fetched with the index.
Zero or low values in the idx_scan column should alert you about the index usage and make you ask why this index was created.
A high difference between idx_tup_read and idx_tup_fetch shows the index is actually used but no data is returned, probably because the WHERE condition is too stringent.
Solving these problems requires deep knowledge of the application itself and probably a meeting with the developers to find a better solution.
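
A starting point for that meeting: list the indexes that have never been scanned since the statistics were last reset.

```sql
-- Candidate useless indexes: maintained on every write, never read.
SELECT schemaname, relname, indexrelname,
       idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_all_indexes
WHERE idx_scan = 0
  AND schemaname NOT IN ('pg_catalog', 'pg_toast');
```

Remember the counters are cumulative since the last statistics reset, so check how long the statistics have been collected before dropping anything.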

The Ravenous Bugblatter Beast of Traal

To avoid problems in the query planner, and before doing any sort of performance tuning analysis, the first question to ask is: are the database statistics up to date?
The database statistics are one of the most important things to look after. Out of date statistics can result in unpredictable query planner behaviour with subsequent performance degradation.
To explain this I'll show you a little example.
An absolutely minimal table to store transactions can have a transaction id, the product code and the transaction timestamp.

CREATE TABLE t_transact
(
           i_id_trn bigserial,
           v_prod_code character varying (100),
           d_trn_datetime timestamp with time zone DEFAULT current_timestamp,
           CONSTRAINT pk_id_trn PRIMARY KEY (i_id_trn)
);
To speed up the read on the d_trn_datetime field let's create a new index.

CREATE INDEX idx_trn_datetime
        ON t_transact
        USING btree
        (d_trn_datetime);

Now let's insert ten records in this table.

INSERT INTO t_transact (v_prod_code)
SELECT sl_prod.v_prod_code
FROM
     (
             SELECT 'tst_prod' as v_prod_code
             FROM generate_series(1,10)
     ) sl_prod
;

and check the execution plan used by the following query.

db_test=# EXPLAIN ANALYZE SELECT *
FROM t_transact WHERE d_trn_datetime='2012-09-02 15:42:20.447888+01'::timestamp with time zone;
"Bitmap Heap Scan on t_transact  (cost=4.27..9.61 rows=2 width=234) (actual time=0.070..0.134 rows=10 loops=1)"
"  Recheck Cond: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"  ->  Bitmap Index Scan on idx_trn_datetime  (cost=0.00..4.27 rows=2 width=0) (actual time=0.043..0.043 rows=10 loops=1)"
"        Index Cond: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"Total runtime: 0.333 ms"

The planner chooses a bitmap scan on the index to retrieve the requested rows, assuming the index is cheaper than the table scan.
After running ANALYZE on the table t_transact this is the new execution plan.

"Seq Scan on t_transact  (cost=0.00..1.12 rows=10 width=25) (actual time=0.021..0.082 rows=10 loops=1)"
"  Filter: (d_trn_datetime = '2012-09-02 15:42:20.447888+01'::timestamp with time zone)"
"Total runtime: 0.198 ms"

Obviously the table with 10 rows consists of one single 8k block and the sequential scan is the fastest way to get the data. Without statistics the database made the wrong assumption that the index was a better choice, resulting in a query 40% slower, as two disk accesses are required.
In the same way, if the table grows big and the database doesn't know it, the sequential scan will result in slower queries.


Reviewing the previous sections, here are some housekeeping suggestions:
  • run vacuum regularly on every database in the cluster, well before 2 billion transactions elapse
  • check the index usage and reindex if tables are updated often
  • check that the statistics are up to date to avoid planner confusion

Saturday 25 August 2012

Danger Master! Danger!

to configure the PITR you should save the $PGDATA with tar or in alternative you can use pg_dumpall
PostgreSQL trainer describing the point in time recovery

An oid type can carry on 2Gb
PostgreSQL guru explaining the binary data type

Maybe you can use pg_resetxlog to solve your problem
A clueless user trying to solve a major version incompatibility data directory issue

Danger, man at work

This is a small excerpt of what I've seen browsing the technical forums and community websites.
My first reaction reading all this nonsense was anger, as these guys have a good reputation in the PostgreSQL community and their ignorance can cause data loss, damage and, ultimately, harm PostgreSQL.
Obviously it's impossible to cover everything, so I'm writing this post as a warning: do not trust people by reputation alone. You should investigate any suggested solution and absolutely DO NOT TRY IT ON PRODUCTION, EVER!

Is it an airplane? Is it a UFO? It's a dumb a**

Point in time recovery is a wonderful feature introduced in version 8.1; it enables the capability to restore the entire database cluster at the exact date and time you want.
In this post I'm not describing how to configure it, as the documentation is well written. I'll show how it works to make clear the danger in the trainer's quote.

PITR is a physical backup where you actually copy the data files while the database is running. You can do it safely because the write ahead logs generated during the copy are archived, and all the changes can be applied to the inconsistent data files later, when you need to restore the cluster.
This is possible because the blocks stored in the archived WAL carry the information about where the block is located on disk. In the restore process the physical data block in the data file is overwritten by the WAL data block during the startup process, when the database performs the instance recovery.
pg_dumpall is a logical backup wrapper for the general purpose pg_dump. If you look at pg_dump's source code you will see a collection of SELECT commands within a strict SERIALIZABLE transaction isolation level.

The result is an output file with the SQL commands to run against a different cluster to recreate the databases, the tables and the data.

The restore with this tool requires an empty cluster generated with initdb, up and running. pg_dumpall does not perform an instance recovery but needs a fully running instance, which means all the archived WAL files are useless, as they are usable only during recovery.

So, if you follow the procedure suggested by the trainer and choose pg_dumpall, you will delude yourself that you have PITR, because you are using a logical tool to restore a physical backup, and you will lose any database change between your pg_dumpall and the moment the live database server was lost.

OID is not a TARDIS

PostgreSQL has the built in bytea data type for binary objects. As it is a TOASTable type with variable length, the maximum allowed size is 1 GB. Binary data stored this way lives in line like any table column, and you can read and write the content using conventional SQL statements.
If the binary data doesn't fit in 1 GB, PostgreSQL allows another strategy to store up to 2 GB as a large object.
The built in functions lo_import and lo_export can read and write files from disk, storing the data in a system table called pg_largeobject.
This table has three columns:

  • loid TYPE oid
  • pageno TYPE integer
  • data TYPE bytea

When the function lo_import is called, PostgreSQL stores the binary data in pg_largeobject, splitting it into pages identified by the same loid value and increasing pageno values. The function returns the OID value to reference the stored large object.
The function lo_export does the reverse: it reads the binary chunks for the given OID and writes the binary stream back to disk.
So, the second quote shows what a lack of attention and of simple logical deduction the PostgreSQL guru has: an oid carries nothing at all, it is just a reference to data chunked into pg_largeobject.
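
A hedged sketch of the round trip (the file paths are illustrative, and the OID passed to lo_export must be the one actually returned by lo_import):

```sql
-- Import a file from disk; returns the OID of the new large object:
SELECT lo_import('/tmp/payload.bin');
-- Export the large object back to disk, given that OID (16459 is illustrative):
SELECT lo_export(16459, '/tmp/payload_copy.bin');
```

Note both functions read and write files on the server's filesystem, so they need the appropriate privileges.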

I'll burn my house to light a candle

For the three seconds after reading this quote I remained speechless: the suggestion is probably the most dangerous, and by far the most brutish, thing you can do to your data directory.
I discovered this idiocy browsing the forum, probably too late to avoid the disaster, as the post was months old.
As any PostgreSQL DBA should know, the WAL files save the data block changes to non-volatile storage before the writes actually reach the data files; if a crash happens, the data blocks can be applied to the data files by reading them back from the WAL.
The process is the same used for PITR; the only differences are that no archived WAL is involved and that only the files in pg_xlog are used to replay the blocks onto the data files.
The global directory contains an 8k file called pg_control where the last checkpoint location is stored; it tells the database which WAL segment to start the replay from during instance recovery.
Any corruption in pg_control or in the WAL files results in an instance unable to start. If every other solution fails, and I repeat every other solution, it is still possible to start the instance using the program pg_resetxlog.
This program does one simple thing: it deletes everything in pg_xlog and generates a new pg_control, resetting the transaction log to a brand new starting point.
The instance can start again, but any consistency between the data blocks and the WAL is lost, and any attempt to run read-write queries can result in data corruption.
The online manual is absolutely clear on this.
After running pg_resetxlog the database must be started without user access, the entire content must be dumped using pg_dumpall, the data directory must be dropped and recreated from scratch using initdb, and then the dump file must be restored using psql.
After the scary part, let me explain the problem, a common one indeed, and the solution.
If you use the packaged PostgreSQL shipped with the most common Linux distributions, it can happen that a distribution upgrade jumps PostgreSQL a major release, for example from 8.4 to 9.1.
When this happens the new binaries refuse to start on a data directory initialized with the previous major version.
The solution is quite simple. You can compile a matching PostgreSQL version, start the instance and dump the contents to restore into the new major version's data directory; or, if you have an enterprise amount of data and you are in a hurry to get back online, you can use pg_upgrade, a powerful tool for in-place upgrades.
I hope these three examples have scared you enough to test, and to check the documentation, before doing anything.

Saturday 18 August 2012

The angry dba

Before anything on database administration, let me introduce some best practices I acquired during my experience at the consulting company I mentioned before.
This company offers Oracle services to important enterprises in Italy and, despite its size, is well structured and organized.
The first thing I learned was how to write good documents, not too verbose but complete in every part.
Then I was inducted into the company's best practices for SQL coding, as Oracle has some limitations on the size and characters allowed in identifiers.
PostgreSQL does not have these limitations, giving freedom of choice of format and structure, with Unicode characters (if the database is UTF8 encoded, of course) and a 63 character maximum length for identifiers.
This is a good thing but, believe me, sometimes it can be a bloody nightmare.
I've written some guidelines to get things sorted out.

Rule 1 – No camel case or reserved keywords in the identifier name

Let's start from the beginning.
Should you need to create a table to store debit notes for the year 2002, which name would you choose?
My old school development habits suggest something like debit_notes_2002.
A new school developer, maybe used to working with AGILE, SCRUM, ITIL and S****, will probably use the name DebitNotes2002.
In PostgreSQL this is a problem, because all unquoted names are folded to lower case, making the camel case useless.
It's possible to preserve the case by enclosing the identifiers in double quotes ".
That means that ANY query involving your beautiful identifiers will carry the quoting, and this affects the readability of the query.
The same happens for any identifier named after a reserved keyword.
Same story for the column names.
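A quick sketch of the folding, using hypothetical table names:

```sql
-- The unquoted name is folded to lower case at creation time...
CREATE TABLE DebitNotes2002 (debnotid integer);
SELECT * FROM debitnotes2002;       -- works, the table really is debitnotes2002
SELECT * FROM DebitNotes2002;       -- works too, the query folds the same way

-- ...while the quoted name keeps its case forever.
CREATE TABLE "DebitNotes2003" (debnotid integer);
SELECT * FROM DebitNotes2003;       -- fails, this looks for debitnotes2003
SELECT * FROM "DebitNotes2003";     -- works, but every query needs the quotes
```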

Rule 2 – self explaining schema

In rule 1 I declared I would name the table for debit notes debit_notes_2002.
I lied.
I like to put a prefix on every identifier to keep track of which kind of relation I'm dealing with.
PostgreSQL stores every relation in one system table named pg_class and gives you system views like pg_tables and pg_indexes to query.
But why should I bother with system catalogue queries, or with asking people, if I can let the database describe itself?
The way to do this is quite simple.
I put a prefix on every identifier to tell what kind it is or which data it carries.
Adding a t_ prefix to a table name tells any developer and, most important, any DBA/performance tuner, that the relation carries physical data.
Adding a v_ prefix to any view tells that we are dealing with a shortcut to a query and, hopefully, this will discourage the bad habit of joining views, which sends the query planner nuts.
In the same way, giving the columns a prefix that tells the data type reduces the developers' risk of wrong behaviour from a wrong typecast, and gives the DBA precious information about the space and storage used by the table.
A good structure for building self explaining names can be this.

  • Unique index
  • Primary key
  • Foreign key
  • Unique key
  • Sql function
  • Plpgsql function
  • Plpython function
  • Plperl function

Table 1: Object prefix

The same goes for the data types used for the columns.
An example for the most common data types, such as character varying, is shown in table 2.
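As a sketch of the whole convention: the t_ and v_ object prefixes are the ones described above, while the column prefixes (i_ for integer, v_ for character varying, d_ for date) are hypothetical examples, since your own table 2 may differ:

```sql
-- Physical table: the t_ prefix marks a relation carrying real data,
-- and each column prefix declares its data type.
CREATE TABLE t_debit_notes_2002
(
        i_id_note       serial,
        v_description   character varying(200),
        d_note_date     date
);

-- View: the v_ prefix marks a simple shortcut to a query.
CREATE VIEW v_debit_notes_2002 AS
        SELECT i_id_note,
               v_description
        FROM t_debit_notes_2002;
```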

Rule 4 – format the SQL queries

Sometimes I receive requests to debug or optimize queries written this way:
select * from debitnoteshead a join debitnoteslines b on debnotid where a.datnot=b.datnot and b.deblin>1;
Frustrating isn't it?
Despite my legendary bad temper, when I'm on duty I'm absolutely professional, and probably too kind, because I don't bounce the request: I fix the query.
I know, I know, I'm doing it wrong.
The company doesn't pay my salary to hit tab, enter and ctrl+u, but to keep the company's systems up, running and sound.
So, if you don't want to waste your time formatting, when you receive something like this I suggest you simply bounce the request back asking for a pretty format.
Some simple rules can enhance the readability.
  1. SQL keywords in upper case
  2. All the identifiers must be indented at the same level
  3. All the same level SQL keywords must be indented at the same level
  4. Avoid SELECT * as it is a waste of memory and doesn't tell anything useful
  5. If possible, make the inner join explicit in the WHERE condition
  6. In an explicit join use the keyword ON to make clear which column is used for joining
  7. In the table list use self explaining aliases
Using those rules the previous query looks like this:
       SELECT
               *
       FROM
               debitnoteshead hdr,
               debitnoteslines lns
       WHERE
                   hdr.debnotid=lns.debnotid
               AND hdr.datnot=lns.datnot
               AND lns.deblin>1
       ;
Now it looks beautiful.

With this last advice my first blog post ends.
Just a small afterword before closing.
Probably someone will think these guidelines are too strict to be accepted.
Obviously nothing can be done for the code already written, but you can still teach the developers to write decent code.
I think this is a goal that can be achieved and, believe me, it's worth it.
For new projects these guidelines give you a powerful vision of the database structure without wasting your time in data dictionaries or asking people about the data.
IMHO our job is a war where the casualties are downtime and data loss.
I consider data the most precious thing a company owns, and our mission is definitely to guarantee the smallest reaction time to fix any problem, minimizing any possible loss of money.

Thursday 16 August 2012


What is blue, bigger in the inside and with time travel capabilities?
If your answer is the TARDIS then, yes, you're close, but the right one is PostgreSQL.

 Welcome to my little quiet space in the noisy and confusing PostgreSQL universe.

I started working seriously with databases in 2004; before that I toyed with small MySQL-driven web applications. Then my life changed when I was employed by a big consulting company in Florence, and I became a member of what I call the Oracle DBA brotherhood.

I discovered how big and challenging working with serious amounts of data can be, and how important it is to have clear, well defined and well written information, possibly in one single place and not scattered across time and space.

I left the Oracle world in 2007 and created one of the biggest PostgreSQL conferences in Europe; if I remember well, before then the name PGDay was unknown.

Now I've moved from Italy to the UK (I'm Italian, actually, so please accept my apologies for my English) and my feelings were confirmed.
The pure PostgreSQL DBA is a rare animal; personally I know only two of them, myself and a clever guy from the East Midlands I had the pleasure to work with.

I tried to understand why, despite the wide use by great companies, PostgreSQL has so few real DBAs and so many people pretending to be one.

IMHO the reason can be the lack of good training on this complex subject.

Yes, you can get training from great companies, directly from the developers, but let me ask you a question.

Should you want to learn how to drive a Formula 1 car, who do you ask?
The engineer or the driver?

I am the driver.