Sunday 29 November 2015

Meetup wrap up

This second meetup went very well. The audience was interested and we had a fun time thanks to the beer and pizza offered, alongside the venue, by our sponsor Brandwatch.

Here are a couple of pictures from the meetup.

The recording worked much better than the previous time; here's the video of the presentation. We'll meet again shortly for a nice beer. The next technical talk will probably be in January.

Tuesday 24 November 2015

Meetup live hangout

Three days to go until the next Brighton PostgreSQL meetup.
I'll run a live hangout of the talk.

You can join the event there.

The recording will become available on YouTube shortly after the talk ends.

Wednesday 11 November 2015

Streaming replication at the Brighton PostgreSQL meetup

On November 27th at 19.00 GMT I'll be speaking at the Brighton PostgreSQL meetup.

This time the group has chosen streaming replication as the topic.

The talk will cover PostgreSQL's write ahead log and the crash recovery process. The audience will learn how to set up a standby server using streaming replication and how to troubleshoot it.

Please RSVP here.

Thursday 22 October 2015

The infamous life of the Magician

What follows is the synthesis of several years of frustration. Before you start reading, please note that the things written in the dark age section do not apply to high end environments like Oracle. That's mostly because starting an Oracle project without a DBA on board is an interesting and creative way to go bankrupt in a few months. Obviously things evolve, and maybe in the next decade my Oracle fellows will join me in this miserable situation.

A long time ago... 

In 1998, long before becoming a DBA, I spent a few months working as a tourist entertainer. For me it was a life changing job; my nature changed dramatically. I was very shy, afraid of human contact. After that job the real me came out: I discovered the pleasure of interacting with people and the shyness disappeared forever. I remember different aspects of this strange job. I entertained children and adults, I learned the art of balloon twisting, acting on stage and, of course, how to do magic tricks.

From the common person's point of view the magician seems a normal guy capable of doing impossible things. From the magician's point of view... what a laugh. Everything is a cheat. When I first learned the tricks that had fascinated me during my childhood, I couldn't believe that anybody would be fooled by me. Then, with great difficulty, I discovered how a fluid movement combined with voice and facial expression can lead people to lose track and believe the unbelievable.

Often after the show children came to me asking how to learn magic. I was vague; I talked about hard work, special attitude and magic powers secretly shared among magicians. I didn't want to spoil their childhood dreams. What if they discovered that the disappearing foulard was possible because of a cheap fake thumb anybody can buy in a magic shop? The magician knows the secret behind the tricks. He dispenses joy and wonder, but he can't enjoy the magic any more because he knows that, after all, there's no such thing as magic.

Here we go again

Fast forward to 2004. I started another life changing job. I left software development for good and joined the elite society of the Oracle DBAs. For me it was like discovering a whole new universe. A dimension where the data is almost a living entity, and where the DBA can manipulate those data, exposing them to the common people transformed through a fake perception. Again, I started doing magic. This time the knowledge was shared across a restricted group of highly skilled people: the DBA team, the developers with a solid database background and the designers who built and refined the data structures. In particular the design team acted like glue between the DBAs and the developers, who sometimes had different points of view. The magic was shared and everybody worked actively in the same direction, pushing the database and the data access to their maximum possibilities.

In this new world the children were the end users. They saw the magic and believed things were as simple and clear as presented, ignoring the hard work and the carefully designed structures behind their magical interfaces. They didn't ask about the magic, accepting it as a fact of their experience. All worked quite well.

The dark age

Year 2015, the era of big data, full stack developers and data ignorance. These days I often feel like Obi-Wan Kenobi, longing for the old civilised times. The equation is still there: the magician, the extended circle who knows the magic, and the children. It's just completely unbalanced. An example will help you understand.
Let's consider a developer with a great idea. The developer knows there is a technology which can build this idea easily and without hassle. So the project starts from the wrong direction: from the application layer, going down to the logical data layer and finally to the physical layer. This crazy approach is like building a terraced house starting from the roof. But, ``hey, it's just a proof of concept, right? We need something working to show the funders''. Nothing wrong with this, except that when the money comes in, when the ball starts rolling, nobody asks the simple question: ``what happens if we are successful?''. If the project is successful, usually the same people who caused the problem of scale (because it's scale we are talking about) start a brainstorm to find new solutions to mitigate the issue, with short sight. The loop is something like this.

With this approach the fixes last no more than a few months, which incidentally is the event horizon given by the agile methodology. This approach has another bad effect. The skyscraper built on the terraced house's foundation is patched and changed only on the upper floors. The building becomes more unstable every time it is ``fixed''. Sooner or later the brainstorm will consider the possibility of bringing on board somebody who understands the database layer. Compared to ten years ago, doing this kind of job is pretty frustrating. Speaking of data integrity or normalisation is like talking in a foreign language. Even the simplest task of adopting views or functions as a database interface is a game of will, where any idea needs constant and exhausting explanations.
In this environment the magician still does the magic. The difference is that only the magician knows that the magic happens because of his hard work. The extended circle of knowledge that supports the magician is completely missing. All the things happening behind the scenes are hidden and unknown. And the children completely ignore that their perception is just a trick. They truly believe things are working smoothly because of their own capabilities.


It sucks being a magician

I'm a DBA and my job is horrible. Quite strange, isn't it? I feel like an Alcoholics Anonymous member trying to get rid of an addiction rather than an IT professional. But it's true, my job is horrible. Being a DBA requires dedication and hard work at antisocial hours. I bet anybody would love to spend an entire Sunday night working on databases instead of sleeping or clubbing, right? Still, I wouldn't change my job for anything else. Sadly, in the last few years the pleasure of my job has become a burden. Ten years ago the hard work was a known fact across enough people to build a critical mass, a support group against the children's push back. Today the miserable magician is left alone. Nobody understands what he's doing. Often any proposal for improvement is hampered. The others see him as a disruption to the general workflow rather than a resource.


When you do things right...

...people won't be sure you've done anything at all. So said God to Futurama's Bender. That's true. Unfortunately this is a double edged sword. Over the years I've learnt how to do super critical tasks with confidence, generating minimal disruption to production. In fact most of the time nobody will notice anything. And that's quite bad, because in the general perception it will seem that you are a lazy, bad tempered, ugly jerk. And that's all true for me, except for the laziness. With this particular skill a DBA can fix things under the hood with people barely noticing anything happened, except for things running better. There's nothing wrong with this, but maybe a better recognition that things aren't happening by real magic would help. Maybe it's just time to stop doing magic in the background, put people in front of their responsibilities and let the children grow up.

Thursday 10 September 2015

The next meetup - PostgreSQL query planning

As previously announced, the next Brighton PostgreSQL meetup will be on September 25th at 7 pm BST. The topic chosen by the members is query planning and execution in PostgreSQL.
I will give the presentation, exploring the various steps a query passes through from the client to execution. I'll also explain how to read the execution plan and why the executor sometimes seems to ignore the indices put in place to speed up operations.

The meetup is kindly hosted by Brandwatch.

Places are limited, so if you want to join us secure your seat here.

Monday 17 August 2015

The sea elephants

On Friday 14th August we kicked off the Brighton PostgreSQL Meetup.
We had a nice evening with cool people, all together discussing PostgreSQL and how we can run the meetup effectively.

We decided to have a regular monthly meetup hosted around Brighton, possibly by companies or at any suitable venue.

The next meetup will be on the 25th of September and this time there will be some PostgreSQL talks. The general interest favours standby servers and streaming replication. Another topic could be query planning and execution.
I can cover both topics. We just need to decide how many talks fit in a one hour session.

After the meetup we’ll head to the nearest pub for a nice beer.

I can only say a big thank you to all the meetup members.

That’s all for now.  

Wednesday 12 August 2015

Meet Marvin

There is just one day left until we start the Brighton PostgreSQL Meetup. I invested some resources in this project and I truly believe it can be a success.
I still can't believe that in just one month 25 people have already shown interest in being part of the Brighton PostgreSQL Group. And today another nice surprise: I received the new shiny mascot for our group.

He's Marvin, the sea elephant. The author of this fantastic drawing is Elena, an incredibly talented artist from Italy.

I've also opened a twitter account for our social media presence.

So we are ready to start. I'm very excited and I'll write a detailed report of our first meetup in a few days.

Thursday 6 August 2015

Directory format and default compression

After upgrading some clusters to PostgreSQL 9.4.4 I noticed an increase in the database backup size. Because the databases are quite large I'm taking advantage of the parallel export introduced with PostgreSQL 9.3.

The parallel dump uses PostgreSQL's snapshot export with multiple backends. The functionality requires the dump to be in directory format, where a toc file is saved alongside the compressed exports, one per table saved by pg_dump.

Initially I believed that the latest release had introduced a change in the default compression level. However I noticed that using the custom format the resulting file was smaller.

For testing purposes I created a database with two tables of 700 MB each. Using the custom format with the default compression level they are saved in 442 MB. The directory format stores the same data in 461 MB, +4%.

After some research I spotted a recent change introduced in version 9.4.2:
In pg_dump, fix failure to honor -Z compression level option together with -Fd

With another search, this time on the committers mailing list, I found this commit.

Basically the change in 9.4.2 "ignored the fact that gzopen() will treat "-1" in the mode argument as an invalid character", causing the compression level to be set to 1, the minimum.

The fix is already committed and should appear in the next minor release. Until then, when using the directory format, the workaround is to pass the flag -Z 6 to pg_dump.
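As a sketch of the workaround (database name, job count and target path are just examples):

```shell
# Force the usual default compression level (6) explicitly when
# dumping in directory format with parallel jobs.
pg_dump -Fd -Z 6 -j 4 -f /var/backups/mydb.dir mydb
```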

The bug has been backpatched to version 9.1. However, I never noticed this issue until I switched from 9.3.9 to 9.4.4.

Friday 31 July 2015

Nuts and bolts part 4 - naming conventions and editors

Named identifiers

Writing and debugging SQL is not simple, and a little bit of clarity is a massive help. Adding a prefix to the identifiers gives the SQL developer great knowledge of the database schema.

This naming convention makes clear the difference between tables, which are physical entities, and views, which are names for saved SQL statements.

Adopting a similar approach for the column names makes the data type immediately recognisable.
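One possible convention along these lines (the prefixes here are illustrative, not prescribed by the text; tabs are rendered as spaces):

```sql
-- t_ marks a physical table, v_ a view; the column prefixes
-- (i_ integer, v_ varchar, d_ date) make the data type visible at a glance.
CREATE TABLE t_users
        (
                i_id_user       serial,
                v_username      character varying (64),
                d_created       date
        )
;

CREATE VIEW v_usernames
        AS
        (
                SELECT
                        v_username
                FROM
                        t_users
        )
;
```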

Both naming schemas prevent the risk of having reserved keywords used for the identifier's name.

The drawback appears if the object type changes. For example, if a table becomes a view the name should reflect the change. In an ideal world this shouldn't happen. In real life a solid schema design greatly reduces this kind of issue. However, changing the data type, in particular when dealing with big tables, is more complicated than renaming a field and, after all, that's a DBA problem.

The editor

Unlike many commercial RDBMS, PostgreSQL ships only with the command line client psql. There is a good number of third party clients with good support for the database features and a good connectivity layer. An exhaustive list of those clients can be found on the PostgreSQL wiki. It is difficult to say which editor is the best. When I started learning PostgreSQL the only tools available were PgAdmin 2 and phpPgAdmin. I settled on the former and then switched to the newer PgAdmin 3, which added a powerful editor and multi platform support. Over the years I tested some of the other clients, like TOra, SQL Workbench and SQL Maestro, and I never found the same confidence and ease of use as in PgAdmin 3. Whatever your tool of choice, it should have the following features.

libpq connector

One of the reasons I do not like SQL Workbench is the JDBC connector. Writing and testing SQL code is a quick cycle: write the statement, test it, then change it, then test it again, and so on. The client's responsiveness in this cycle is absolutely important. The libpq connector has virtually no lag beyond the disk/network bandwidth.

Removal of the Byte Order Mark (BOM)

The BOM is a marker set in UTF8 files to determine the byte order. The psql client manages its presence, but for some reason it can fail with an error when running sql scripts containing this marker. In PgAdmin 3 it is possible to disable the BOM when saving files.

Monday 27 July 2015

Brighton PostgreSQL Meetup

After a while I finally started a PostgreSQL Meetup in Brighton.
I reckon there are several people interested in PostgreSQL in the UK and I've met many of them in the Sussex area.

I'm quite new to the Meetup platform so I still need to learn how to use it correctly. However, anybody is absolutely welcome to help organise a PostgreSQL community in the area.

The link is here.
I've already scheduled a PostgreSQL beer, an informal chit chat about our database of choice and the future of the community.

Save the date,
Friday, August 14, 2015 6:30 PM.


The location is the Brunswick in Hove. Big beer garden, no music outside and nice ales.


Wednesday 22 July 2015

Nuts and Bolts - part 3


The INSERT statement is composed of two elements: the attribute list and the values list. The lists are positional. The formatting should clearly mark the various elements of the query in order to make it simpler to spot the correspondence between attribute and value.

  • After the words INSERT INTO there is one space followed by the table's name
  • After the table's name there is one carriage return
  • The attributes, if present, are enclosed in round brackets indented one tab from the word INSERT
  • The attributes indent one tab from the round brackets and are separated by a carriage return
  • The word VALUES indents at the same level of INSERT and is followed by a carriage return
  • The values list is surrounded by round brackets indented one tab from the word VALUES
  • The values indent one tab from the round brackets and are separated by a carriage return
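The rules above might produce something like this (table and column names are invented; tabs are rendered as spaces):

```sql
INSERT INTO t_users
        (
                username,
                enabled,
                created_on
        )
VALUES
        (
                'alice',
                TRUE,
                now()
        )
;
```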


The WHERE condition of the UPDATE is the same as in the SELECT; see 1.1.1 for more information. Just after the table's name comes the keyword SET and the list of updates. The formatting should reflect this particular structure.
  • After the table's name there is one carriage return
  • The word SET indents one tab from the word UPDATE
  • The update's list indents one tab from the word SET
  • Each update is separated by a carriage return 
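A sketch following these rules (names are invented; tabs rendered as spaces):

```sql
UPDATE t_users
        SET
                enabled = FALSE,
                failed_logins = 0
WHERE
        id_user = 10
;
```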


The DELETE is the simplest of the DML statements, as it lists just the target relation and the optional WHERE condition. The formatting is quite simple, as the WHERE condition is the only part which requires the structure seen in the SELECT.
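For example (an invented table; tabs rendered as spaces):

```sql
DELETE FROM t_users
WHERE
        enabled = FALSE
;
```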

Data Definition Language

The data definition language groups all the commands defining, altering or dropping the database's relations. Each command type follows a different syntax structure.


The CREATE statement is composed of the verb CREATE, followed by the object type (e.g. TABLE) and the new relation's name, followed by a carriage return. The rest of the statement indents one tab from the verb CREATE.
The rules for the CREATE TABLE are the following.
  • The fields are indented one tab stop from the surrounding brackets
  • Between the field's name and the type there is one or more tab stops
  • The number of tabs is determined by the field with the longest name
  • Between the field's type and the optional DEFAULT there is one space
  • Between the optional DEFAULT and the optional field's constraint there is one space
  • If the DEFAULT is missing, between the field's type and the field's constraint there is one space
  • Table constraints will require the constraint name followed by one carriage return
  • The constraint type indents one tab from the word CONSTRAINT
  • Any clause indents one tab from the constraint type
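A CREATE TABLE sketch following the rules above (the table and its columns are invented; tabs rendered as spaces):

```sql
CREATE TABLE t_users
        (
                id_user         serial,
                username        character varying (64) NOT NULL,
                enabled         boolean DEFAULT TRUE NOT NULL,
                created_on      date DEFAULT now()::date,
                CONSTRAINT pk_t_users
                        PRIMARY KEY
                                (id_user)
        )
;
```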

The rules for the CREATE VIEW are the following.
  • After the view name there is one carriage return
  • The word AS indents one tab from the verb CREATE
  • The round brackets surrounding the view definition indent at the same level as the word AS
  • The view's SELECT follows the same rules defined in the select.
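A CREATE VIEW sketch along those lines (names invented; tabs rendered as spaces):

```sql
CREATE VIEW v_active_users
        AS
        (
                SELECT
                        id_user,
                        username
                FROM
                        t_users
                WHERE
                        enabled = TRUE
        )
;
```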


The ALTER statement's only rule is very simple: all the alter actions indent by one tab from the ALTER verb and are separated by a carriage return.
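For example (an invented table; tabs rendered as spaces):

```sql
ALTER TABLE t_users
        ADD COLUMN last_login timestamp,
        ALTER COLUMN username SET NOT NULL
;
```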


The DROP statement's formatting is very simple: everything is on a single line.

Saturday 4 July 2015

Nuts and Bolts - part 2

WITH statements

Because the WITH statement works like a temporary table, the formatting follows rules similar to the CREATE TABLE statement.

• Between the WITH and the alias there is a tab followed by the word AS and a carriage return
• The round brackets surrounding the inner query indent one tab from the alias
• The inner query indents one tab from the round brackets
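Putting the three rules together might look like this (names invented; tabs rendered as spaces):

```sql
WITH    t_recent AS
        (
                SELECT
                        id_user,
                        username
                FROM
                        t_users
                WHERE
                        created_on >= '2015-01-01'
        )
SELECT
        username
FROM
        t_recent
;
```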


This coding style is at first sight complex, with some drawbacks, as it can slow down the coding process because of the extra keystrokes required for the uppercase and the carriage returns. However, once the writing automatism is acquired, coding becomes very fast.

This coding style builds the query using a top down structure where the different parts of the query are indented to reflect their level inside the query's logic. This increases the query's readability, in particular when dealing with complex queries.

The vertical block structure gives great control over the query's sub sections. With query editors supporting partial code execution (e.g. pgadmin3's query tool) it is possible to highlight the query's single sections and execute them immediately.

I developed this style from the induction I received in my first DBA job. My employer enforced among the developers and DBAs a similar set of rules, which worked perfectly, improving the productivity and the code's consistency across the company.

Incidentally, this matches almost perfectly the way PostgreSQL processes a select. What follows is an excerpt of the PostgreSQL manual describing how the select is processed.

SELECT retrieves rows from zero or more tables. 
The general processing of SELECT is as follows:

1. All queries in the WITH list are computed.
These effectively serve as temporary tables that can be referenced in the FROM list.
A WITH query that is referenced more than once in FROM is computed only once.

2. All elements in the FROM list are computed.
(Each element in the FROM list is a real or virtual table.)
If more than one element is specified in the FROM list, they are cross-joined together.

3. If the WHERE clause is specified, all rows that do not satisfy the condition
are eliminated from the output.

4. If the GROUP BY clause is specified, or if there are aggregate function calls,
the output is combined into groups of rows that match on one or more values,
and the results of aggregate functions are computed.
If the HAVING clause is present, it eliminates groups that do not satisfy the
given condition.

5. The actual output rows are computed using the SELECT output expressions
for each selected row or row group.

6. SELECT DISTINCT eliminates duplicate rows from the result.
SELECT DISTINCT ON eliminates rows that match on all the specified expressions.
SELECT ALL (the default) will return all candidate rows, including duplicates.

7. Using the operators UNION, INTERSECT, and EXCEPT, the output of more than one
SELECT statement can be combined to form a single result set.
The UNION operator returns all rows that are in one or both of the result sets.
The INTERSECT operator returns all rows that are strictly in both result sets.
The EXCEPT operator returns the rows that are in the first result set but not in the second.
In all three cases, duplicate rows are eliminated unless ALL is specified.
The noise word DISTINCT can be added to explicitly specify eliminating duplicate rows.
Notice that DISTINCT is the default behavior here, even though ALL is the default for SELECT itself.

8. If the ORDER BY clause is specified, the returned rows are sorted in the specified order.
If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce.

9. If the LIMIT (or FETCH FIRST) or OFFSET clause is specified, the SELECT statement only returns
a subset of the result rows.

10. If FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE or FOR KEY SHARE is specified,
the SELECT statement locks the selected rows against concurrent updates.


Thursday 2 July 2015

Nuts and bolts - part 1

Before starting with the book's topic I want to explain how to set up an efficient environment and some good practices which can improve the code's readability and quality. As somebody will notice, these methods are completely opposite to the general code style trends. I'll try to give the motivation for each rule. Anyway, in general, because the SQL universe is a strange place, it requires a strange approach. In order to write and read SQL effectively the coder should gain the capability to build a mental map between the query's sections and the underlying logic. This can be facilitated by a clear and well defined formatting.

Code formatting

The SQL language is based on statements. Each statement is terminated by a specific character, by default ``;''. In order to understand the logic of the SQL statements it is very important to have the statements formatted in the same way the database executes them. We'll look first at how to format the SELECT, then the DML and finally the DDL. Each of those queries needs a different structure because of their different logic. The common formatting rules are the following.
  • There is a tab separator of 7 spaces
  • All the keywords are in upper case
  • After a round bracket there is a carriage return and one tab of indentation
  • The opening and closing round brackets are indented at the same level 


When processing a query the parser works backward: it starts from the innermost and lowest complete statements and moves upward to get the entire picture. There is one remarkable exception: the WITH statements are computed first, because they are required by the rest of the query. Actually the WITH acts like a temporary table. In order to see immediately the way PostgreSQL processes the query, the indentation should follow the same logic.
Basic formatting rules.
  • After the word SELECT there is a carriage return
  • The columns are separated by a comma and a carriage return
  • The columns indent one tab from the respective SELECT
  • The word FROM indents at the same level of SELECT
  • The relation indents one tab from the word FROM
  • The word WHERE indents at the same level of SELECT and FROM
  • A single condition indents one tab from the WHERE
  • Multiple conditions indent two tabs from the WHERE and the logical operators indent one tab from the WHERE
  • Between the logical operators and the condition there is one tab
  • GROUP BY and ORDER BY follow the same rules as the select list
  • The queries joined by UNION, INTERSECT, and EXCEPT indent one tab
  • The terminator character indents at the same level of SELECT, FROM and WHERE
Let's format a simple select from one relation with one or more conditions in the where section.

The first thing to do is to change all the keywords in uppercase.

We'll then format the select list.

Finally  the from and the where.

Having multiple conditions requires an extra tab in order to keep the logical operators and the conditions aligned.
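The end result of these steps might look like this (an invented table; tabs rendered as spaces):

```sql
SELECT
        id_user,
        username,
        created_on
FROM
        t_users
WHERE
                enabled = TRUE
        AND     created_on >= '2015-01-01'
;
```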

  • If using the implicit joins the relations are separated by a comma and a carriage return, and indented at the same level
  • When using explicit joins specify the kind of join [INNER,LEFT OUTER,CROSS etc.]
  • After the left hand relation there is a carriage return
  • The join clause indents at the same level of the left hand relation
  • After the words [INNER,LEFT OUTER,CROSS] JOIN follows the relation's name and a carriage return
  • The word ON indents one tab from the JOIN and there is a carriage return
  • The join conditions follow the same rules of the WHERE condition
  • Nested joins indent one tab from the upper join
  • The usage of aliases, with the optional word AS, is strongly encouraged 
Query with implicit join 
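A sketch of the implicit join form (table and column names invented; tabs rendered as spaces):

```sql
SELECT
        usr.username,
        prf.avatar
FROM
        t_users usr,
        t_profiles prf
WHERE
        usr.id_user = prf.id_user
;
```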

Query with explicit inner join
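The same query with an explicit join, following the rules above (names invented; tabs rendered as spaces):

```sql
SELECT
        usr.username,
        prf.avatar
FROM
        t_users usr
        INNER JOIN t_profiles prf
                ON
                        usr.id_user = prf.id_user
;
```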

  • All the subqueries will indent one tab from the surrounding brackets
  • If the subquery is short it can stay in line with the round brackets (e.g. SELECT now())
Query with subselect in the SELECT's list
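A sketch of a subselect in the SELECT list (names invented; tabs rendered as spaces):

```sql
SELECT
        usr.username,
        (
                SELECT
                        count(*)
                FROM
                        t_logins lgn
                WHERE
                        lgn.id_user = usr.id_user
        ) AS login_count
FROM
        t_users usr
;
```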

Query with subselect in the FROM's list
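A sketch of a subselect in the FROM list (names invented; tabs rendered as spaces):

```sql
SELECT
        act.username
FROM
        (
                SELECT
                        username
                FROM
                        t_users
                WHERE
                        enabled = TRUE
        ) act
;
```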

Tuesday 16 June 2015

PostgreSQL 8.0, the 2q memory manager

With the revolutionary PostgreSQL 8.0 a different memory manager was introduced: the two queues (2q).

This algorithm uses three lists of buffer page pointers called cache directory blocks (CDB). The lists T1 and T2 actually point to buffered pages. B1 lists pages that have recently been in the shared buffer.

Figure 3.1: PostgreSQL 8.0, CDB lists
The list T1 is used as an LRU list for the pages loaded from disk. The list T2 is used as an LRU list for pages already cached and promoted from the list T1. The list B1 is an LRU list of pages recently evicted from the shared buffer.
When a new page is read from disk it is put initially at the beginning of T1. All the CDBs in T1 are shifted and the last element in T1 is evicted from the shared buffer; its CDB is put at the top of B1.

Figure 3.2: 2q, read from disk
When a backend requests a page present in T1, its CDB is removed from T1 and placed at the top of T2. The last element in T2 is evicted from the shared buffer and its CDB is put at the top of B1.

Figure 3.3: 2q, page in T1
When a backend requests a page in T2, its CDB is moved to the top of the list.

Figure 3.4: 2q, page in T2
Finally, when a page is read from disk but its CDB is present in B1, the page is put at the top of T2.

Figure 3.5: 2q, page in B1
The list T1 is the LRU list, where the pages requested only once are quickly evicted from the shared buffer. The list T2 is an MRU list, where the pages are moved back to the top when required, and where the pages from T1 are stored if requested again before their eviction. The list B1 is used as an extra cache to track pages recently evicted from the shared buffer that should go back into T2 if requested again.
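The list juggling described above can be sketched as a toy model in Python (the sizes, labels and structure are invented for illustration; the real implementation is C code inside the buffer manager):

```python
from collections import OrderedDict

class TwoQ:
    """Toy 2q cache: T1 holds pages read once, T2 holds re-referenced
    pages, B1 is a ghost list remembering recently evicted page ids."""

    def __init__(self, t1_size, t2_size, b1_size):
        self.t1 = OrderedDict()  # LRU of pages freshly read from disk
        self.t2 = OrderedDict()  # pages referenced more than once
        self.b1 = OrderedDict()  # ghost list: ids of evicted pages
        self.t1_size, self.t2_size, self.b1_size = t1_size, t2_size, b1_size

    def _evict(self, lst, size):
        # Drop the tail (the LRU victim) and remember it in the ghost list.
        while len(lst) > size:
            page, _ = lst.popitem(last=True)
            self.b1[page] = None
            while len(self.b1) > self.b1_size:
                self.b1.popitem(last=True)

    def read(self, page):
        if page in self.t2:                      # hot page: back to T2's top
            self.t2.move_to_end(page, last=False)
            return "hit-t2"
        if page in self.t1:                      # referenced again: promote
            del self.t1[page]
            self.t2[page] = None
            self.t2.move_to_end(page, last=False)
            self._evict(self.t2, self.t2_size)
            return "hit-t1"
        if page in self.b1:                      # recently evicted: to T2
            del self.b1[page]
            self.t2[page] = None
            self.t2.move_to_end(page, last=False)
            self._evict(self.t2, self.t2_size)
            return "miss-b1"
        self.t1[page] = None                     # cold read: top of T1
        self.t1.move_to_end(page, last=False)
        self._evict(self.t1, self.t1_size)
        return "miss"

if __name__ == "__main__":
    cache = TwoQ(t1_size=2, t2_size=2, b1_size=4)
    print(cache.read(1))  # miss: read from disk into T1
    print(cache.read(1))  # hit-t1: promoted to T2
    print(cache.read(1))  # hit-t2: moved back to T2's top
```

Note how a page needs a second reference before it earns a place in T2; a one-off sequential scan only churns through T1.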

Despite the interesting approach, this tricky process did not show good performance in production. Things changed when the clock sweep was implemented in PostgreSQL 8.1.

The ARC

The 2q algorithm was an emergency implementation caused by a software patent held on the algorithm initially chosen for PostgreSQL 8.0's memory manager. Old books, printed while 8.0 was in development, wrongly list the ARC as the memory manager. The ARC is an efficient algorithm which keeps in memory two lists of page buffers, one for the LRU and one for the MRU. Each list adapts dynamically to the different requests of the cluster's activity.

Thursday 21 May 2015

PostgreSQL Talk

I'm not dead yet. Just hard days.
However, I'll be talking at the Skiff in sunny Brighton about PostgreSQL: the big, the fast and the (NOSQL on) ACID.

If you are around, the talk kicks off at 19.00 BST.
This talk is part of the SANE meetings happening on a monthly basis.

More information about the Skiff there. Here are the slides.

Thursday 12 February 2015

PostgreSQL 7.4, the LRU list

In PostgreSQL 7.4 the free space in the shared buffer was managed using a simple least recently used list. When a page is first requested from disk the buffer manager puts it into the first buffer of the free LRU list. If the first buffer is used by another page the list is shifted by one unit and the eventual last page in the list is discarded. If a page in the list is requested again it is put back into the first buffer, and so on. This simple approach worked quite well, with some unpleasant effects. After all, the buffer is shared and it is not uncommon to have more than one database on the same cluster, each database with a different design and purpose. This worked quite badly with the LRU list, causing unnecessary disk IO, because the buffers were evicted without considering their usage frequency. This leads us to the tale of the buffer.

The tale of the buffer

I read this interesting description years ago on Bruce Momjian's website; it explains very well how a minimal change can dramatically degrade the database's performance.

Our tale starts with a PostgreSQL 7.4 with 1000 buffers allocated in shared memory. The page size is the default, 8192 bytes. We have in total 8192 * 1000 = 8192000 bytes. In the data area we have a fully packed table without indices, which consumes 1000 pages on disk. If we run a simple select from the table, PostgreSQL will execute a sequential scan of the table in order to get the buffers into memory. The sequential scan is a top to bottom operation, but the shared buffer stores the buffers in reverse order. After the initial read, which will take some time because of the disk IO, the table fits completely into the shared buffer with its pages in reverse order. A second select will start another sequential scan. The buffer manager will search for the table's pages in the shared buffer and will find the first table page in the last slot of the LRU list. This moves the page from the last slot to the first. The other buffers all shift by one slot and the second table page moves into the last LRU slot. The buffer manager will find the second table page in memory and will move it into the first LRU slot, causing the buffers to shift again. The final result is a faster read, because the table is read entirely from memory.

Let's add a new row to our table. Because the table is fully packed this causes a new page to be added, and the page count becomes 1001. A select with the shared buffer empty takes the same time as before, with one notable exception. When page 1001 is read from disk the first table's page is evicted from the shared buffer to make room. The second select will search for the first table's page and will have to read it from disk, evicting the second page from the shared buffer, which in turn is read from disk a moment later, causing the third page's eviction, and so on. One single extra page spoils the entire cache. This is a corner case of course, but it tells us an important lesson about the nature of databases. They are discrete systems and most of their behaviour is threshold driven. Usually there is no soft transition between good and bad performance.
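The thrashing described above is easy to reproduce with a toy model. The following sketch (plain Python, not PostgreSQL source code) simulates a strict LRU list: a scan of exactly as many pages as there are slots is fully cached the second time, while a single extra page forces every page back to disk on every scan.

```python
from collections import deque

def sequential_scan(cache, cache_size, n_pages):
    """Scan pages 1..n_pages through an LRU list with cache_size slots.
    Returns the number of pages that had to be read from disk."""
    disk_reads = 0
    for page in range(1, n_pages + 1):
        if page in cache:
            cache.remove(page)          # cache hit: move the page to the front
        else:
            disk_reads += 1             # cache miss: fetch the page from disk
            if len(cache) == cache_size:
                cache.pop()             # evict the least recently used page
        cache.appendleft(page)
    return disk_reads

# 1000 buffer slots, 1000 page table: the second scan is fully cached
cache = deque()
print(sequential_scan(cache, 1000, 1000))  # 1000 reads (cold cache)
print(sequential_scan(cache, 1000, 1000))  # 0 reads (all in memory)

# one extra page: each page is evicted just before it is needed again
cache = deque()
print(sequential_scan(cache, 1000, 1001))  # 1001 reads
print(sequential_scan(cache, 1000, 1001))  # 1001 reads again: total cache spoil
```

The second run with 1001 pages never hits the cache at all, which is exactly the cliff-edge behaviour the tale describes.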

Thursday 5 February 2015

PostgreSQL 9.4.1, 9.3.6, 9.2.10, 9.1.15 & 9.0.19 Released

The PostgreSQL Global Development Group has released an important update with fixes for multiple security issues to all supported versions of the PostgreSQL database system, which includes minor versions 9.4.1, 9.3.6, 9.2.10, 9.1.15, and 9.0.19. This update includes both security fixes and fixes for issues discovered since the last release. In particular for the 9.4 update, there is a change to the way unicode strings are escaped for the JSON and JSONB data types.
All users should update their PostgreSQL installation at the next opportunity.

For the full news click here

Wednesday 4 February 2015

The memory

The PostgreSQL memory at first sight looks simple. Compared with the complex structures implemented in other DBMS, it could seem rudimentary to a careless reader. However, the memory, and in particular the shared buffer's implementation, is complex and sophisticated. This chapter will dig deep into PostgreSQL's memory.

3.1 The shared buffer

The shared buffer is a segment allocated at the cluster's startup. Its size is determined by the GUC parameter shared_buffers and can be changed only by restarting the cluster. The shared buffer is used to manage the data pages seen in 2.6, which are called buffers when loaded in memory. Having a RAM segment is not uncommon in the database universe. It is a rapid exchange area where the buffers are kept across the sessions, keeping the data near the CPU. Like its competitors, PostgreSQL strives to keep in memory what is important rather than everything. In the era of the in memory databases this could seem an obsolete concept. Unfortunately the truth is that the resources, and so the money, are not infinite. Before digging into the technical details we'll have a look at the history of the memory manager: the way PostgreSQL sticks an elephant into a small car without Time Lord technology.

Back in the old days PostgreSQL was quite rudimentary. Version 7.4 did not have tablespaces, it lacked a mechanism to prevent the XID wraparound failure (except the routine vacuuming of course) and the memory manager was a simple LRU buffer. Since then the algorithm evolved into a simple but sophisticated system to cache the buffers effectively. We'll look briefly at the memory manager's history.

Thursday 29 January 2015

The cluster in action - part 2

2.4 The background writer

Before the spread checkpoints the only way to ease the IO spike caused by the checkpoint was to tweak the background writer. This process was introduced with the revolutionary PostgreSQL 8.0. The writer, as the name suggests, works in the background searching for dirty buffers to write to the data files. The writer works in rounds. When the process awakes it scans the shared buffer for dirty buffers. When the amount of buffers cleaned reaches the value set in bgwriter_lru_maxpages the process sleeps for the time set in bgwriter_delay.
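The round-based behaviour can be sketched with a toy model (an illustration, not the PostgreSQL source): each round writes at most bgwriter_lru_maxpages dirty buffers, then the process sleeps for bgwriter_delay. The default values assumed here are 100 pages and 200 ms.

```python
def rounds_to_clean(dirty_buffers, bgwriter_lru_maxpages=100):
    """Toy model: number of background writer rounds needed to clean
    the given backlog, writing at most bgwriter_lru_maxpages per round."""
    rounds = 0
    while dirty_buffers > 0:
        dirty_buffers -= min(dirty_buffers, bgwriter_lru_maxpages)
        rounds += 1  # each round is followed by a bgwriter_delay sleep
    return rounds

# 1050 dirty buffers with the defaults take 11 rounds, i.e. roughly
# 11 * 200 ms of background activity before the backlog is drained.
print(rounds_to_clean(1050))  # 11
```

Raising bgwriter_lru_maxpages makes each round more aggressive; lowering bgwriter_delay makes the rounds more frequent.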

2.5 The autovacuum

Routine vacuuming is an important task to prevent table bloat and the dreaded XID wraparound failure. If enabled, the autovacuum launcher starts one daemon for each relation with enough dead tuples to trigger the conditions set in autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor. An autovacuum daemon is a normal backend and appears in the view pg_stat_activity. Because the XID wraparound failure is a really serious problem, the autovacuum to prevent wraparound starts even if autovacuum is turned off.
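The trigger condition combines the two parameters: a table is processed when its dead tuples exceed autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times the estimated row count. A quick worked example with the default values (50 and 0.2):

```python
def autovacuum_trigger(reltuples,
                       autovacuum_vacuum_threshold=50,
                       autovacuum_vacuum_scale_factor=0.2):
    """Dead tuples needed before autovacuum processes a table."""
    return autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples

# With the defaults a 1,000,000 row table is vacuumed only after
# 50 + 0.2 * 1000000 = 200050 dead tuples have accumulated.
print(autovacuum_trigger(1_000_000))  # 200050.0
```

This also shows why big tables often need a per-table scale factor: with the defaults, the bigger the table, the longer the dead tuples pile up before autovacuum kicks in.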

2.6 The backends

The PostgreSQL backend architecture is a brilliant solution to a nasty problem: how to guarantee the buffers are accessed by one session at a time while avoiding the bottleneck of a long waiting queue. When a backend needs to access a particular tuple, either for read or write, the relation's pages are accessed to find the tuples matching the search criteria. When a buffer is accessed the backend sets a pin on it, which makes the other backends requiring the same page wait. As soon as the tuple is found and processed the pin is removed. If the tuple is modified the MVCC enforces the tuple's visibility to the other backends. The process is fine grained and very efficient. Even with a high concurrency rate on the same buffers it is very difficult to get the backends entangled.
A backend process is a fork of the main postgres process. It's very important to understand that the backend is not the connection, but a server process which interacts with the connection. Usually the backend terminates when the connection disconnects. However, if a client disconnects ungracefully while a query is running, without signalling the backend, the query will run to completion only to find there's nothing listening on the other side. This is bad for many reasons. First, it consumes a connection slot for nothing. Also, the cluster is doing something useless, burning CPU cycles and memory.

Like everything in PostgreSQL, the backend architecture is oriented to protect the data and in particular the volatile shared buffer. If for some reason one of the backend processes crashes, the postgres process terminates all the backends in order to prevent potential shared buffer corruption. The clients should be able to handle this exception by resetting the connection.

2.7 Wrap up

The cluster's background activity remains unnoticed most of the time. The users and developers can mostly ignore this aspect of the PostgreSQL architecture, leaving the difficult business of understanding the database's heartbeat to the DBA, who should have the final word on any potential mistake in the design specs. The next chapters will explore PostgreSQL's architecture in detail, starting with the memory.

Monday 26 January 2015

The cluster in action - part 1

The cluster in action

PostgreSQL delivers its services ensuring the ACID rules are enforced at any time. This chapter gives an outlook of a ``day in the life'' of a PostgreSQL cluster. The chapter's approach is purposely generic. At this stage it is more important to understand the global picture than the technical details.

After the startup

When the cluster completes the startup procedure it starts accepting connections. When a connection is successful, the postgres main process forks into a new backend process which is assigned to the connection for the connection's lifetime. The fork is quite expensive and does not cope very well with a high rate of connection requests. The maximum number of connections is set at startup and cannot be changed dynamically. Whether the connection is used or not, each connection slot consumes about 400 bytes of shared memory.
Alongside the clients' requests the cluster has several subprocesses working in the background.

The write ahead log

The data pages are stored in the shared buffer both for reads and writes. A mechanism called pinning ensures that only one backend at a time is accessing the requested page. If the backend modifies the page then the page becomes dirty. A dirty page is not yet written to its data file. However, the page's change is first saved to the write ahead log as a WAL record, and the transactions' commit status is then stored in the directory pg_clog or the directory pg_serial, depending on the transaction isolation level. The WAL records are stored in a shared buffer area, sized by the parameter wal_buffers, before being flushed to disk into the pg_xlog directory as fixed length segments. When a WAL segment is full a new one is created or recycled. When this happens there is an xlog switch. The writes on the WAL are managed by a background process called the WAL writer. This process was first introduced with PostgreSQL 8.3.
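The ordering constraint described above, the write-ahead rule, can be sketched with a toy model (plain Python, not PostgreSQL code): a change first lands in the WAL buffers and is flushed to the log at commit; only afterwards may the dirty data page be written to its data file.

```python
# Toy model of the write-ahead rule: in-memory WAL buffers, the WAL on
# disk, and the set of dirty pages in the shared buffer.
wal_buffers, wal_on_disk, dirty_pages = [], [], set()

def modify(page, change):
    wal_buffers.append((page, change))   # the WAL record is written first
    dirty_pages.add(page)                # the page is now dirty in memory

def commit():
    wal_on_disk.extend(wal_buffers)      # flush the WAL buffers to pg_xlog
    wal_buffers.clear()

def write_page(page):
    # the write-ahead rule: the page's WAL must already be on disk
    assert all(p != page for p, _ in wal_buffers), "WAL not flushed yet"
    dirty_pages.discard(page)            # now the data file may be updated

modify(42, "UPDATE ...")
commit()            # without this, write_page(42) would fail the assertion
write_page(42)
print(dirty_pages)  # set()
```

Because the log always reaches disk before the data, after a crash the WAL alone is enough to replay any change whose data page was never written.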

The checkpoint

The cluster, on a regular basis, executes an important activity called the checkpoint. The frequency of this action is governed by time and space, measured respectively in seconds and in log switches between two checkpoints. The checkpoint scans the shared buffer and writes down to the data files all the dirty pages. When the checkpoint is complete the process determines the checkpoint location and writes this information to the control file stored in the cluster's pg_global tablespace. In the case of an unclean shutdown this value is used to determine the WAL segment from which to start the crash recovery.
Before version 8.3 the checkpoint was a potential bottleneck because of the unavoidable IO spike generated during the writes. That's the reason why version 8.3 introduced the concept of spread checkpoints. The cluster aims at a particular completion target time, measured as a fraction of the checkpoint timeout. The default values are respectively 0.5 and 5 minutes. This way the checkpoint writes are spread over a target time of 2.5 minutes. From PostgreSQL 9.2 a new checkpointer process has been created to manage the checkpoint efficiently.
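The spread target is a simple product of the two settings. A worked example with the default values mentioned above:

```python
def checkpoint_spread(checkpoint_timeout_s=300, completion_target=0.5):
    """Target duration, in seconds, over which the checkpoint writes
    are spread: checkpoint_timeout * checkpoint_completion_target."""
    return checkpoint_timeout_s * completion_target

# Defaults: 5 minutes * 0.5 = 150 seconds, i.e. 2.5 minutes.
print(checkpoint_spread())          # 150.0
# Raising the target to 0.9 spreads the same writes over 4.5 minutes.
print(checkpoint_spread(300, 0.9))  # 270.0
```

The flatter the spread, the smaller the IO spike, at the cost of a longer crash recovery window.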

Friday 23 January 2015

RTFM
The acronym RTFM stands for Read The Fucking Manual. It's quite obvious that reading the manual is the perfect approach for finding the correct information or at least a hint for finding the solution.

False confidence or a lack of humbleness makes people forget how important it is to read the documents. Understanding the documents is the next step, and it is not simple indeed, in particular if the background is poor or the reader has preconceptions which alter the words' meanings.

On the same message board mentioned in 1.4 a user asked some details about storing binary data. Another "expert" explained how PostgreSQL has two ways of storing binary data, the bytea and the OID, whose maximum size limits were respectively one and two GB. I replied with a simple question: how is it possible to store two GB in a four byte integer without strange voodoo rituals?

After an unpleasant argument the "expert" finally admitted he did not understand the mechanism used by PostgreSQL to work around the TOASTed datum's size limit. If you are curious about that, well, RTFM!

Tuesday 20 January 2015

Failure is not an option

Failure is not an option. Despite being quite pretentious, this statement is also the rule number zero of any decent DBA. A task failure, should this be a simple alter table or an emergency restore, is not acceptable. The database is the core of any application and therefore the most important element of the infrastructure.

In order to achieve this impossible level, any task should be considered single shot. Everything must run perfectly at the first run, as if there were no rollback plan. However, the rollback procedure must be prepared and tested alongside the main task on the test environment, in order to get a smooth transition should the rollback happen. It's also very important to keep a checklist. This allows the DBA to catch the exact point of any possible failure, ensuring a rapid fix when it happens.

Having a plan B gives multiple approaches for the task if something goes wrong. For example, when dealing with normal size databases* it is possible to implement many disaster recovery strategies with similar recovery times. When the amount of data becomes important this is no longer true. For example, a logical restore takes more time than a point in time recovery or a standby failover. In this case plan A is the physical restore. If this does not work then the logical recovery should be used instead.

 * I consider normal sized any database under 500 GB

Saturday 10 January 2015

DBA in a nutshell

I've already started writing the second volume of the database administration series.
The first volume was quite basic, with enough information to be useful but not so much as to scare people.

The second volume is strongly focused on the database administration and I decided to start it with a chapter explaining what it means to be a DBA.

Probably this will dissuade many people from starting this exciting career.

So, here we go again.

A database administrator is a strange combination of theory and practice, a mix of strictness and loose rules. It's quite difficult to explain what exactly a DBA does. Working with databases requires passion, knowledge and a strange combination of empathy and pragmatism, in order to try to understand what the DBMS is thinking. With the commercial products the knowledge is limited by the vendor's documentation and is improved mostly through personal experience. With a free DBMS like PostgreSQL the source code's availability facilitates the knowledge acquisition. Also, reading the superb poetry which is the C source code creates an intimate relation with the cold binary programs and the geniuses who created them.

A day in the life of a DBA does not have fixed boundaries. It can last one day or several months, for example if there are long procedures to run inside well controlled maintenance windows. Every day is a different combination of duties, including the routine tasks, the monitoring and the proactive thinking. The latter is probably the distinctive mark between a good DBA and a cheap professional. The capability of building a mental map of any possible issue, projected into the near future, can make the difference between spending the night sleeping or working frantically before the next day begins. Of course when an emergency strikes there is the fourth and most important duty: the emergency handling.

Routine tasks

Under the group of routine tasks fall the procedures which are well consolidated in the documents and in the DBA's mind. For example, configuring PostgreSQL's memory parameters should be something immediate. Any task in this category is successful if it remains completely unnoticed. It is not unlikely on live systems for the routine tasks to be performed at antisocial hours, like Sunday night or early morning on working days.


Monitoring

A system without monitoring is a one way ticket to disaster. Whatever solution is used, it should be simple to configure and have a decently low rate of false positives. Having a nagging monitor is exactly the same as not having one at all: the important alerts will pass completely unnoticed. Alongside general purpose solutions like Nagios there is also a new interesting PostgreSQL-dedicated solution. Configuring a passive error detector like tail_n_mail is a very good idea to trap any possible server misbehaviour.

Proactive thinking

Reacting to the issues is fine; trying to prevent them is much better. This kind of task is strictly related to the monitoring. For example, let's consider a master and a slave configured in continuous recovery. Because the slave is built by copying the data files, we should expect both to have roughly the same size. A (not so) small size difference between the master and the standby server can tell us the filesystem has some problems or is configured incorrectly on one box. In this example, assuming the filesystem is xfs, a bloating box can be caused by the dynamic EOF allocation introduced in the newer releases.
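A check of this kind is easy to automate. The sketch below is illustrative (the 5% tolerance and the paths are assumptions, not a recommendation): it measures the on-disk size of a data area and flags a drift between master and standby worth investigating.

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files under path,
    e.g. a PostgreSQL data area."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished between listing and stat
    return total

def size_drift(master_bytes, standby_bytes, tolerance=0.05):
    """True when the two data areas differ by more than the tolerance."""
    bigger = max(master_bytes, standby_bytes)
    return abs(master_bytes - standby_bytes) / bigger > tolerance

# e.g. master 100 GB, standby 120 GB: a 20% drift is worth investigating
print(size_drift(100 * 1024**3, 120 * 1024**3))  # True
```

Wired into the monitoring, such a check turns the "(not so) small size difference" above from a surprise into an alert.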

Emergency handling

Shit happens, deal with it. Sooner or later a disaster will strike, requiring all the experience and competence of the database experts to put the system back in production state. It doesn't matter whether the issue is caused by a user screwing up the database, a power outage, a hardware failure or a fleet of Vogon spaceships ready to demolish the earth. The rule number zero when in emergency is: never guess. Guessing is a one way ticket to disaster, in particular if the action taken can destroy the database.
As an example, let's consider one of the most common causes of outage in the Linux world: the distribution's upgrade. When a Linux distribution is upgraded, it is very likely that PostgreSQL's binaries will be upgraded to a different major version. Now, the data area is not compatible across major releases. The immediate effect when a system is upgraded is that PostgreSQL will refuse to start because of the version mismatch. The fix is quite simple. A dump and reload of the databases in the data area, or alternatively an in place upgrade with pg_upgrade, will get the cluster back in production state. A few years ago a desperate user posted a message asking for help on a message board. One ``expert'' guessed pg_resetxlog would be the solution. Before reading further, spend two minutes reading the pg_resetxlog documentation. In the emergency, the ``expert'' guess would have transformed a simple version mismatch into a corrupted data area.

Thursday 1 January 2015

Happy 2015

Finally I finished the first volume of the PostgreSQL database administration book.
The book is free to download on Slideshare or, if you want the hard copy, there is a button to order it.

I'll build an ebook version in the next few days and make it available on Amazon's Kindle and Kobo.

I hope this book will spread the knowledge on PostgreSQL.

Anyway, I've not finished. I've already started a second volume, and a manual for the developers willing to learn the noble art of SQL writing.


Happy new PostgreSQL year everybody.