
Saturday, 10 January 2015

DBA in a nutshell

I've already started writing the second volume of the database administration series.
The first volume was quite basic, with enough information to be useful but not so much as to scare people away.

The second volume is strongly focused on database administration, and I decided to start it with a chapter explaining what it means to be a DBA.

This will probably dissuade many people from starting this exciting career.

So, here we go again.



A database administrator is a strange combination of theory and practice, a mix of strictness and loose rules. It's quite difficult to explain what exactly a DBA does. Working with databases requires passion, knowledge and a strange combination of empathy and pragmatism, in order to try to understand what the DBMS is thinking. With the commercial products the knowledge is limited by the vendor's documentation and is improved mostly through personal experience. With a free DBMS like PostgreSQL the availability of the source code facilitates the acquisition of knowledge. Also, reading the superb poetry that is the source code, written in C, creates an intimate relation with the cold binary programs and the geniuses who created them.

A day in the life of a DBA does not have fixed boundaries. It can last one day or several months, for example if there are long procedures to run inside well controlled maintenance windows. Every day is a different combination of duties, including the routine tasks, the monitoring and the proactive thinking. The latter is probably the distinctive mark between a good DBA and a cheap professional. The capability of building a mental map for finding any possible issue, projected into the near future, can make the difference between spending the night sleeping or working frantically before the next day begins. Of course, when an emergency strikes there is a fourth and most important duty: the emergency handling.

Routine tasks

Under the group of routine tasks fall the procedures which are well consolidated in the documents and in the DBA's mind. For example, configuring PostgreSQL's memory parameters should be something immediate. Any task in this category is successful if it remains completely unnoticed. It is not unlikely that on live systems the routine tasks are performed at antisocial hours, like Sunday night or early morning on working days.
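As a minimal sketch of such a routine change, here is what the memory-related part of postgresql.conf could look like on a hypothetical dedicated box with 8 GB of RAM; the figures are only illustrative starting points, not a prescription:

shared_buffers = 2GB            # roughly 25% of RAM is a common first guess; needs a restart
work_mem = 16MB                 # allocated per sort or hash operation, so keep it modest
maintenance_work_mem = 512MB    # speeds up VACUUM and index builds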

Monitoring

A system without monitoring is a one-way ticket to disaster. Whatever solution is used, it should be simple to configure and have a decently low rate of false positives. Having a nagging monitor is exactly the same as having none at all: the important alerts will pass completely unnoticed. Alongside general purpose solutions like Nagios, there is an interesting new PostgreSQL-dedicated project called OPM (http://opm.io/). Configuring a passive error detector like tail_n_mail is also a very good idea, to trap any possible server misbehaviour.
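As a trivial example of the kind of check any monitoring solution should run, here is a sketch that spots long running queries, assuming PostgreSQL 9.2 or later for the pg_stat_activity column names; the five minute threshold is purely illustrative:

SELECT pid, usename, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;

Anything returned by this query at 3 AM is worth an alert; a monitor firing it every five minutes for harmless reports is the nagging monitor described above.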

Proactive thinking

Reacting to issues is fine. Trying to prevent them is much better. This kind of task is strictly related to the monitoring. For example, let's consider a master and a slave configured in continuous recovery. Because the slave is built by copying the data files, we should expect both to have the same size. A (not so) small size difference between the master and the standby server can tell us that the filesystem has some problems or is configured incorrectly on one box. In this example, assuming the filesystem is XFS, a bloating box can be caused by the dynamic EOF preallocation introduced in the newer releases.
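A hedged sketch of this check, run on both boxes; the Debian-style data area path is only illustrative:

# compare the output between master and standby;
# a steadily growing gap deserves investigation
du -sh /var/lib/postgresql/9.3/main/base

The same comparison can be done from SQL with pg_database_size(), but the filesystem view is the one that catches the XFS preallocation described above.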

Emergency handling

Shit happens, deal with it. Sooner or later a disaster will strike, requiring all the experience and competence of the database experts to put the system back into production state. It doesn't matter whether the issue is caused by a user screwing up the database, a power outage, a hardware failure or a fleet of Vogon spaceships ready to demolish the earth. Rule number zero in an emergency is: never guess. Guessing is a one-way ticket to disaster, in particular if the action taken can destroy the database.
As an example, let's consider one of the most common causes of outage in the Linux world: the distribution upgrade. When a Linux distribution is upgraded, it is very likely that the PostgreSQL binaries will be upgraded to a different major version. Now, the data area is not compatible across major releases. The immediate effect when a system is upgraded is that PostgreSQL will refuse to start because of the version mismatch. The fix is quite simple. A dump and reload of the databases in the data area or, alternatively, an in-place upgrade with pg_upgrade, will get the cluster back into production state. A few years ago a desperate user posted a message asking for help on a message board. One "expert" guessed that pg_resetxlog would be the solution. Before reading further, spend two minutes reading the pg_resetxlog documentation (http://www.postgresql.org/docs/9.3/static/app-pgresetxlog.html). In the emergency, the "expert" guess would have transformed a simple version mismatch into a corrupted data area.
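For reference, a minimal sketch of the dump and reload path; the versions and Debian-style paths are only illustrative, and the old binaries must still be installed:

# dump using the old binaries while the old cluster is still runnable
/usr/lib/postgresql/9.1/bin/pg_dumpall > /backup/all_databases.sql
# initialize and start a fresh data area with the new binaries, then restore
/usr/lib/postgresql/9.3/bin/initdb -D /var/lib/postgresql/9.3/main
psql -f /backup/all_databases.sql postgres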



Saturday, 3 May 2014

This week's update

This week I'll not update the blog with a new chapter.
I'm scattered across time and space and have little time to concentrate on writing.

Anyway, I'll take the occasion to stress a little the sources of information about PostgreSQL.
As someone will surely notice, on my blog there's no indication of the PostgreSQL releases nor information about the critical ones.

The reason is that it's better to refer to the official website.

There are a few websites which display the PostgreSQL releases in a foreign language.
Yes, bunch of idiots, I'm talking about you.

If you follow their outdated information you will miss critical bug fixes which can harm your standbys or result in a catastrophic data loss, exposing your data directory to dogs and pigs.

So this week's lesson is: follow www.postgresql.org for any release and don't trust anything except the official releases.


Tuesday, 25 December 2012

The Christmas Shitload Post


Happy Christmas! Believe it or not we are almost done. We're almost out of winter!

This post is my present for a brilliant new PostgreSQL year.
You know, Christmas usually means goodness, happiness, joy and kindness.
And if you're waiting for those here, well, you will be sorely disappointed.
I decided to close the year writing my feelings about VACUUM FULL, one of PostgreSQL's key features, and an interesting thing I discovered in the utility script shipped with the Debian package for controlling the clusters, pg_ctlcluster.

And believe me when I tell you that I feel like the Angry Birds when the bad piggies stole their eggs.


VACUUM Two-Face

VACUUM is one of the most important functions shipped with PostgreSQL. Its main goal is to free space by cleaning the dead rows generated during normal database activity.

The conventional VACUUM simply marks the dead rows as reusable and freezes the old tuples, avoiding the s***t hitting the fan in the form of an XID wraparound failure; it does not shrink the data file.
If you have a large table with many dead rows, after the VACUUM you will see the table stop growing.
If you need to reclaim space on disk, VACUUM FULL is what you need, as it shrinks down the datafile, locking the affected relation in ACCESS EXCLUSIVE mode and preventing both reads and writes.
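A quick way to see the two behaviours, sketched here with a hypothetical bloated table t_bloated (the table and the filter are purely illustrative):

db_test=# DELETE FROM t_bloated WHERE id % 2 = 0;                -- generate dead rows
db_test=# VACUUM t_bloated;                                      -- space marked reusable
db_test=# SELECT pg_size_pretty(pg_relation_size('t_bloated'));  -- file unchanged
db_test=# VACUUM FULL t_bloated;                                 -- relation rewritten
db_test=# SELECT pg_size_pretty(pg_relation_size('t_bloated'));  -- file shrunk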

During a VACUUM FULL on a massively bloated table I noticed a strange behaviour.

The free space on the table's tablespace decreased and then suddenly stopped falling; meanwhile the free space on pg_default began to shrink until the filesystem ran out of space, with the subsequent failure of the task.

I started investigating the problem by increasing the client verbosity with SET client_min_messages='debug';

I'll explain what I found with a simple test using a copy of the pg_attribute table.
To watch what's happening I switched on the debug message level for the client and turned on the trace_sort parameter during a single column index creation on the attname field.
This is the entire procedure output.


db_test=# CREATE TABLE t_vacuum AS
SELECT
        *
FROM
        pg_attribute;
SELECT 2155
-- repeat this many times in order to
-- fill the table with a good amount of data
db_test=# INSERT INTO t_vacuum
db_test-# SELECT * FROM t_vacuum
db_test-# ;
INSERT 0 2155
db_test=# SET trace_sort = 'on';
db_test=# SET client_min_messages = 'debug';
db_test=# CREATE INDEX idx_attname ON t_vacuum USING btree (attname);
DEBUG:  building index "idx_attname" on table "t_vacuum"
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  switching to external sort with 59 tapes: CPU 0.04s/0.48u sec elapsed 0.53 sec
LOG:  performsort starting: CPU 0.04s/0.72u sec elapsed 0.77 sec
LOG:  finished writing run 1 to tape 0: CPU 0.08s/2.56u sec elapsed 2.65 sec
LOG:  finished writing final run 2 to tape 1: CPU 0.08s/2.56u sec elapsed 2.65 sec
LOG:  performsort done (except 2-way final merge): CPU 0.09s/2.61u sec elapsed 2.72 sec
LOG:  external sort ended, 836 disk blocks used: CPU 0.14s/2.84u sec elapsed 3.16 sec
CREATE INDEX


As my work_mem is small, the index sort switches immediately to an external sort in order to build the index, then generates the btree from the external sort file.
Now let's run a VACUUM FULL on the same table.


db_test=# VACUUM FULL t_vacuum;
DEBUG:  building index "pg_toast_16411_index" on table "pg_toast_16411"
LOG:  begin index sort: unique = t, workMem = 16384, randomAccess = f
LOG:  begin index sort: unique = f, workMem = 1024, randomAccess = f
LOG:  internal sort ended, 17 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 17 KB used: CPU 0.00s/0.00u sec elapsed 0.08 sec
DEBUG:  vacuuming "public.t_vacuum"
DEBUG:  "t_vacuum": found 0 removable, 275840 nonremovable row versions in 4756 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 0.17s/0.99u sec elapsed 1.87 sec.
DEBUG:  building index "idx_attname" on table "t_vacuum"
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  switching to external sort with 59 tapes: CPU 0.07s/0.51u sec elapsed 0.58 sec
LOG:  finished writing final run 2 to tape 1: CPU 0.14s/2.58u sec elapsed 2.73 sec
LOG:  performsort done (except 2-way final merge): CPU 0.14s/2.64u sec elapsed 2.80 sec
LOG:  external sort ended, 836 disk blocks used: CPU 0.21s/2.85u sec elapsed 3.27 sec
DEBUG:  drop auto-cascades to type pg_temp_16402
DEBUG:  drop auto-cascades to type pg_temp_16402[]
DEBUG:  drop auto-cascades to toast table pg_toast.pg_toast_16411
DEBUG:  drop auto-cascades to index pg_toast.pg_toast_16411_index
DEBUG:  drop auto-cascades to type pg_toast.pg_toast_16411
VACUUM


The interesting part begins at row 9. The database actually performs the VACUUM, shrinking down the data file and cleaning the dead rows; then at row 13 we find the same index creation output.
This explains well why the VACUUM FULL failed.

As VACUUM FULL acts exactly like CLUSTER, during the process the table is read sequentially and the output is written to a new file node, and this happens on the same tablespace the original table belongs to.

The index build is performed on the index's own tablespace, except for the on-disk sort, which is governed by the temp_tablespaces parameter. If this is not set, the sort is performed on pg_default.

In conclusion, if you want to run a VACUUM FULL you will need space for the new file node on the table's own tablespace. The indexes require space on their tablespace too, plus the on-disk sort if you don't have enough memory.

So, if you are short of space on pg_default, setting temp_tablespaces to a more spacious tablespace will prevent pg_default from running out of space.
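For example, assuming a tablespace named ts_temp was already created on a roomier volume (the name is purely illustrative):

db_test=# SET temp_tablespaces = 'ts_temp';
SET
db_test=# VACUUM FULL t_vacuum;
VACUUM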


Aggressive shutdown

Working with the hot standby on Debian I had a bad experience when I set up a new slave server. After the data copy, with the archive command running, I started the slave, forgetting to set the hot_standby parameter to on. After a while the slave reached the consistent state and I issued the pg_ctlcluster command to stop the cluster, with --force to have a fast shutdown instead of a smart one.
The script doesn't have an option to switch between PostgreSQL's shutdown modes; when run without options it performs a smart shutdown.
With the --force option the script first tries a fast shutdown; if the database doesn't shut down in a reasonable time the script switches to an immediate shutdown, and if this is not enough to stop the instance, finally a kill -9 is performed on the postgres process.

WARNING INCOMING AIRSTRIKE!!!!!

If you are too delicate I suggest you stop reading; now begins the real shitload.

I understand the hard work done by the global team to create a first class product. I left Oracle for PostgreSQL and I don't regret my choice, because PostgreSQL is a damn good database, probably too good.

But can anyone explain to me why, in the name of any of the thousands of fake gods invented by the human race, no one put the VACUUM FULL behaviour in the online manual, leaving this HUGE hole in the documentation of a PostgreSQL KEY FEATURE?

Regarding the Debian script, I understand the --force option of pg_ctlcluster is not meant to be used in normal conditions, but does the author understand what it means to perform a smart shutdown in normal conditions?

A user that forgets to log out can make the script run forever.
On the other side, if --force is passed to the script and the database takes too long to shut down, an immediate shutdown is performed without warning.

And this will wipe out any bloody unlogged table from the cluster on
the next startup.

Then, as the cherry on top, if the immediate shutdown does not work, a kill -9 is performed on the postmaster, leaving the shared memory as garbage.

And this works the same way if you use the Debian init script.


HEAVY MACHINE GUN! (Metal Slug quote)
So, my last advice is: do not use the wrapper pg_ctlcluster to stop the clusters created with pg_createcluster.
It's better to shut down the cluster using the command directly:

export PGDATA=<PATH TO DATA>
/usr/lib/postgresql/<MAJOR VERSION>/bin/pg_ctl stop -m <THE WAY YOU WANT>
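For example, a fast shutdown of a hypothetical 9.2 cluster in the default Debian location could look like this:

export PGDATA=/var/lib/postgresql/9.2/main
/usr/lib/postgresql/9.2/bin/pg_ctl stop -m fast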



HAPPY CHRISTMAS!







Saturday, 25 August 2012

Danger Master! Danger!


to configure the PITR you should save the $PGDATA with tar or in alternative you can use pg_dumpall
PostgreSQL trainer describing the point in time recovery

An oid type can carry on 2Gb
PostgreSQL guru explaining the binary data type

Maybe you can use pg_resetxlog to solve your problem
A clueless user trying to solve a major version incompatibility data directory issue


Danger, man at work

This is a small excerpt of what I've seen browsing the technical forums and community websites.
My first reaction reading all this nonsense was anger, as these guys have a good reputation in the PostgreSQL community and their ignorance can cause data loss, damage and, definitely, harm to PostgreSQL.
Obviously it's impossible to cover everything, so I'm writing this post as a warning: do not trust people only by their reputation. You should investigate any suggested solution and absolutely DO NOT TRY IT ON PRODUCTION, NEVER!

Is it an airplane? Is it a UFO? It's a dumb a**

Point in time recovery is a wonderful feature introduced in version 8.0, which enables the capability to restore the entire database cluster at the exact date and time you want.
In this post I'm not describing how to configure it, as the documentation is well written. I'll just show how it works, to make clear the danger in the trainer's quote.

The PITR is a physical backup where you actually copy the data files while the database is running. You can do it safely because the write ahead logs generated during the copy are archived, and all the changes can be applied to the inconsistent data files later, when you need to restore the cluster.
This is possible because the blocks stored in the archived WAL carry the information about where each block is located on disk. In the restore process the physical data block in the data file is overwritten by the WAL data block during the startup, when the database performs the instance recovery.
pg_dumpall is a logical backup wrapper around the general purpose pg_dump. If you look at pg_dump's source code you will see a collection of SELECT commands within a strict SERIALIZABLE transaction.

The result is an output file with the SQL commands to run against a different cluster to recreate the databases, the tables and the data.

The restore with this tool requires an empty cluster, generated with initdb, up and running. pg_dumpall's restore does not go through instance recovery but requires a fully running instance, which means all the archived WAL files are useless, as they are usable only during recovery.

So, if you follow the procedure suggested by the trainer and choose pg_dumpall, you will only have the illusion of a PITR, because you are using a logical tool to restore a physical backup, and you will lose every database change between your pg_dumpall and the moment the live database server was lost.


OID is not a TARDIS

PostgreSQL has the built-in data type bytea for binary objects. As it is a TOASTable type with variable length, the maximum allowed size is 1 GB. The binary data in this case is stored inline like any table column, and you can read and write the content using conventional SQL statements.
If the binary data doesn't fit in 1 GB, PostgreSQL allows another strategy: storing up to 2 GB as a large object.
The built-in functions lo_import and lo_export can read and write files from the disk and store the contents in a system table called pg_largeobject.
This table has three columns:

  • loid TYPE oid
  • pageno TYPE integer
  • data TYPE bytea

When the function lo_import is called, PostgreSQL stores the binary data in pg_largeobject, splitting it into pages identified by the same loid value and different pageno values. The function returns the OID value to reference the stored large object.
The function lo_export does the reverse: it reads the binary chunks for the given OID and writes the binary stream to disk.
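A minimal sketch of the two functions at work; the file paths and the returned OID are purely illustrative, and the paths must be readable and writable by the server process:

db_test=# SELECT lo_import('/tmp/picture.png');
 lo_import
-----------
     16390
(1 row)

db_test=# SELECT lo_export(16390, '/tmp/picture_copy.png');
 lo_export
-----------
         1
(1 row)

-- the pages behind the large object are visible in the system table
db_test=# SELECT loid, pageno, length(data) FROM pg_largeobject WHERE loid = 16390;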
So, the second quote shows what little attention and simple logical deduction the PostgreSQL guru has: the oid does not carry the 2 GB, it merely references a large object which can hold up to 2 GB.


I'll burn my house to light a candle

For the three seconds after reading this quote I remained speechless; the suggestion is probably the most dangerous thing, and by far the most brutish, you can do to your data directory.
I discovered this idiocy browsing the forum, probably too late to avoid the disaster, as the post was months old.
As any PostgreSQL DBA should know, the WAL files are used to save the data block changes on non-volatile storage before the writes actually reach the data files; if a crash happens, the data blocks can be applied to the data files by reading them back from the WAL.
The process is the same as the one used for PITR; the only difference is that no archived WAL is used, and only the files in the pg_xlog are used to replay the blocks on the data files.
The global directory contains an 8k file called pg_control where the last checkpoint location is stored, telling the database from which WAL segment to start the replay during the instance recovery.
Any corruption in pg_control or in the WAL files results in an instance unable to start. If any, and I repeat any, other solution is useless, it is still possible to start the instance using the program pg_resetxlog.
This program does one simple thing: it deletes everything in the pg_xlog and generates a new pg_control, restarting the XID count from the beginning.
The instance can start again, but any consistency boundary between the data blocks and the WAL is lost, and any attempt to run read-write queries can result in data corruption.
The online manual is absolutely clear on this:
after running pg_resetxlog the database must be started without user access, the entire content must be dumped using pg_dumpall, the data directory must be dropped and recreated from scratch using initdb, and then the dump file restored using psql or pg_restore.
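A sketch of that rescue sequence, with hypothetical Debian-style paths, and the usual warning that pg_resetxlog is the very last resort:

pg_resetxlog /var/lib/postgresql/9.1/main      # last resort only!
pg_ctl -D /var/lib/postgresql/9.1/main start   # keep users out, e.g. via pg_hba.conf
pg_dumpall > /backup/rescue.sql                # salvage what is still readable
pg_ctl -D /var/lib/postgresql/9.1/main stop
mv /var/lib/postgresql/9.1/main /var/lib/postgresql/9.1/main.broken
initdb -D /var/lib/postgresql/9.1/main         # brand new data area
pg_ctl -D /var/lib/postgresql/9.1/main start
psql -f /backup/rescue.sql postgres            # reload the salvaged dump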
After the scary part, let me explain the problem, a common one indeed, and the solution.
If you use the packaged PostgreSQL shipped with the most common Linux distributions, it can happen, when you do a distribution upgrade, that the PostgreSQL version jumps a major release, for example from 8.4 to 9.1.
When this happens the new binaries refuse to start on a data directory initialized with the previous major version.
The solution is quite simple. You can compile a matching PostgreSQL version, start the instance and dump the contents to restore into the new major version's data directory or, if you have an enterprise amount of data and you are in a hurry to get back online, you can use pg_upgrade, a powerful tool for the in-place upgrade.
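A hedged sketch of a pg_upgrade invocation for that scenario, with illustrative Debian-style paths; both sets of binaries must be installed and both clusters stopped:

/usr/lib/postgresql/9.1/bin/pg_upgrade \
  -b /usr/lib/postgresql/8.4/bin -B /usr/lib/postgresql/9.1/bin \
  -d /var/lib/postgresql/8.4/main -D /var/lib/postgresql/9.1/main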
I hope these three examples have scared you enough to test and check the documentation before doing anything.
Cheers.