Monday 25 August 2014

The missing chapter 6 part 1 and two, data integrity

I've started the sixth chapter, the one on the data integrity I've forgotten. There are the first two parts alongside with the introduction. I've also updated the book on slideshare with the new cover and the last incomplete chapter for the developers. The beautiful cover is made by Chiaretta & Bon. Kudos and many thanks.

I've also uploaded the latex sources on github for anybody to fork and review my crappy english. My former colleague and friend Craig Barnes already started reviewing the tex files, many thanks for the priceless help.

Here's the github repository url :

Data integrity

There's just one thing worse than losing the database. Having the data set full of rubbish. The data integrity has been part of PostgreSQL since the beginning. It offers various levels of strength ensuring the data is clean and consistent. In this chapter we'll have a brief look to the various constraints available. The PostgreSQL's constraints can be grouped in two kind. The table constraints and the column constraints. The table constraints are defined on the table's definition after the field's list. The column constraints appear in the field's definition after the data type. Usually for the primary keys and the unique keys the definition is written as table constraint.
The constraint applies the enforcement to any table's row without exclusion. When creating a table constraint on a fully populated table the data is validated first. Any validation error aborts the constraint creation. However, the foreign keys and check constraints accept the clause NOT VALID. With this clause the database assumes the data is valid and skips the validation. The cration is almost immediate. The new constraint is then enforced only for the new data. When using this option the data must be consistent.

Primary keys

A primary key is the unique row identifier. Having this constraint enforced ensures the row can be addressed directly using the key value. A primary key can be enforced on a single or multi column. The data aspect must be unique with the strictest level. That means the NULL values are not permitted in columns participating to the primary key. When creating a primary key this implicitly adds a new unique index on the affected fields. In order to avoid the exclusive lock on the affected table the unique index can be built before the primary using the CONCURRENTLY clause and then used in the primary key definition as shown in 8.3. Using the primary key is the fastest way to access the table's contents.
There is the primary key definition as table and column constraint.

With the table's constraint definition is possible to specify the constraint name and to have a multi column constraint. When writing a multi column constraint the participating columns should be listed separate by commas.

The most common primary key implementation, and probably the best, is to have a serial column as primary key. A serial field is short for integer NOT NULL which default value is associated to the nextval for an auto generated sequence. Because the sequence have its upper limit to the bigint upper limit, this ensures the data does not wraps in the table's lifetime. In the case the primary key is expected to reach the value of 2,147,483,647 the type of choice should be bigserial rather serial. This will create the primary key's field as bigint which upper limit is 9,223,372,036,854,775,807.

However it's still possible to alter the field later in order to match the new requirements. Because changing the data type requires a complete table's rewrite, any view referencing the affected column will abort the change.

Here's the t_data's type change output with the client message level set to debug3.

Dealing with a big amount of data presents also the problem to have enough space for fitting twice of the original table plus the downtime caused by the exclusive lock on the affected relation. A far better approach is to add a new bigint NULLable column without default value. Setting up a trigger for the inserts will keep in sync the new values with the original primary key. Then an update procedure will set the value for the rows. This should run in small batches to avoid to overfill the pg_xlog directory with long running transactions. When everything is in place the new column could then become NOT NULL and a unique index will finally enforce the uniqueness for the new field.

The primary key can then be dropped and recreated using the new unique index. This is permitted only if there's no foreign key referencing the field. In this case a multi drop and create statement is required. The final cleanup should include the trigger's drop and the old primary key removal. Any view using the old primary key should be rebuilt before the drop.

Unique keys

The unique keys are similar to the primary keys. They enforce the uniqueness using an implicit index but they allow the presence of NULL values. Their usage is for enforcing uniqueness on columns not used as primary key. Similar to the primary key the unique constraints are based on a unique index. In fact there's little difference between the unique index and the unique key except the presence of the latter in the system table pg_constraint.

No comments:

Post a Comment