Saturday, 13 September 2014

Chapter 11 - A couple of things to know before you start coding...

This is almost the entire chapter 11. I'm still writing the final section, which I'd like to put into a separate post though. I've also almost finished the part on the restore's performance. After this the book is complete. I will then start a review to make it decent writing before publishing onto lulu.com and amazon kindle.

I'm not sure amazon permits selling books for free; I'll find a solution anyway.

A couple of things to know before you start coding...

This chapter is completely different from the rest of the book. It's dedicated to developers. PostgreSQL is a fantastic infrastructure for building powerful and scalable applications. In order to use all its potential there are some things to consider. In particular, for those coming from other DBMSs there are subtle caveats that can make the difference between a magnificent success and a miserable failure.

SQL is your friend

Recently the rise of the NoSQL approach has shown more than ever how SQL is a fundamental requirement for managing data efficiently. All the shortcuts, like ORMs or the SQL engines implemented on top of NoSQL systems, will sooner or later show their limits. Despite the bad reputation around it, the SQL language is very simple to understand. Purposely built with a few simple English words, it is a powerful tool set for accessing, managing or defining the data model. Unfortunately this simplicity has a cost. The language must be parsed and converted into the database structure, and sometimes there is a misunderstanding between what the developer wants and what the database understands.
Mastering SQL is a slow and difficult process and requires some sort of empathy with the DBMS. Asking the database administrator for advice is a good way to get introduced to the database's mind. Having a few words with the DBA is a good idea in any case though.

Design comes first

One of the worst mistakes when building an application is to forget about the foundation, the database. With the rise of the ORM this is happening more frequently than could be expected. Sometimes the database itself is considered mere storage, one of the biggest mistakes possible.
Forgetting about the database design is like building a skyscraper from the top. Ironically, if the project is successful there are good chances of seeing it crumble down instead of making money.
In particular one of PostgreSQL's features, the MVCC, is often completely ignored at design time. 7.6 explains how it is implemented and what the risks are when designing a data model. It doesn't matter if the database is simple or the project is small. Nobody knows how successful a new idea could be. Having a robust design will make the project scale properly.

Clean coding

One of the first things a developer learns is how to write formatted code. The purpose of having clean code is twofold. It simplifies reading the code for the other developers and improves the code management when, for example, it is changed months after it was written. SQL seems to be an exception to this good practice. It's quite common to find long queries written all lowercase, on one single line, with keywords used as identifiers. Just trying to understand what such a query does is a nightmare. What follows is a list of good practices for writing decent SQL and avoiding a massive headache for your DBA. If you don't have a DBA see 11.4.

The identifier's name

Any DBMS has its own way of managing identifiers. PostgreSQL folds the identifier's name to lowercase. This doesn't work very well with camel case; it's still possible to mix upper and lower case letters by enclosing the identifier name in double quotes, but this makes the code difficult to read and to maintain. Using underscores in the old fashioned way makes things simpler.
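A minimal sketch of the folding behaviour, using throwaway names:

-- unquoted identifiers are folded to lowercase
CREATE TABLE MyTable (MyColumn integer);
SELECT mycolumn FROM mytable;   -- works, everything was folded to lowercase

-- quoted identifiers keep their case, but the quotes become mandatory
CREATE TABLE "MyTable" ("MyColumn" integer);
SELECT "MyColumn" FROM "MyTable";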

Self explaining schema

When a database structure becomes complex it's very difficult to tell what is what and how it relates to the other objects. A design diagram or a data dictionary can help. But they can be outdated or maybe not generally accessible. Adopting a simple prefix to add to the relation's name will give an immediate outlook of the object's kind.

Object               Prefix
Table                t_
View                 v_
B-tree index         idx_bt_
GiST index           idx_gst_
GIN index            idx_gin_
Unique index         u_idx_
Primary key          pk_
Foreign key          fk_
Check                chk_
Unique key           uk_
Type                 ty_
SQL function         fn_sql_
PL/pgSQL function    fn_plpg_
PL/Python function   fn_plpy_
PL/Perl function     fn_plpr_
Trigger              trg_
Rule                 rul_

A similar approach can be used for the column names, making the data type immediately recognisable.

Type                 Prefix
Character            c_
Character varying    v_
Integer              i_
Text                 t_
Bytea                by_
Numeric              n_
Timestamp            ts_
Date                 d_
Double precision     dp_
Hstore               hs_
Custom data type     ty_
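For example, a table and an index following both conventions could be declared like this; the names are illustrative and reappear in the examples below.

-- t_ for the table, i_ and v_ for the column types, idx_bt_ for the btree index
CREATE TABLE t_addresses
(
        i_id_address    serial,
        v_address       character varying(255),
        i_id_city       integer
);

CREATE INDEX idx_bt_addresses_city
        ON t_addresses (i_id_city);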


Query formatting

Having a properly formatted query helps to understand which objects are involved in the data retrieval and the relations between them. Even with a very simple query, careless writing can make it very difficult to understand.
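For example, consider this hypothetical query, written against an addresses and a cities table like the ones used in the foreign key examples below:

select * from t_addresses a,t_cities b where a.i_id_city=b.i_id_city and b.v_city='Brighton';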

A query like this has many issues:


  • using lowercase keywords makes them difficult to spot
  • the wildcard * hides which columns are effectively needed; it retrieves all the columns, consuming more bandwidth than required, and it prevents the index only scans
  • meaningless aliases like a and b make it difficult to understand which relations are involved
  • writing a statement with no indenting makes it difficult to understand the query's logic
Despite the existence of tools capable of prettifying such queries, their usage doesn't solve the root problem. It's better to write decent SQL from the start, following those simple rules:

  • All SQL keywords should be in upper case
  • All the identifiers and keywords should be grouped by line break and indented at the same level
  • In the SELECT list specify all and only the columns required by the query
  • Avoid the auto join in order to make the relations' logic clear
  • Adopt meaningful aliases
And here's the fixed SQL, sketched against the same hypothetical tables:
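SELECT
        adr.i_id_address,
        adr.v_address,
        cit.v_city
FROM
        t_addresses adr
        INNER JOIN t_cities cit
                ON adr.i_id_city = cit.i_id_city
WHERE
        cit.v_city = 'Brighton';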

Sunday, 7 September 2014

Chapter 6 final parts. Foreign, check and null constraints


Foreign keys

A foreign key is a constraint enforced using the values of another table's field. The classical example is the pair of tables storing the addresses and the cities. We can store the addresses with the city field inline.
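A minimal sketch of the inline layout; the names are illustrative and follow the conventions from chapter 11.

CREATE TABLE t_addresses
(
        i_id_address    serial,
        v_address       character varying(255),
        v_city          character varying(255),
        CONSTRAINT pk_t_addresses PRIMARY KEY (i_id_address)
);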

With the city duplicated over many addresses, this will cause table bloat, storing long strings duplicated many times alongside the address. Defining a table with the cities and referencing the city id in the addresses table will result in a smaller row size.
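A sketch of the normalised layout, again with illustrative names:

CREATE TABLE t_cities
(
        i_id_city       serial,
        v_city          character varying(255),
        CONSTRAINT pk_t_cities PRIMARY KEY (i_id_city)
);

CREATE TABLE t_addresses
(
        i_id_address    serial,
        v_address       character varying(255),
        i_id_city       integer NOT NULL,
        CONSTRAINT pk_t_addresses PRIMARY KEY (i_id_address)
);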

When dealing with referencing data, the main concern is the data consistency between the tables. In our example, putting an invalid identifier for the city in the t_addresses table will result in missing data when joining. The same will happen if for any reason the city identifier in the t_cities table is updated.

The foreign keys are designed to enforce the referential integrity. In our example we'll enforce a strong relationship between the tables.
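The DDL might look like this, with the constraint name following the chapter 11 conventions:

ALTER TABLE t_addresses
        ADD CONSTRAINT fk_t_addresses_i_id_city
        FOREIGN KEY (i_id_city)
        REFERENCES t_cities (i_id_city);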

The key is enforced in two ways. When a row with an invalid i_id_city hits the table t_addresses the key is violated and the transaction aborts. Deleting a city from the t_cities table whose id is referenced in t_addresses will also violate the key, and so will updating an i_id_city referenced in t_addresses.

The enforcement is performed via triggers. The pg_dump and pg_restore option --disable-triggers permits the data restore with the schema already in place. For more information take a look at chapters 9 and 10.

The FOREIGN KEYS have been enriched with a handful of options which make them very flexible. The referenced table can drive actions on the referencing table using the two options ON DELETE and ON UPDATE. By default the behaviour is NO ACTION: an error is raised if referencing rows still exist when the constraint is checked, and the check can be deferred to the end of the transaction. The other two actions are RESTRICT, which does not allow the deferring, and CASCADE, which cascades the action to the referencing rows.
If we want our foreign key to restrict the delete with no deferring and cascade any update, here's the DDL, sketched with the same illustrative names:
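-- recreate the key with the new actions; names as in the previous sketch
ALTER TABLE t_addresses
        DROP CONSTRAINT fk_t_addresses_i_id_city;

ALTER TABLE t_addresses
        ADD CONSTRAINT fk_t_addresses_i_id_city
        FOREIGN KEY (i_id_city)
        REFERENCES t_cities (i_id_city)
        ON DELETE RESTRICT
        ON UPDATE CASCADE;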

Another very useful clause available with the foreign key and check constraints is NOT VALID. When the constraint is created with NOT VALID, the initial check is skipped, making the constraint creation instantaneous. This is acceptable if the existing data is consistent. The constraint is then enforced for all the new data. The invalid constraint can be validated later with the command VALIDATE CONSTRAINT.
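A sketch with the same illustrative names:

-- the existing rows are not checked: creation is instantaneous
ALTER TABLE t_addresses
        ADD CONSTRAINT fk_t_addresses_i_id_city
        FOREIGN KEY (i_id_city)
        REFERENCES t_cities (i_id_city)
        NOT VALID;

-- the constraint can be validated later, scanning the table
ALTER TABLE t_addresses
        VALIDATE CONSTRAINT fk_t_addresses_i_id_city;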

Check constraints

A check constraint is a user defined check to enforce specific conditions on the rows. The definition can be a condition or a user defined function. In the latter case the function must return a boolean value. As for the foreign keys, the check accepts the NOT VALID clause to speed up the creation.

The check is satisfied if the condition returns true or NULL. This behaviour can produce unpredictable results if not fully understood. An example will help to clarify. Let's create a CHECK constraint on the v_address field of the t_addresses table to enforce the presence of a value. Even with the check in place, an insert without the address completes successfully.
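A sketch of the surprising behaviour, with illustrative names:

ALTER TABLE t_addresses
        ADD CONSTRAINT chk_t_addresses_v_address
        CHECK (v_address <> '');

-- succeeds: v_address is NULL, the condition returns NULL and the check
-- is satisfied (assuming the city with id 1 exists)
INSERT INTO t_addresses (i_id_city) VALUES (1);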

This is possible because v_address does not have a default value and accepts the NULL values. The check constraint is violated if, for example, we try to update the v_address to the empty string.
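For example:

-- fails: the empty string makes the check condition return false
UPDATE t_addresses SET v_address = '';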

Our check constraint will work as expected if we set a fallback default value for the v_address field.
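For example, using an illustrative placeholder:

ALTER TABLE t_addresses
        ALTER COLUMN v_address SET DEFAULT 'EMPTY';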

Please note the existing rows are not affected by the default value change.
The message for the update and the insert is exactly the same because PostgreSQL attempts to create a new row in both cases. When the constraint fails the transaction is rolled back, leaving the dead row in place. We'll take a better look at the MVCC in 7.6.

Not null

For people approaching the database universe the NULL value can be quite confusing. A NULL value is an empty object without any type or meaning. Actually, when a field is NULL it doesn't consume physical space. By default, when defining a field, it is NULLable. Those fields are quite useful to omit some columns, for example, at insert time.
The NULLable fields can be enforced with a unique constraint, as the NULLs are not considered duplicates. When dealing with NULL it's important to remember that NULL acts like the mathematical zero in a multiplication: when evaluating an expression containing a NULL, the entire expression becomes NULL.
The NOT NULL is a column constraint which forbids the presence of NULL in the affected field. Combining a NOT NULL with a unique constraint is exactly like enforcing a PRIMARY KEY. When altering an existing field the table is scanned for validation, and the constraint's creation is aborted if any NULL value is present in the affected field.

For example, if we want to add the NOT NULL constraint to the field v_address in the t_addresses table the command is just:
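-- aborts with an error if the column contains NULL values
ALTER TABLE t_addresses
        ALTER COLUMN v_address SET NOT NULL;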

In this case the alter fails because the column v_address contains NULL values from the example seen in 6.4. The fix can be performed in a single query using the coalesce function. This function returns the first not null value from the left. This way it is possible to change the v_address on the fly to a fixed placeholder.
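A sketch of the fix, using an illustrative placeholder:

-- coalesce replaces the NULLs with the placeholder, leaving the other values untouched
UPDATE t_addresses
        SET v_address = coalesce(v_address, 'FIXME');

ALTER TABLE t_addresses
        ALTER COLUMN v_address SET NOT NULL;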

Adding new columns with NULL is quick. PostgreSQL simply adds the new attribute to the system catalogue and manages the new tuple structure considering the new field as empty space. Adding a field with NOT NULL requires the presence of a DEFAULT value as well. This is an operation to consider carefully when dealing with big amounts of data: the table is locked in exclusive mode and a complete rewrite of the relation happens. A far better way to add a NOT NULL field is to add it at first as a NULLable field. A second alter will add the default value so the new rows are correctly set. An update will then fix the NULL values without locking the table. Finally the NOT NULL can be enforced without hassle.
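A sketch of the safer sequence, assuming a new illustrative column v_country:

-- step 1: add the column as NULLable, a quick catalogue-only change
ALTER TABLE t_addresses
        ADD COLUMN v_country character varying(255);

-- step 2: set the default so the new rows are correctly populated
ALTER TABLE t_addresses
        ALTER COLUMN v_country SET DEFAULT 'UNKNOWN';

-- step 3: fix the existing rows, possibly in batches on big tables
UPDATE t_addresses
        SET v_country = 'UNKNOWN'
        WHERE v_country IS NULL;

-- step 4: finally enforce the NOT NULL
ALTER TABLE t_addresses
        ALTER COLUMN v_country SET NOT NULL;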