Saturday 18 August 2012

The angry dba

Before anything on database administration let me introduce some best practice I acquired during my experience in the consulting company I mentioned before.
This company offer Oracle services to important enterprises in Italy and, despite the size, is well structured and organized.
The first thing I've learned was how to write good documents, not too much verbose but complete in any part.
Then I was inducted to the company's best practice in SQL coding as Oracle have some limitations in the size and characters allowed in the identifiers.
PostgreSQL does not have this limitations giving freedom of choice for any format and structure, with unicode characters (if the database supports UTF8 of course) and a 64 characters maximum length for the identifiers.
This is a good thing but, believe me, sometimes it can be a bloody nightmare.
I've written some guidelines to get things sorted out.


Rule 1 – No camel case or reserved keywords in the identifier name


Let start from the beginning.
Should you need to create a table to store debit notes for the year 2002, which name will you choose?
My old school development habits suggest me something like debit_notes_2002.
A new school developer or maybe used to work with AGILE, SCRUM,ITIL and S**** will probably use the name DebitNotes2002.
This in PostgreSQL, is a problem because all the names are converted in lower case making the camel case useless.
It's possible to keep the case preserved enclosing the identifiers between double quotes “.
That mean that ANY query involving your beautiful identifiers will contain the quotation and this affects the readability of the query.
This happens for any identifier named after reserved keyword as well.
Same story for the column names .


Rule 2 - self explaining schema  


In the rule 1 I declared I would name the table for debit notes as debit_notes_2002.
I lied.
I like to put prefix in any identifier to keep track which relation I'm are dealing with.
PostgreSQL stores any relation in one system table named pg_class and give you system views like pg_tables and pg_indexes to query.
But why I should bother in system catalogue queries or people inquiry if I can let the database describe itself?
The way to do this is quite simple.
I put a prefix on any identifier to tell what kind it is or which data carries.
Adding a t_ prefix in the table name will tell to any developer and, most important to any dba/performance tuner, that the relation carries physical data.
Adding a v_ prefix to any view will tell that we are dealing with a short cut to a query and, hopefully, this will avoid the bad habit to join views resulting in the query planner going nuts.
On the same way giving a prefix to the columns to tell the data type will reduce to the developers the risk to get a wrong behaviour for wrong typecast, and will give to the dba precious informations about the space and storage used by the table.
A good structure for building self explaining names can be this.



Object
Prefix
Table
t_
View
v_
Index
idx_
Unique index
uidx_
Primary key
pk_
Foreign key
fk_
Check
chk_
Unique key
uk_
Type
ty_
Sql function
fn_sql_
Plpgsql function
fn_plpg_
Plpython function
fn_plpy_
Plperl function
fn_plpr_
Trigger
trg_
rule
rul_
Table 1: Object prefix

The same for the data type used for the column.
An example for the most common data type is shown in table 2



Type
Prefix
Character
c_
Character varying
v_
integer
i_
text
t_
bytea
by_
numeric
n_




Rule number 4 – format the SQL queries

Sometime I receive requests to debug or optimize queries written in this way
select * from debitnoteshead a join debitnoteslines b on debnotid where a.datnot=b. datnot and b.deblin>1;
Frustrating isn't it?
Despite my legendary bad temper when I'm in service I'm absolutely professional and probably too much kind because I don't bounce the request and I fix the query.
I know I know, I'm doing wrong.
The company doesn't pay my salary to hit tab, enter and ctrl+u, but to keep the company's systems up, running and sound.
So, if you don't want to waste your time formatting when you receive something like this, I suggest to simply bounce back the request asking for a pretty format.
Some simple rules can enhance the readability.
  1. SQL keywords in upper case
  2. all the identifiers must be indented on the same level
  3. All the same level SQL keywords must be indented at same level
  4. Avoid the SELECT * as is a waste of memory and doesn't tell anything useful
  5. If possible inner join the tables explicitally in the WHERE condition
  6. In the join use the keyword ON to make clear which column is used for joining
  7. In the table list use self explaining aliasing
Using those rules the previous query looks like this.
SELECT
                productcode,
                noteid,
                datnot
FROM
               debitnoteshead hrd,
               debitnoteslines lns
WHERE
                         hdr.debnotid=lns.debnotid
               AND hdr.datnot=lns.datnot
               AND lns.deblin>1
;
Now look beauty.



With this last advice ends my first blog post.
Just small afterword before ending.
Probably someone will think this guidelines are too strict to be accepted.
Obviously for the code already written nothing can be done, but can still teach the developers to write decent code.
I think this is a goal can be achieved and believe me, it worth it.
For the new projects this guidelines give you a powerful vision for the database structure without wasting your time through data dictionaries or asking people about the data.
IMHO our job is a war where the casualties are downtime and data loss.
I consider the data the most precious thing a company own and definitely our mission is to guarantee the smallest reaction time to fix any problem minimizing any possible loss of money.




2 comments:

  1. Are there any technical benefits to using the implicit join notation over explicit INNER JOIN e.g.

    SELECT
    productcode,
    noteid,
    datnot
    FROM
    debitnoteshead AS hrd INNER JOIN
    debitnoteslines AS lns ON
    hdr.debnotid = lns.debnotid
    WHERE
    hdr.datnot = lns.datnot AND
    lns.deblin > 1;
    -- missing indentation due to comments formatting constraints.

    Thank you for your sharing the fruits of your experience.

    ReplyDelete
  2. Using the implicit join give the query planner the freedom of choice.
    Using the explicit join you can influence it as explained here

    http://www.postgresql.org/docs/9.1/static/explicit-joins.html

    I like the implicit notations because I feel the inner join more like a filter constraint, it's bad habit from my oracle experience where, before the 9i, no inner join or left outer join was allowed and the the (+) operator dominate the queries ;)

    ReplyDelete