Sunday 13 July 2014

Chapter 8 - Part 1, pg_dump at a glance

Pretty busy week (and weekend); I was only able to write these boring lines, mostly derived from pg_dump's help output. I'll do better next time, I promise.

Backup

Hardware is subject to faults. In particular, if the storage is lost the entire data infrastructure becomes inaccessible, sometimes for good. Human errors can happen as well, like an unfiltered DELETE or an accidental table DROP. A solid backup strategy is therefore the best protection, ensuring the data is still recoverable. In this chapter we'll look only at the logical backup performed with pg_dump.


pg_dump at a glance

As seen in 3.1.5, pg_dump is PostgreSQL's utility for taking consistent backups. Its usage is quite simple: if launched without options it tries to connect to the local cluster with the current user, sending the dump to the standard output.
The pg_dump help gives useful information about its usage.
postgres@tardis:~/dump$ pg_dump --help
pg_dump dumps a database as a text file or to other formats.

Usage:
  pg_dump [OPTION]... [DBNAME]

General options:
  -f, --file=FILENAME          output file or directory name
  -F, --format=c|d|t|p         output file format (custom, directory, tar,
                               plain text (default))
  -j, --jobs=NUM               use this many parallel jobs to dump
  -v, --verbose                verbose mode
  -V, --version                output version information, then exit
  -Z, --compress=0-9           compression level for compressed formats
  --lock-wait-timeout=TIMEOUT  fail after waiting TIMEOUT for a table lock
  -?, --help                   show this help, then exit

Options controlling the output content:
  -a, --data-only              dump only the data, not the schema
  -b, --blobs                  include large objects in dump
  -c, --clean                  clean (drop) database objects before recreating
  -C, --create                 include commands to create database in dump
  -E, --encoding=ENCODING      dump the data in encoding ENCODING
  -n, --schema=SCHEMA          dump the named schema(s) only
  -N, --exclude-schema=SCHEMA  do NOT dump the named schema(s)
  -o, --oids                   include OIDs in dump
  -O, --no-owner               skip restoration of object ownership in
                               plain-text format
  -s, --schema-only            dump only the schema, no data
  -S, --superuser=NAME         superuser user name to use in plain-text format
  -t, --table=TABLE            dump the named table(s) only
  -T, --exclude-table=TABLE    do NOT dump the named table(s)
  -x, --no-privileges          do not dump privileges (grant/revoke)
  --binary-upgrade             for use by upgrade utilities only
  --column-inserts             dump data as INSERT commands with column names
  --disable-dollar-quoting     disable dollar quoting, use SQL standard quoting
  --disable-triggers           disable triggers during data-only restore
  --exclude-table-data=TABLE   do NOT dump data for the named table(s)
  --inserts                    dump data as INSERT commands, rather than COPY
  --no-security-labels         do not dump security label assignments
  --no-synchronized-snapshots  do not use synchronized snapshots in parallel jobs
  --no-tablespaces             do not dump tablespace assignments
  --no-unlogged-table-data     do not dump unlogged table data
  --quote-all-identifiers      quote all identifiers, even if not key words
  --section=SECTION            dump named section (pre-data, data, or post-data)
  --serializable-deferrable    wait until the dump can run without anomalies
  --use-set-session-authorization
                               use SET SESSION AUTHORIZATION commands instead of
                               ALTER OWNER commands to set ownership

Connection options:
  -d, --dbname=DBNAME      database to dump
  -h, --host=HOSTNAME      database server host or socket directory
  -p, --port=PORT          database server port number
  -U, --username=NAME      connect as specified database user
  -w, --no-password        never prompt for password
  -W, --password           force password prompt (should happen automatically)
  --role=ROLENAME          do SET ROLE before dump

Connection options

The connection options specify how the program connects to the cluster. All the options are straightforward except for the password: it's possible to force or suppress the password prompt, but the password itself cannot be specified on the command line. In an automated dump script this can be worked around by exporting the PGPASSWORD variable or by using the password file.
The PGPASSWORD variable is considered insecure and shouldn't be used if untrusted users can access the server. The password file is a text file named .pgpass, stored in the home directory of the OS user which connects to the cluster.
Each line of the file specifies a connection using the following format.
hostname:port:database:username:password
If, for example, we want to connect to the database db_test with the username usr_test on the host tardis with port 5432 and the password testpwd, the password file will contain this row
tardis:5432:db_test:usr_test:testpwd
For security reasons the file is ignored if it's accessible by the group or by others. In order to make it work you should issue the command chmod go-rw .pgpass. The password file is also used by other PostgreSQL programs, like the psql client.
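
For example, with the row above in place and the file permissions fixed, a script can run the dump unattended; the -w switch ensures pg_dump never prompts for a password. This is just a sketch, the output file name is arbitrary.
postgres@tardis:~/dump$ chmod go-rw ~/.pgpass
postgres@tardis:~/dump$ pg_dump -h tardis -p 5432 -U usr_test -w -f db_test.sql db_test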


General options

The general options are a set of switches used to control the backup's output and format. The -f switch, followed by FILENAME, writes the backup to the given file or directory.

The -F switch specifies the backup format and requires an argument telling pg_dump which format to use. The argument can be one of c, d, t or p, corresponding to custom, directory, tar and plain.

If the parameter is omitted then pg_dump uses the p format. This outputs a SQL script which recreates the objects when loaded into an empty database. The format is not compressed and is suitable for a direct load using the psql client.
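
For example, a plain format dump can be reloaded straight away with psql; here db_test is saved and then restored into db_test_restore, a hypothetical empty database created beforehand.
postgres@tardis:~/dump$ pg_dump -Fp -f db_test.sql db_test
postgres@tardis:~/dump$ psql -d db_test_restore -f db_test.sql
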
The custom format, together with the directory format, is the most versatile. It offers compression and flexibility at restore time; the resulting file supports the parallel restore functionality and the selective restore of single objects.
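
For example, a single table could be restored alone from a custom format archive with pg_restore; a minimal sketch, where t_data is a hypothetical table and db_test_restore the target database.
postgres@tardis:~/dump$ pg_dump -Fc -f db_test.dmp db_test
postgres@tardis:~/dump$ pg_restore -d db_test_restore -t t_data db_test.dmp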

The directory format stores the schema dump and the dump's table of contents, alongside the compressed data dump, in the directory specified with the -f switch. Each table is saved in a separate file and is compressed by default. From version 9.3 this format offers the parallel dump functionality.

The tar format stores the dump in the conservative tape archive format. This format is compatible with the directory format, does not support compression and has an 8 GB limit on the size of individual tables.

The -j option specifies the number of jobs to run in parallel when dumping the data. This feature appeared in version 9.3 and uses the transaction snapshot export to give a consistent data snapshot to the export jobs. The switch is usable only with the directory format and only against PostgreSQL 9.2 and later.
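
For example, a directory format dump running four parallel jobs could look like this; db_test_dir is a placeholder for the target directory, which pg_dump creates and which must not already exist.
postgres@tardis:~/dump$ pg_dump -Fd -j 4 -f db_test_dir db_test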

The option -Z specifies the compression level for the compressed formats. The default is 5, usually resulting in a dumped archive from 5 to 8 times smaller than the original database.
The option --lock-wait-timeout sets the number of milliseconds to wait for a table lock; when the timeout expires the dump fails. It's useful to prevent the program from waiting forever for a table lock, but it can result in failed backups if set too low.
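
For example, the following sketch takes a custom format dump with the maximum compression, failing if any table lock cannot be acquired within ten seconds.
postgres@tardis:~/dump$ pg_dump -Fc -Z 9 --lock-wait-timeout=10000 -f db_test.dmp db_test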

Output options

The output options control the contents of the backup. Some of those options are meaningful only under specific conditions, others are quite obvious.
The -a option sets the data only export. Separating schema and data has some effects at restore time, in particular on performance. We'll see in detail in 9 how to build an efficient two phase restore.
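
For example, a two phase export saves the schema and the data in separate files, to be restored in sequence; a minimal sketch, with arbitrary file names.
postgres@tardis:~/dump$ pg_dump -s -f db_test.schema.sql db_test
postgres@tardis:~/dump$ pg_dump -a -f db_test.data.sql db_test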

The -b option exports the large objects. This is the default setting unless the -n switch is used; in that case -b is required to export the large objects.

The options -c and -C are meaningful only for the plain output format. They add, respectively, the DROP and the CREATE commands before the object's DDL. For the archive formats the same options exist in pg_restore.

The -E switch specifies the character encoding for the archive. If not set, the origin database's encoding is used.

The -n switch is used to dump only the named schema. It's possible to select many schemas by specifying multiple -n switches or by using wildcards. However, despite pg_dump's efforts to resolve all the dependencies, something could be missing: there's no guarantee the resulting archive can be successfully restored.

The -N switch does the opposite of the -n switch: it skips the named schema. It accepts wildcards and it's possible to exclude multiple schemas with multiple -N switches. When both -n and -N are given, the behaviour is to dump just the schemas that match at least one -n switch but no -N switch. If -N appears without -n, then the schemas matching -N are excluded from an otherwise normal dump.
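
For example, assuming the hypothetical schemas sch_data and sch_archive, the following sketch dumps every schema matching sch_* except sch_archive; the single quotes prevent the shell from expanding the wildcard.
postgres@tardis:~/dump$ pg_dump -n 'sch_*' -N sch_archive -f sch_dump.sql db_test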

The -o option dumps the object identifiers (OIDs) as part of the table data. This option should be used only if the OIDs are part of the application's design; otherwise it should be avoided.

The -O switch has effect only on plain text exports and skips the commands setting the object ownership.

The -s option dumps only the database schema.

The -S option is meaningful only for plain text exports. It names the superuser to use when disabling the triggers, if the export is performed with the option --disable-triggers. Anyway, as the manual suggests, it's usually better to run the restore script as a superuser.

The -t switch is used to dump only the named table. It's possible to select multiple tables using wildcards or by specifying -t more than once.

The -T switch does the opposite: it skips the named table in the dump.
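
For example, assuming hypothetical tables named t_data_2013 and t_data_2014, the following sketch dumps every table matching t_data_* except t_data_2013.
postgres@tardis:~/dump$ pg_dump -t 't_data_*' -T t_data_2013 -f t_data.sql db_test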

The -x switch skips dumping the privilege settings, usually dumped as GRANT/REVOKE commands.

The option --binary-upgrade is used only by the in-place upgrade program pg_upgrade. It's not for general usage.
The option --column-inserts results in the data being exported as INSERT commands with all the column names specified. Because by default the data is saved as COPY statements, using this switch results in a bigger dump file and a very slow restore. Its sole advantage is that any error at restore time will skip just the affected row and not the entire table's load.
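
For example, a dump meant to tolerate single row failures at restore time could be taken like this; just a sketch, with an arbitrary file name.
postgres@tardis:~/dump$ pg_dump --column-inserts -f db_test.sql db_test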

The --disable-dollar-quoting switch disables the newer dollar quoting for the function bodies and uses the standard SQL quoting instead.

The --disable-triggers switch emits the commands to disable and re-enable the triggers in a data only export. Disabling the triggers ensures the foreign keys won't fail because of missing referenced data. This switch is meaningful only for the plain text export.

The --exclude-table-data=TABLE switch skips the data for the named table but still dumps the table's definition. It follows the same rules as -t and -T for specifying multiple tables.
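
For example, the following sketch dumps the whole database but skips the data of a hypothetical log table t_log, whose definition stays in the dump.
postgres@tardis:~/dump$ pg_dump --exclude-table-data=t_log -f db_test.sql db_test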

The --inserts switch dumps the data as INSERT commands. As with --column-inserts the restore is very slow; it should be used only for reloading data into non-PostgreSQL databases.

The --no-security-labels switch doesn't include the security label assignments in the dump file.

The --no-synchronized-snapshots switch allows the parallel export against pre-9.2 databases. Because the snapshot export feature is missing there, the database must not change its state while the export is running, otherwise the exported data will be inconsistent. If in doubt, do not use this option.

The --no-tablespaces switch skips the tablespace assignments.

The --no-unlogged-table-data switch does not export the data of the unlogged relations.

The --quote-all-identifiers switch causes all the identifiers to be enclosed in double quotes. By default only the identifiers which are also keywords are quoted.

The --section option restricts the dump to one of the three export sections: pre-data, with the table, view and function definitions; data, where the actual table data is dumped as COPY or INSERT commands, according to the command line options; and post-data, with the constraints, the indices and the eventual GRANT/REVOKE commands. This switch is meaningful only for text exports.
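
For example, the three sections can be saved into separate files, to be restored in sequence; a minimal sketch with arbitrary file names.
postgres@tardis:~/dump$ pg_dump --section=pre-data -f db_test.pre.sql db_test
postgres@tardis:~/dump$ pg_dump --section=data -f db_test.data.sql db_test
postgres@tardis:~/dump$ pg_dump --section=post-data -f db_test.post.sql db_test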

The --serializable-deferrable switch uses a serializable transaction for the dump, to ensure the saved database state is consistent. The dump execution waits for a point in the transaction stream free of anomalies, to avoid the risk of the dump failing or causing other transactions to abort with a serialization_failure. The option is not beneficial for a dump intended only for disaster recovery. It's useful when the dump is used to reload the data into a read only database which needs a consistent state, compatible with the origin database.

The --use-set-session-authorization switch causes the dump to use SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to set the objects' ownership. The resulting dump is more standards compatible, but the SET SESSION AUTHORIZATION command requires superuser privileges whereas ALTER OWNER doesn't.
