Convert a PostgreSQL database from LATIN1 to UTF-8

When we initially launched the Hub in private beta, we made the mistake of not specifying UTF-8 encoding in the database cluster, which had the unfortunate side effect of raising an exception every time a user would submit non-ascii characters in an input field.

Side note, we also received bug reports relating to the 2010-2 Hardy release, which was fixed in the 11.1 release by setting the encoding of all MySQL and PostgreSQL based appliances to use UTF-8 by default.

Anyway, a commonly asked question, without any real definitive and explanatory answers is "how do you convert a database or cluster from encoding X (e.g., LATIN1) to UTF-8?".

Well, there is no simple command to do the conversion, instead you need to export the database contents while encoding it to UTF-8, drop the database, recreate the database while specifying UTF-8 encoding, and finally import the data.

If you only have one database, or a select few, I'd recommend converting the whole cluster, which will make all databases use the specified encoding by default.

Keep in mind that this is potentially disastrous, so I would highly recommend doing lots of testing before hand in a virtual machine. I cannot stress this enough, test in a VM before production.

Your plan of action should go something like this:

  • Set up a replica base system in a VM
  • Export the production DB in its native encoding
  • Import the exported data into the VM and verify
  • Export the VM DB using UTF-8 encoding
  • Drop the DB, recreate it using UTF-8, import and verify
  • Rinse, repeat...

While performing testing take detailed notes, and get comfortable with the procedure so you don't have a heart attack while doing it in production.

The following should give you a good idea of the steps you need to perform on your production system:

1. Perform a full system backup (goes without saying)!

2. Backup all PostgreSQL configuration files for reference

mkdir -p /tmp/pgsql-conf-backup
cp /etc/postgresql/$VERSION/main/* /tmp/pgsql-conf-backup/

3. Put site in maintenance mode (if relevant)

4. Stop all services that access the database

5. Export database in both current encoding (e.g., LATIN1) as well UTF-8

It's good to have a pristine dump of the database in case something goes wrong so you can roll back to a working system.

cd /
su postgres
pg_dump --blobs --oids $DB_NAME > /tmp/$DB_NAME.latin1.sql
pg_dump --blobs --oids --encoding=UTF-8 $DB_NAME > /tmp/$DB_NAME.utf8.sql
exit

For reference, see the pg_dump documentation or manpage.

6. Drop the current cluster, create a new one in UTF-8

/etc/init.d/postgresql-$VERSION stop
pg_dropcluster --stop $VERSION main
pg_createcluster --start -e UTF-8 $VERSION main
/etc/init.d/postgresql-$VERSION stop

7. Tweak your PostgreSQL configuration

This should be scripted and well tested. You can also refer to the backups made in step 2.

For example, we have a customized pg_hba.conf for authentication and access, password encryption and binding to localhost.

8. Create users and database

This should also be scripted and well tested. Here we create the applications database user, create the database, and set the postgres users password.

9. Import the UTF-8 encoded data into the database and verify encoding

/etc/init.d/postgresql-$VERSION start

su postgres
psql --set ON_ERROR_STOP=on $DB_NAME < /tmp/$DB_NAME.utf8.sql

psql
\l
\q

exit

10. Start services we stopped earlier in step 4

11. Verify your application is working as expected

This is when you can breath again...

12. Take out of maintenance mode and perform cleanup (/tmp)

Add new comment