Of Databases and Character Encodings
Gimlet is old. Its roots date back to mid-2006, when it started life as a Ruby on Rails 1.1.6 program. Along the way, we’ve built up a bit of cruft here and there; last night, we cleaned a bunch of it out when we fixed the character encodings in our database. This paves the way for us to upgrade to newer and better tools.
Feeling like a nerd? Read on.
Hey what's a character encoding?
Since the days of Braille, Morse code, and telegraphs, people have been using binary codes to represent letters and numbers. The idea is simple: As long as everyone agrees on the letters and numbers and their codes (say, ‘01000001’ means ‘A’) writing a program to turn those 1s and 0s into human-readable text is easy.
For many years, the “standard” character encoding was called ASCII (or the closely-related Latin-1) – which covered all the characters you see on a American keyboard plus a few extras. However! There’s no way to represent ב in Latin-1. Can’t do it. It’s not a Latin character. Because people really want to store documents in different languages, it was time for a new standard. After some fits and starts, most of the world settled on a standard called UTF-8. Instead of just being able to represent all the letters, numbers, and symbols used in American English, UTF-8 has enough capacity to represent all the letters, numbers, and symbols in every language written by humans.
As you might imagine, that’s the encoding we’re using in Gimlet.
OK so what did we do last night?
As we mentioned, Gimlet is old. Old enough that UTF-8 hadn’t completely taken over the world when we started working on it. But, being forward-thinking, we created the database with UTF-8-encoded tables and served our web pages with a UTF-8 encoding. Everything looked great – until we upgraded the software that connected to our database.
Suddenly, where we expected to see Gimlét, we saw Gimlét – and a variety of similar oddities. Upon closer investigation, it turned out that the old database connector had been talking to the database in Latin-1. We were taking UTF-8 data from the Web, encoding it as Latin-1, and storing it in UTF-8 tables. When we got it back out, the database connector kindly re-converted the data so it looked nice to everyone. The new database connector wasn’t having any of this funny business, so all non-Latin-1 characters looked broken to everyone. We asked our new database connector to deal in Latin-1, and the problem went away, though we were still storing “broken” data in our database. An uneasy stalemate was reached – until we went to upgrade Ruby.
Ruby 1.9 brooked none of this silliness. It steadfastly refused to cooperate in our decode-recode scheme, dutifully keeping the data in UTF-8 format all the way from the database to the screen, with predictably broken-looking results. People on forums suggested hacks, but they were ugly hacks. It was time to fix the data.
Fortunately, it was easy to see the problem in a standalone database management program:
select name from accounts where id = 1;
Gimlét
We needed to convert the text to UTF-8. Of course, the database already thought the text was in UTF-8, so this didn’t work:
select name, convert(name using utf8) from accounts where id = 1;
Gimlét Gimlét
The trick was to do what our old database connector had been doing: take the data, interpret it as Latin-1, then re-interpret it as UTF-8. Because MySQL is “smart” about character encodings, we also had to tell it the characters were just a string of bytes after its Latin-1 encoding. Finally, we tell it to re-code the data as real UTF-8:
select name, convert(cast(convert(name using latin1) as binary) using utf8) from accounts where id = 1;
Gimlét Gimlét
And from there, we wrote a little script to fix all the text columns in the database, and all was well.