I find it very annoying when I come across web apps that still can’t handle non-ASCII characters correctly. In this article, I’ll show you how, with just a few lines of code, you can “globalize” your online app. If you look at the illustration on the left, if you don’t plan for users entering words with “weird” characters, your output will look simply awful.
The “magic” word is “UTF-8“. Just in case you heard of it but don’t quite know what it is, UTF-8 is a basically an efficient way to represent any character of the 100,000 defined by Unicode (including the Greek alphabet, Chinese symbols and typographical marks). Basic latin characters are represented just like in ASCII, one byte per letter, whereas Japanese kanji may need 3 bytes. Most modern software platforms understand Unicode and UTF-8, so all you need to do is to know how to use it.
To create a new table with UTF-8 encoding, just write something like
CREATE TABLE names ( first_name VARCHAR(50), last_name VARCHAR(50) ) CHARACTER SET utf8 COLLATE utf8_general_ci;
In the example above, we are creating a table called ‘
names‘ contraining two columns that can hold Unicode data encoded as UTF8 (
CHARACTER SET utf8) with a case insensitive Unicode-based sort order (
COLLATE utf8_general_ci). To learn more about sorting (collation) in MySQL, check this page.
Tip: It’s a good idea to use
VARCHAR instead of
CHAR for column names, otherwise MySQL will reserve 3 bytes for each character, i.e.
CHAR(50) will reserve 150 bytes. As a sidenote, technically, UTF-8 may require up to 4 bytes per character, but MySQL currently thinks that three is good enough; apparently 4 bytes/character will be available in MySQL 6.
To alter an existing table for UTF-8, the easiest way is
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Now that tha data is stored in UTF-8, you still need a way to specify that the input/output is UTF-8. If you’re in control of you server, you can edit your
my.cnf to add
[mysqld] default-character-set=utf8 default-collation=utf8_general_ci
but most likely you won’t have access – or you can’t rely in the users of the app to have access, so what you do is add a a line in PHP after you connect to the database:
$db_connect = mysql_connect(DB_SERVER, DB_USER, DB_PASS) or die("Could not connect"); mysql_set_charset('utf8', $db_connect);
Final step is to let browsers know that you are serving UTF-8 content, so your HTML must contain the charset, like this:
<html> <head> <meta http-equiv="Content-type" value="text/html; charset=utf-8"> ...