Does your PHP/MySQL app speak Unicode?

This article is outdated. MySQL supports more encodings and hopefully improper encoding is a thing of the past.

I find it very annoying when I come across web apps that still can’t handle non-ASCII characters correctly. In this article, I’ll show you how, with just a few lines of code, you can “globalize” your online app. If you look at the illustration on the left, if you don’t plan for users entering words with “weird” characters, your output will look simply awful.

The “magic” word is “UTF-8“. Just in case you heard of it but don’t quite know what it is, UTF-8 is a basically an efficient way to represent any character of the 100,000 defined by Unicode (including the Greek alphabet, Chinese symbols and typographical marks). Basic latin characters are represented just like in ASCII, one byte per letter, whereas Japanese kanji may need 3 bytes. Most modern software platforms understand Unicode and UTF-8, so all you need to do is to know how to use it.

MySQL

To create a new table with UTF-8 encoding, just write something like

CREATE TABLE names ( first_name VARCHAR(50),
                     last_name VARCHAR(50)
                   )
CHARACTER SET utf8 COLLATE utf8_general_ci;

In the example above, we are creating a table called ‘names‘ contraining two columns that can hold Unicode data encoded as UTF8 (CHARACTER SET utf8) with a case insensitive Unicode-based sort order (COLLATE utf8_general_ci). To learn more about sorting (collation) in MySQL, check this page.

Tip: It’s a good idea to use VARCHAR instead of CHAR for column names, otherwise MySQL will reserve 3 bytes for each character, i.e. CHAR(50) will reserve 150 bytes. As a sidenote, technically, UTF-8 may require up to 4 bytes per character, but MySQL currently thinks that three is good enough; apparently 4 bytes/character will be available in MySQL 6.

To alter an existing table for UTF-8, the easiest way is

ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

PHP

Now that tha data is stored in UTF-8, you still need a way to specify that the input/output is UTF-8. If you’re in control of you server, you can edit your my.cnf to add

[mysqld]
default-character-set=utf8
default-collation=utf8_general_ci

but most likely you won’t have access – or you can’t rely in the users of the app to have access, so what you do is add a a line in PHP after you connect to the database:

$db_connect = mysql_connect(DB_SERVER, DB_USER, DB_PASS) or die("Could not connect");
mysql_set_charset('utf8', $db_connect);

HTML

Final step is to let browsers know that you are serving UTF-8 content, so your HTML must contain the charset, like this:

<html>
  <head>
    <meta http-equiv="Content-type" value="text/html; charset=utf-8">
    ...
Picture of Armand Niculescu

Armand Niculescu

Senior Full-stack developer and graphic designer with over 25 years of experience, Armand took on many challenges, from coding to project management and marketing.

23 Responses

  1. ‘with just a few lines of code, you can “globalize” your online app’ can you now?…what about locales, formatting dates/numbers/currency, handling different calendars, different writing systems, etc.? wow, way to trivialize a whole industry.

    1. Paul, I know that true globalization is not an easy matter, and PHP does not make it any easier (compared to .NET for example).
      Still, the number of apps that do not take in account anything other that English is staggering so I had to start SOMEWHERE.

      If there’s enough interest, I will write about language files, time/date formatting and more.

  2. Very nice article Armand, it would be very nice to see more articles on language files, time, date…etc

    Thanks in advance.

  3. I didn’t notice before that non-ASCII characters incompatibility can be solved easily with just additional .UTF-8 table. Thanks for nice and neat solution

  4. I didn’t notice before that non-ASCII characters incompatibility can be solved easily with just additional .UTF-8 table. Thanks for nice and neat solution

Comments are closed.