This article is outdated. MySQL supports more encodings and hopefully improper encoding is a thing of the past.
I find it very annoying when I come across web apps that still can’t handle non-ASCII characters correctly. In this article, I’ll show you how, with just a few lines of code, you can “globalize” your online app. If you look at the illustration on the left, if you don’t plan for users entering words with “weird” characters, your output will look simply awful.
The “magic” word is “UTF-8“. Just in case you heard of it but don’t quite know what it is, UTF-8 is a basically an efficient way to represent any character of the 100,000 defined by Unicode (including the Greek alphabet, Chinese symbols and typographical marks). Basic latin characters are represented just like in ASCII, one byte per letter, whereas Japanese kanji may need 3 bytes. Most modern software platforms understand Unicode and UTF-8, so all you need to do is to know how to use it.
MySQL
To create a new table with UTF-8 encoding, just write something like
CREATE TABLE names ( first_name VARCHAR(50), last_name VARCHAR(50) ) CHARACTER SET utf8 COLLATE utf8_general_ci;
In the example above, we are creating a table called ‘names
‘ contraining two columns that can hold Unicode data encoded as UTF8 (CHARACTER SET utf8
) with a case insensitive Unicode-based sort order (COLLATE utf8_general_ci
). To learn more about sorting (collation) in MySQL, check this page.
Tip: It’s a good idea to use VARCHAR
instead of CHAR
for column names, otherwise MySQL will reserve 3 bytes for each character, i.e. CHAR(50)
will reserve 150 bytes. As a sidenote, technically, UTF-8 may require up to 4 bytes per character, but MySQL currently thinks that three is good enough; apparently 4 bytes/character will be available in MySQL 6.
To alter an existing table for UTF-8, the easiest way is
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
PHP
Now that tha data is stored in UTF-8, you still need a way to specify that the input/output is UTF-8. If you’re in control of you server, you can edit your my.cnf
to add
[mysqld] default-character-set=utf8 default-collation=utf8_general_ci
but most likely you won’t have access – or you can’t rely in the users of the app to have access, so what you do is add a a line in PHP after you connect to the database:
$db_connect = mysql_connect(DB_SERVER, DB_USER, DB_PASS) or die("Could not connect"); mysql_set_charset('utf8', $db_connect);
HTML
Final step is to let browsers know that you are serving UTF-8 content, so your HTML must contain the charset, like this:
<html> <head> <meta http-equiv="Content-type" value="text/html; charset=utf-8"> ...
23 Responses
‘with just a few lines of code, you can “globalize” your online app’ can you now?…what about locales, formatting dates/numbers/currency, handling different calendars, different writing systems, etc.? wow, way to trivialize a whole industry.
Paul, I know that true globalization is not an easy matter, and PHP does not make it any easier (compared to .NET for example).
Still, the number of apps that do not take in account anything other that English is staggering so I had to start SOMEWHERE.
If there’s enough interest, I will write about language files, time/date formatting and more.
Great post! It’s a bit over my head but I’ll definitely bookmark for later.
Very nice article Armand, it would be very nice to see more articles on language files, time, date…etc
Thanks in advance.
Thanks for share. I need this.
I didn’t notice before that non-ASCII characters incompatibility can be solved easily with just additional .UTF-8 table. Thanks for nice and neat solution
Yeah, nice article. I like the way you explain.
How column data can be converted to UTF-8 if the imput was in other encoding?
To convert existing tables to UTF8, you use
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
You should really try to use “utf8_unicode_ci” and NOT “utf8_general_ci”
I did not write it, but you should read this guys page for the differences: https://www.davidtan.org/differences-between-utf8_unicode_ci-and-utf8_general_ci/
Very interesting article. I used
utf8_general_ci
because it is the setting recommended by MySQL themselves: See Configuring the Character Set and Collation for Applications (http://dev.mysql.com/doc/refman/5.0/en/charset-applications.html)….I will do some tests myself and see what are the speed differences between the two collations and if there are any other caveats, because
utf8_unicode_ci
seems more accurate indeed.I didn’t notice before that non-ASCII characters incompatibility can be solved easily with just additional .UTF-8 table. Thanks for nice and neat solution
Great, it helped me a lot. Thanks…
Very nice article Armand, it would be very nice to see more articles on this topic.
Very clear: thanks! Looks like just what I need, I’ll try it now
Thanks for good article.
Very good info.Thanks
Really nice post thanks for sharing
i am want to se more articles on this topic…thanks
Hi, to convert existing tables to UTF8, you use
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Thanks! Those two SQL Queries were what I was missing!
thank you very much!!!
i got the solution here…. thank you very much Armand……
Thanks for the review. How column data can be converted to UTF-8 if the imput was in other encoding?
Comments are closed.