Over the past few weeks, Reena has been working off and on to build a new website for a local Jesuit parish. Everything was going smoothly until we hit a bump in the road with one page that contains Vietnamese text. After some small (but not obvious) configuration changes and more research on character sets than I cared to do, it's all working properly.
The initial problem was that the content being copied wasn't showing up in the editor as it was in the original document. Whenever I run into a situation like this, I usually fall back to my high school methods of building websites and open up a text editor. Simple, plain, no extra features to get in the way, just give me the code. Copy and paste worked just fine for me, so I figured that there must be a problem with the file type that Reena was using. A quick "Save As" with the text encoding set to "Unicode UTF-8" and everything looked great.
Unfortunately that wasn't enough; loading the page via a browser resulted in more broken text. I certainly don't speak or read Vietnamese, but I was pretty sure that � and ? characters in the middle of words weren't part of the language. Back to the research desk. After some digging around online, I came across a great article written by Dave Shea that covered some of the issues that can arise when dealing with HTML and foreign languages.
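To make those symptoms concrete, here is a minimal Python sketch (Python rather than PHP, purely for illustration) of how a mismatch between the bytes in a file and the character set used to read them produces each kind of damage. The sample strings are my own and not taken from the parish page.

```python
# Sample text, written with escapes so the exact code points are unambiguous:
# "Ti\u1ebfng Vi\u1ec7t" is the Vietnamese for "Vietnamese language".
text = "Ti\u1ebfng Vi\u1ec7t"

# What a correctly saved UTF-8 file contains on disk.
utf8_bytes = text.encode("utf-8")

# Case 1: UTF-8 bytes read as ISO-8859-1. Every byte is a valid Latin-1
# character, so nothing fails; each accented letter simply turns into
# three unrelated characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# Case 2: the text is forced into a charset that has no Vietnamese letters.
# Characters that cannot be represented are replaced with "?".
as_question_marks = text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

# Case 3: Latin-1 bytes read as UTF-8. The lone 0xE9 byte is not valid UTF-8,
# so it becomes U+FFFD, the replacement character.
as_replacement_char = "caf\u00e9".encode("iso-8859-1").decode("utf-8", errors="replace")

print(repr(as_latin1))
print(as_question_marks)
print(repr(as_replacement_char))
```

Case 1 is reversible (re-encode as ISO-8859-1, then decode as UTF-8), while cases 2 and 3 lose information, which is why it pays to fix the encoding at the source.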
Following his tips, I slightly changed the header of the page to the following:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="vi">
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Once these changes were in place, everything looked great. The page loaded correctly in our local test environment, so I let Reena know and went on to something else. A couple of weeks later, Reena was uploading the page to the live site and noticed some of the same text problems again. I jumped to conclusions and figured that the file encoding was off again. It wasn't, so I tested the transfer to make sure the file was arriving on the live server in the same format it had on our local server. Everything looked good, so I moved on to the configuration settings for Apache.
Since we were dealing with a foreign language, the first place I checked in the httpd.conf file was the AddLanguage and AddCharset directives. There were some differences between our live server and test server, so I modified them to match, gracefully restarted Apache, and gave it another test. Still no luck. By this point I was starting to run out of places to check, so it was back to Google for some more research.
Most of the resources I found for our problem referred to a directive I hadn't come across before, AddDefaultCharset. It exists because of some cross-site scripting issues, and it makes Apache attach a character set to the Content-Type header of the text/html documents it serves. If this directive is turned on, text/html files are sent labeled as ISO-8859-1 whether you specify the character encoding in your file or not, because the HTTP header takes precedence over the meta tag. Apache doesn't convert the file; the browser simply reads the UTF-8 bytes as ISO-8859-1, and that is where the broken text comes from.
# Default charset to iso-8859-1 (http://www.apache.org/info/css-security/).
AddDefaultCharset on
I set this to "off", gracefully restarted, and the page looked as it should. I really didn't want to leave this potential security hole open and I really didn't want to specify a default character set either. Part of the description of the AddDefaultCharset directive caught my eye when I went back and reread it. This directive would only be applied "to any response that does not have any parameter on the content type in the HTTP headers." So unless you're specifically sending a character set in a header, the character set defined by AddDefaultCharset takes over.
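The rule in that description can be sketched in a few lines of Python. This is only a simplified model of the behavior as quoted, not Apache's actual implementation; the function name and the restriction to text/html are my own assumptions for illustration.

```python
def with_default_charset(content_type, default_charset="ISO-8859-1"):
    """Return a Content-Type value after applying a default charset.

    Simplified model (hypothetical helper): if the value already carries
    any parameter (anything after a ';'), it is left untouched. Otherwise
    the default charset is appended, but only for text/html.
    """
    media_type, _, params = content_type.partition(";")
    if params.strip():
        # The response already names a parameter, so the default is skipped.
        return content_type
    if media_type.strip().lower() != "text/html":
        # Other media types are passed through unchanged in this model.
        return content_type
    return f"{media_type.strip()}; charset={default_charset}"


# A page that sets nothing gets the server default.
print(with_default_charset("text/html"))
# A page that sends its own charset keeps it.
print(with_default_charset("text/html; charset=utf-8"))
```

Seen this way, the directive only fills a gap: any response that already declares a charset in its Content-Type header is passed through as-is.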
In the end, I added a small PHP header() call at the top of the Vietnamese page and turned the AddDefaultCharset directive in Apache back on. The call has to come before the page sends any output. Because that page now sends its own character set in the Content-Type header, Apache leaves it alone and doesn't apply ISO-8859-1. The other pages don't set a character set via PHP, so Apache takes over and adds the default automatically.
header( 'Content-Type: text/html; charset=utf-8' );
So, long story short, if you're running into issues like these, check your file encoding as well as the headers being sent by your server. While Dave Shea's article is a few years old, there's still some great info in it, so give it a read.
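One way to check what the server is actually sending is to read the Content-Type header directly. The following is a rough Python sketch using only the standard library; both function names are hypothetical, and the URL in the usage note is a placeholder rather than the real site.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lower case, or None if the header has none.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def served_charset(url):
    """Fetch a URL and report the charset named in its Content-Type header."""
    with urlopen(url) as response:
        # response.headers is an email.message.Message subclass,
        # so the same helper method is available on it.
        return response.headers.get_content_charset()
```

Calling served_charset("http://example.com/vietnamese-page.php") on the test and live servers and comparing the results would show whether one of them is labeling the page as iso-8859-1 while the file itself is saved as UTF-8.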
Want even more information? Check out the links below, which should be more than enough.