


HTML and Foreign Languages

July 29, 2003

This is an article I needed to find myself six months ago. Feel free to link gratuitously with phrases like “html translation” and “unicode web” and “foreign language web site” and any other appropriate search term you can think of so that others may benefit from it.

I’ve recently had to get my hands dirty with HTML in French, Greek characters, and English OS support for Asian languages, so I figured I’d pass on the results of my muddling through while creating the various translations of the Zen Garden. This is a short but sweet summary of what I know on the subject.

Don’t Panic

First of all: How in the world do you even start with foreign character support, especially if you don’t speak the language? If you receive a foreign-language document and get asked to put it on the web, this is about the point you start panicking.

Relax, it’s actually surprisingly easy, given a fairly modern operating system with decent language support. Here’s what you need to know.

Operating System Support

You may not be able to see the document in its original character set, but depending on your OS, you might be able to copy and paste the characters between documents without damaging the data. I’ve had luck copying from Windows Notepad and pasting into my HTML editor.

The easiest way to tell is to try with a small amount of data — paste it into a properly-encoded document (see below), and view it in Mozilla or IE6. If it renders properly with the desired characters intact, you’re good to go.

If it doesn’t, you may not have the correct language pack installed — it should be possible to work with the data anyway (even if you can’t view it — just make sure you test on a system that can), but it can’t hurt to install any foreign language packs you can get your hands on, just in case. The 200MB of disk space is negligible in 2003.

File Formats

UTF-16 files are out. Do not try saving your .html, .asp, or .php files as double-byte Unicode. Most modern browsers support it, but some older ones do not (IE5/Mac comes to mind). Not only that, but your file size doubles, and IIS and PHP alike have trouble with the files, so unless you’re serving up static HTML (not likely in 2003) you won’t be able to use them anyway.

Feel free to save a properly-encoded or UTF-8 document under any extension you wish, though: .html, .php, .asp, and so on.

Document Encoding

It’s all about character encoding, baby. Redundancy is the key: define your XML namespace if working with XHTML, and also (regardless of whether you’re using HTML 4.01 or XHTML 1.x) add a <meta> tag to specify your document’s encoding.

XML Namespace:

It goes in your <html> tag, and looks like this:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

In this case, English is the language, designated by the "en". (See the complete list of ISO 639 language codes.)
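XHTML 1.0 also recommends supplying the plain HTML lang attribute alongside xml:lang, so that HTML-only user agents pick up the language too. A French page, say, would look like this:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">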

<meta> Tag Encoding:

On top of setting your XML language, 9.8 times out of 10 you’ll also want to specify document encoding. I’m a little unclear on the difference between the two, but WaSP has a summary of the best way to encode a document. Syntax looks like this:

<meta http-equiv="content-type"
    content="text/html; charset=iso-8859-1" />

The charset (character set) is the key. For most western European languages based on Latin characters, you won’t need to change this; just include it. For eastern European, Asian, and all other languages, there are different charsets — lists are available, but the best resource for this is your Mozilla-based browser: hit View -> Character Coding, and you’ll find a comprehensive list of all possibilities with their associated charset values. Use the code in brackets (UTF-8, US-ASCII, etc.) and not the full name.
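A Greek translation, for example, could use iso-8859-7, the charset that covers the Greek alphabet:

<meta http-equiv="content-type"
    content="text/html; charset=iso-8859-7" />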

Note that the WaSP article linked above has further information on server-side character encoding. This is beyond my current abilities, but is something highly recommended by the W3C. Worth a read, if you want to really do it properly.

Unicode character encoding works just fine, and in some cases is preferable. The difference here is that we’re not saving the document as a double-byte Unicode file; we’re instead merely setting the document’s charset to Unicode through the meta tag. Sample Unicode encoding:

<meta http-equiv="content-type"
    content="text/html; charset=utf-8" />

As far as individual characters go, you may want to try using HTML character entities for occurrences of non-ASCII characters. That is, you might want to use &Uuml; instead of the character itself, Ü. This can be tedious and trying though, and given proper encoding as discussed above, may even be unnecessary.
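To illustrate, here are three equivalent ways of producing the same character in a properly-encoded document:

<p>&Uuml;ber</p>  <!-- named HTML entity -->
<p>&#220;ber</p>  <!-- numeric character reference -->
<p>Über</p>       <!-- the raw character, fine if your encoding supports it -->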

Accessibility Concerns

One last thing to consider before we wrap up. WAI lists “identifying changes in language” as a priority 1 accessibility concern, which is to say, it’s Really Important that you do this. If your HTML switches at any point from the main language to another, you must provide some cue for the browser that this is happening. See the WAI for more on this.
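One simple way to provide that cue is a lang attribute (plus xml:lang in XHTML) on whatever element wraps the foreign passage. A quick sketch:

<p>As they say in France, <span lang="fr" xml:lang="fr">c’est la vie</span>.</p>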

Conclusions

This document was written by an embarrassingly unilingual English speaker with extremely limited foreign language capability beyond grade-school French classes. If I’ve managed to wrangle over a dozen translations of a document using these techniques, chances are they’re good enough for most cases. Inevitably I’ll have made some errors and over-simplified, but hey — that’s what the comments are for.

Reader Comments

July 29, 01h

Re: UTF-8 editors: SC UniPad, an excellent tool.

jacob says:
July 29, 03h

Also remember that the character encoding may also be declared in the HTTP “Content-Type” header.
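For example, a server sending a UTF-8 page would emit a header line like this:

Content-Type: text/html; charset=utf-8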

Keith says:
July 29, 10h

Great article. I often get asked to put foreign language information up at the hospital and until now figured PDF was the only realistic option. It’ll be interesting to see how it pans out, but I plan on seeing if I can’t do some of this myself.

alex says:
July 29, 10h

Thanks for the article!
By the way, have you done the Russian translation of the Zen Garden yourself?

Peter J. says:
July 29, 11h

As I understand it, if you’re writing XHTML you can’t assume the character entities will be defined unless you specify a DTD: the simple presence of the namespace isn’t enough. The safest bet if you don’t want to use “binary” characters (Unicode or charset-specific) is to use the generic XML representation: &mdash; = &#8212;, etc.
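For reference, any of the XHTML 1.0 doctypes pulls in those entity definitions; the Strict one looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">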

Also regarding double-byte encoding, file sizes, etc., I’ve found two of Tim Bray’s articles on Unicode (On the Goodness of Unicode and Characters vs. Bytes) to be real lifesavers.

Dave S. says:
July 29, 11h

Excellent, thanks for the links Peter.

I’m running translations both with HTML character entities (Norwegian) and without (Portuguese) — each passes XHTML 1.0 Strict validation without a hiccough.

So from my understanding of what you’re saying, I should just use the characters as-is (Portuguese) or go one step further and use XML representation instead of HTML character entities (Norwegian), just in case it so happens the user agent doesn’t use the standard HTML character entities — correct?

July 29, 12h

One difficulty you may encounter, if you find yourself actually working with large numbers of files saved as UTF-8, is finding an editing environment any better than Notepad that preserves the encoding. Here at work (advoy.com) the front end team has been dragged kicking and screaming into using UltraEdit, simply because it preserves the encoding type, doesn’t munch the characters, and doesn’t stick a couple of bytes of garbage (a byte-order mark) at the beginning of the file to remember how it is encoded. It auto-detects once you check the right setting under Advanced -> Configuration.

Has anyone else found an editor that deals with UTF-8 files? Homesite, along with every other editor we have tried except Notepad and UltraEdit, is unable to auto-detect UTF-8 and save as UTF-8 without adding its own flags to the beginning of the file. Any thoughts? (By the way, our team has been in touch with Macromedia regarding Homesite; they acknowledged the problem and then it went nowhere.)

Peter J. says:
July 29, 12h

At the Garden you’re fine using character entities because you’ve got the doctype. Mozilla will handle the entities unless you omit the doctype and serve your pages as application/xhtml+xml, at which point the strict XML parser kicks in and you get “XML Parsing Error: undefined entity” for the HTML entities; I can’t speak to the behaviour of other browsers that support a/x+x (are there any?).

I believe (though I can’t find the reference) that if the encoding of an XML document is unspecified in the XML declaration — or if the XML declaration is missing altogether — it is assumed to be UTF-8.
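For illustration, an explicit declaration at the very top of the document looks like this:

<?xml version="1.0" encoding="utf-8"?>

Leave out the encoding attribute (or the declaration entirely) and, barring a byte-order mark, UTF-8 is what a conforming parser falls back to.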

July 30, 03h

Dave, great article.

Besides, I just wanted to let you know that the German translation identifies itself as Dutch: xml:lang="nl".

Regards

July 30, 04h

…Sorry, wrong button… The ‘translations’ link takes me to a ‘Directory Listing Denied’ message.

Adam Rice says:
July 30, 05h

Not exactly sure what you’re getting at here. At one point, you write “Unicode files are out… Most modern browsers support it, but some older ones do not (IE5/Mac comes to mind)”. Later you write “Unicode character encoding works just fine, and in some cases is preferable.”

At any rate, I recommend UTF-8 for all non-Roman situations. IE5/Mac has no trouble with pages containing UTF-8-encoded Japanese.

You write “on top of setting your XML language, 9.8 times out of 10 you’ll also want to specify document encoding. I’m a little unclear on the difference between the two…”. The XML language is the human language the document is written in. You know, like English. The encoding is the charset, like ISO-8859-1 (Latin-1). Japanese, for example, has 3 different possible Japanese-only encodings (Shift-JIS, ISO-2022-JP, and EUC-JP), plus Unicode (similar situation for Chinese), so putting the correct encoding in there can be the difference between the browser showing the page correctly and showing a page full of junk that requires a trip to the Encoding menu.
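Putting the two side by side, a Japanese page might declare its human language in the <html> tag and its byte encoding in the <meta> tag (using Shift-JIS, one of the encodings listed above, purely as an example):

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja" lang="ja">

<meta http-equiv="content-type"
    content="text/html; charset=shift_jis" />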

July 30, 06h

I’d say Vietnamese is another good example of the difference between encoding and human language. Vietnamese can be encoded in Unicode, TCVN, VIQR, VISCII, VNI, VNCII/VPS, and Windows-1258. None of these encodings are at all compatible with one another, except for VIQR, which is basically the lowest common denominator, being an ASCII representation of Vietnamese.

chunshek says:
July 30, 08h

Heya Dave. You may also take a look at EmEditor at http://www.emurasoft.com/. I’ve worked with it for a couple of years and swear by it.

Dave S. says:
July 30, 10h

Adam - Unicode is a character encoding method. Unicode files, as I’ve poorly defined them in the article, refer to double-byte files. That is, each character occupies 16 bits of space, versus the 8 that a regular plain-text file would normally require. Those are unusable.

Unicode document encoding is a completely different beast, however. With a simple tag, I can mark the file as UTF-8 and thus use any foreign characters I require. I am unfamiliar with the way the character data is stored, though. Unicode is something I haven’t dabbled with extensively, so if I am totally wrong here, I expect someone else can correct me.

With regards to your second point - that’s about how I understood it, I just didn’t want to take the leap into the deep end. The reason I say ‘9.8 times out of 10’ is because in a lot of cases the character encoding is essential, in some it’s not. If you fool around with western languages, you’ll generally find the encoding doesn’t make a difference.

Dave S. says:
July 30, 10h

Minz & Thijs - thanks. Brain dead errors on my part, I’ll fix ‘em.

alex - no, I had help with the Russian translation.

Jacob - any links to resources on configuring HTTP headers? It’s assumed a company with a decent network admin would know this stuff, but not everyone has one of those.

Peter J. - Good to know, thanks.

Thijs says:
July 30, 11h

The whole Unicode part appears to be a bit confusing. Unicode is _not_ a ‘character encoding method’, utf-8 and utf-16 are. The utf-16 encoding is what you wrongly refer to as ‘Unicode files’.

Just ‘tagging the file as utf-8’ does not magically make the encoding utf-8. Doing that probably worked for you because: 1) the data you were copying from Notepad already _was_ encoded as utf-8, or 2) your HTML editor recognized the meta tag and made sure it encoded any characters you entered as utf-8.

I try to explain this at http://www.vandervossen.net/2003/07/do_not_fear_unicode

If your site is running on Apache, and you are allowed to change server settings using .htaccess files, you can set the Content-Type HTTP header for _all_ files with a .html extension in a directory by creating a .htaccess file with the following line:

AddCharset utf-8 .html

Make sure to test if this is working using the Mozilla ‘live http headers’ extension.

If you want to use utf-8 encoded php files, you need the following in your .htaccess file:

php_value default_charset UTF-8
AddCharset utf-8 .php
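If you have curl available, you can also check from the command line; its -I switch fetches just the headers:

curl -I http://www.example.com/page.html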

July 31, 01h

That file is indeed encoded using UTF-8.

I don’t know why IIS has problems with the other file. Could be because it is UTF-16, but I must admit I’ve never been anywhere near an IIS installation, so I cannot help you there.

Adam Rice says:
July 31, 09h

Thijs did an excellent job explaining all this. As a lagniappe, I’ll recommend Ken Lunde’s excellent book, Understanding CJKV Information Processing (O’Reilly, blowfish on the cover). Covers all the issues involved in multi-byte character sets (Unicode supports >2 bytes per character, if it ever needs it).

Dave S. says:
July 31, 09h

Thanks Thijs, that clears up a few things, but here’s why I’m still confused on the file format issue:

This file was saved straight from Notepad as “UTF-8” (it’s in the drop-down list) and uploaded. All the SSI’s work, so it renders properly. (although ignore the character encoding, since it’s buried in an include file)

This is the exact same file, saved as “Unicode” from the drop-down. IIS doesn’t like it.

So I’d imagine the latter is actually UTF-16? What I was trying to get at was that in my experience there’s a way to save a Unicode file that breaks IIS; UTF-16 looks to be that, if I’m now understanding better, but most editors don’t make that differentiation - you kinda have to know ahead of time what the difference is.

July 31, 10h

Dave, the second link gives me a 500: Internal Server Error.

Dave S. says:
July 31, 10h

That’s the point :)

July 31, 10h

The first file contains _no_ characters with a character code above 127. It indeed is encoded as UTF-8, but it is at the same time also encoded as US-ASCII and as ISO-8859-1. All these encodings share the same first 128 characters at the same positions, so if you only use those characters in a file, there is no difference.

Dave S. says:
July 31, 10h

True — try this file — it contains an assortment of international characters. It is saved as UTF-8, and IIS handles it just fine.

So why did saving the previous file as ‘Unicode’ give IIS problems? Is that because it’s UTF-16?

August 05, 02h

IIS probably assumes that you’ll be using character encodings where one byte = one character. UTF-16 uses two bytes for each character (well, four bytes for some of the less-used parts of Unicode), so for normal everyday ASCII text a UTF-16 file will have alternating bytes containing 0x00 and the ASCII characters. Those null bytes (0x00) will confuse IIS, PHP, Perl, etc., so don’t use UTF-16 for any file which is going to be interpreted by your web server (PHP, server-side includes, and so on).

UTF-8 is a variable-length encoding with some useful properties. It stores ASCII text unchanged: an ASCII text file (which would include no characters with codes over 127) is byte-for-byte identical after conversion to UTF-8. As soon as you have any non-ASCII characters (accented letters, non-Roman scripts, or symbols such as curly quotes and em dashes), the UTF-8 encoding is not going to match any other encoding. But all your server-side include directives, PHP commands, etc. are plain ASCII, so they’ll be unchanged, and therefore everything should work.
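A quick byte-level illustration of both points:

"A" (U+0041) is 0x41 in ASCII and in UTF-8, but 0x41 0x00 in little-endian UTF-16.
"é" (U+00E9) is 0xE9 in ISO-8859-1, but 0xC3 0xA9 in UTF-8.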

UTF-8 is a very cool encoding (at least if you’re writing in English; it causes file sizes to be larger than otherwise necessary for Asian languages).