by Gabe Bokor
Web page design has become an increasingly valuable skill for
translators, both for creating their own Web sites and for meeting
their customers' needs for
multilingual
Web pages. The question is: "How can a non-English Web page
be created that appears correctly on most browsers and platforms,
with a variety of user settings and different fonts installed
in the author's and the users' systems?"
Although WYSIWYG Web authoring software is becoming increasingly
international, it usually lags behind current developments
in Web standards. While such software allows people to create
Web pages without learning and using codes, some knowledge
of HTML and related standards is useful when creating more
sophisticated or multilingual Web pages.
From ASCII to Latin-1
The original character set of the Internet is ASCII
(American Standard Code for Information Interchange), the
characters you can find on the keyboard of a standard American
computer or typewriter. Even today, most information is transmitted
over the Internet in the form of ASCII characters, with each
character being encoded with a number in the range from 0
to 127. In HTML, the language of the Web, this encoding is
represented as &#xxx; with the x's standing for the character's
code number. Thus, the letter "a" can be encoded as &#97;,
the letter "A" as &#65;, and the curly bracket "{" as
&#123;.
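As a minimal sketch of this notation, the following Python snippet builds the decimal numeric reference for a character from its code number. The function name to_numeric_reference is ours, chosen for illustration only.

```python
def to_numeric_reference(ch: str) -> str:
    """Return the decimal HTML numeric character reference for one character."""
    # ord() gives the character's code number; the reference wraps it in &# ... ;
    return f"&#{ord(ch)};"


for ch in "aA{":
    print(ch, "->", to_numeric_reference(ch))
```

The same function also works for characters beyond ASCII, since ord() returns the code number of any character.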
It was not difficult to extend this notation from ASCII to
the characters used by most Western European languages, resulting
in the "Latin-1" character set (also known as the ISO-8859-1
or, somewhat erroneously, as the "extended ASCII" character
set). Whereas there are 128 possible 7-bit ASCII characters,
the 8-bit Latin-1 encoding allows 256 characters to be represented.
Latin-1 includes, in addition to the regular ASCII characters,
non-English letters such as "é" (&#233;) and "ñ" (&#241;),
and some special symbols, such as the section sign (§), encoded
as &#167;. The bullet (•) is often listed with them, but it sits
at position 149 of Windows Code Page 1252 and is not part of
ISO-8859-1 proper; its portable numeric form is &#8226;. Most
of the "extended ASCII" characters also have a named form,
which is easier to remember than the numeric form. For example,
"é" can be used in a Web page as either &#233; or &eacute;,
"ñ" as &#241; or &ntilde;, and "§" as &#167; or
&sect;. All the major Web browsers support both forms
of encoding.
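The equivalence of the numeric and named forms can be checked with Python's standard html module. This is only a sketch; the helper name both_forms is an assumption made for this example.

```python
import html
from html.entities import name2codepoint


def both_forms(name):
    """Return the (numeric, named) HTML references for a named entity."""
    code = name2codepoint[name]  # e.g. "eacute" -> 233
    return f"&#{code};", f"&{name};"


for name in ("eacute", "ntilde", "sect"):
    numeric, named = both_forms(name)
    # Both references decode to the same character.
    print(numeric, named, html.unescape(numeric), html.unescape(named))
```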
The numerical and named HTML codes of the Latin-1 character
set can be found, for example, at
http://www.owlnet.rice.edu/~jwmitch/iso8859-1.html.
The newer versions of Netscape Navigator and Microsoft Internet
Explorer don't need any encoding for the Latin-1 characters
(ISO-8859-1, which Windows Code Page 1252 closely follows)
when configured for the Western European character set (default
for U.S. and Western European browsers). In that case, a Web
page text typed using the standard Windows keyboard (U.S.
or U.S. International) and the Windows character set appears
correctly on browsers running on either Mac or Windows machines,
although the Mac doesn't support some characters of the Latin-1/ISO-8859-1
character set.
From Latin-1 to Unicode
Note: You may not be able
to read some of the characters used in this section unless
you have the Eastern European and Cyrillic character sets
of Windows 95/98 installed in your system. These character
sets are part of the Multilanguage Support of the standard
Windows 95/98 software.
Most languages of the world, however, are not restricted
to the ISO-8859-1 character set. Hungarian, for example, has
the characters ő, Ő, ű, and Ű; Czech has ů and ř; Romanian
has ţ and other characters, which are not part of Latin-1.
Then there are characters of alphabets other than our Roman
alphabet, such as Cyrillic, Hebrew, Arabic and
others. Over the years, several forms of encoding have been
devised for these characters either by replacing some of the
256 Latin-1 characters by others or by using a completely
different 8-bit or 16-bit encoding. Some examples of these
character sets are given in the table below.
| Character set | Languages |
|---|---|
| iso-8859-1 or CP-1252 or Latin-1 | Western European |
| iso-8859-2 or CP-1250 or Latin-2 | Eastern European |
| iso-8859-3 | Esperanto, Galician, Maltese, Turkish |
| iso-8859-4 | Scandinavian, Baltic |
| iso-8859-5 | Cyrillic |
| iso-8859-6 | Arabic |
| iso-8859-7 or CP-1253 | Greek |
| iso-8859-8 | Hebrew |
| GB2312 | Simplified Chinese |
| Big5 | Traditional Chinese |
| Shift_JIS, EUC-JP | Japanese |
| KOI8-R | Russian |
| ISO-2022-KR, EUC-KR | Korean |
Table 1 - Character sets
Outside the ISO-8859-1 character set, a character generated
in a word processor or editor may not be interpreted correctly
by a browser. For example, the character "Û" generated in
a Windows word processor will be displayed by a browser as
"Û" under the Western European (ISO-8859-1) encoding, as a
(Hungarian) "Ű" under the Central European encoding (ISO-8859-2),
as a (Russian) "л" under ISO-8859-5, as "Ы" under Windows-1251,
and as "ш" under KOI8-R.
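The effect can be reproduced with a short Python sketch. It assumes that the Windows word processor stores "Û" as the single byte 0xDB (its position in Code Page 1252), and it uses Python's codec names for the character sets; the helper name decode_under is illustrative.

```python
# The single byte stored for "Û" in a Windows (Code Page 1252) text file.
RAW = bytes([0xDB])

# Python codec names for the character sets discussed in the text.
CHARSETS = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "windows-1251", "koi8-r"]


def decode_under(raw, charsets=CHARSETS):
    """Decode the same bytes under each character set and collect the results."""
    return {name: raw.decode(name) for name in charsets}


for name, text in decode_under(RAW).items():
    print(f"{name:>12}: {text}")
```

The byte never changes; only the table used to interpret it does, which is why the browser has to be told (or has to guess) the character set.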
The Web author has two options to make sure the characters
of the page will be displayed as intended:
1. She may convert each character to the proper code (either
manually, using code tables, or automatically, using appropriate
word processing or Web authoring software);
2. She can use the special characters unencoded and instead
tell the browser how to interpret them.
If the latter option is selected, a META tag containing the
"charset" (character set) attribute is inserted in the header
of the Web page. For example, the following META tag tells
the browser that the page is a Russian page generated with
the Windows Code Page 1251 character set:
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=WINDOWS-1251">
Tags are not case-sensitive.
Another Russian encoding system (KOI8-R) would use the META
tag
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=koi8-r">
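To illustrate how a charset declaration lets the reader's software interpret unencoded characters, here is a rough Python sketch that looks for the charset in a META tag and decodes the page accordingly. The regular expression, the function name decode_page, and the ISO-8859-1 fallback are assumptions made for this example; real browsers use more elaborate detection.

```python
import re

# Find "charset=..." inside a META tag; tags are not case-sensitive.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def decode_page(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode page bytes using the charset named in its META tag, if any."""
    match = META_CHARSET.search(raw)
    charset = match.group(1).decode("ascii") if match else fallback
    return raw.decode(charset)


# A Russian page saved with Windows Code Page 1251.
page = (
    '<HTML><HEAD><META HTTP-EQUIV="CONTENT-TYPE" '
    'CONTENT="TEXT/HTML; CHARSET=WINDOWS-1251"></HEAD>'
    "<BODY>Привет</BODY></HTML>"
).encode("windows-1251")

print(decode_page(page))
```

The tag itself consists of ASCII characters only, so it can be read before the character set of the rest of the page is known.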
Another hint to the browser about the type of encoding
is the META tag containing the language attribute. For Russian,
this tag would be the following:
<META NAME="LANGUAGE" CONTENT="RU">
The language code, together with the direction-of-text attribute
DIR (with its two possible values LTR and RTL for left-to-right
and right-to-left) and the BDO override element, can also be
applied at the level of Web page elements, such as paragraph <P>
or table cell <TD>. In HTML 4.0 the element-level language
attribute is called LANG. For example, a paragraph in Hebrew
might have the following tag:
<P LANG="he" DIR="rtl">
Unfortunately, at this time not all browsers understand the
language and direction-of-text tags, and they do not implement
the character set/language indicated in the META tag and at
page element level consistently.
If the Web author didn't indicate the type of encoding of the
page, using the charset and/or language attribute in the META
tag, the reader must select the proper character set in his or
her browser. In Microsoft Internet Explorer 4.0, this selection
is made under View - Fonts, in Netscape Navigator 4.0 and
Microsoft Internet Explorer 5.0 under View - Encoding, and in
Netscape Navigator 4.5 under View - Character Set.
Regardless of the type of encoding used by the Web author,
the reader of the Web page must have the appropriate fonts
installed in order to correctly read the page.
Most of the historic character sets listed in Table 1 (except
those of the ISO family) are mutually incompatible. Because
different browsers interpret page-level and page element-level
charset and language codes inconsistently, results
can be unpredictable when languages encoded with different
character sets, such as Chinese and Russian, are used on the
same Web page. A character set that would include all the
above character sets as subsets was needed.
Unicode (ISO-10646), in which each character is encoded using 16 bits
(compared to 7 bits in ASCII), is such a universal character
set. Unicode theoretically allows 65536 characters to be generated,
covering all the known languages of the world, many graphic
symbols, computer commands, and more. A projected extension
of the standard will allow this number to be increased to
about a million. As of this writing, about 40,000 Unicode
characters have been defined.
The structure of Unicode is the following (the four hexadecimal
digits after the U+ symbol give a character's code; each row
shows the range of codes assigned to that set):
| Code range | Character set |
|---|---|
| U+0000 - U+007F | ASCII (Standard English; can be combined with other blocks) |
| U+0080 - U+00FF | Latin 1 (Danish, Dutch, Spanish, French, Italian, etc.) |
| U+0100 - U+017F | European Latin (Czech, Polish, Romanian, etc.) |
| U+0180 - U+01FF | Extended Latin (Croatian digraphs and Pinyin diacritic vowels) |
| U+0250 - U+02AF | Standard Phonetic (International Phonetic Association characters) |
| U+02B0 - U+02FF | Modifiers (glottal stops, tone transcription letters, etc.) |
| U+0300 - U+036F | Generic Diacritics (umlauts, Vietnamese tone marks, etc.) |
| U+0370 - U+03FF | Greek and Coptic |
| U+0400 - U+04FF | Cyrillic and Cyrillic variants (Serbian, etc.) |
| U+0530 - U+058F | Armenian |
| U+0590 - U+05FF | Hebrew and Yiddish |
| U+0600 - U+06FF | Arabic |
| U+0900 - U+097F | Devanagari |
| U+0980 - U+09FF | Bengali |
| U+0A00 - U+0A7F | Gurmukhi |
| U+0A80 - U+0AFF | Gujarati |
| U+0B00 - U+0B7F | Oriya |
| U+0B80 - U+0BFF | Tamil |
| U+0C00 - U+0C7F | Telugu |
| U+0C80 - U+0CFF | Kannada |
| U+0D00 - U+0D7F | Malayalam |
| U+0E00 - U+0E7F | Thai |
| U+0E80 - U+0EFF | Lao |
| U+1000 - U+105F | Tibetan |
| U+10A0 - U+10FF | Georgian |
| U+2000 - U+27BF | General Punctuation, Symbols, Dingbats, Arrows, Blocks, etc. |
| U+3000 - U+303F | CJK (Chinese, Japanese, Korean) Symbols and Punctuation |
| U+3040 - U+309F | Hiragana |
| U+30A0 - U+30FF | Katakana |
| U+3100 - U+312F | Bopomofo (Chinese/Mandarin phonetic characters for teaching) |
| U+3130 - U+318F | Hangul Elements (Korean) |
| U+3190 - U+3D2F | CJK Marks, Letters, Enclosed Ideographs, etc. |
| U+4000 - U+8BFF | Chinese/Japanese/Korean Han Ideographic characters |
Table 2 - The structure of Unicode
(Source: A Look at Unicode: Classifying Language by Lorri
Mon http://www.uic.edu/~lmon/unicode.html )
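As a small illustration of how the ranges in Table 2 are used, the Python sketch below looks up the set a character belongs to. Only a handful of rows from the table are included, and the names BLOCKS and block_of are ours.

```python
# A few (first code, last code, name) rows taken from Table 2.
BLOCKS = [
    (0x0000, 0x007F, "ASCII"),
    (0x0080, 0x00FF, "Latin 1"),
    (0x0100, 0x017F, "European Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew and Yiddish"),
    (0x0600, 0x06FF, "Arabic"),
]


def block_of(ch):
    """Return the name of the listed range containing ch, or None."""
    code = ord(ch)
    for first, last, name in BLOCKS:
        if first <= code <= last:
            return name
    return None  # not covered by the rows copied above


for ch in "aéŰш":
    print(f"{ch}  U+{ord(ch):04X}  {block_of(ch)}")
```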
The notation &#xxxx; used for the Latin-1
character set is understood by the most recent versions of
Web browsers when it's extended to the higher Unicode characters.
Thus, the double-dagger symbol ‡ can be coded as &#8225;,
where 8225 is the decimal equivalent of the character's code
U+2021 in the hexadecimal notation of Table 2. The character
is located in the U+2000 - U+27BF range (General Punctuation,
Symbols, Dingbats, Arrows, Blocks, etc.) of Table 2.
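The hexadecimal-to-decimal conversion can be sketched in a few lines of Python. The function name reference_from_code is hypothetical and used only for this example.

```python
def reference_from_code(u_plus: str) -> str:
    """Turn a code written as 'U+XXXX' into a decimal numeric reference."""
    # Strip the "U+" prefix and read the remaining digits as hexadecimal.
    code = int(u_plus.upper().replace("U+", ""), 16)
    return f"&#{code};"


print(reference_from_code("U+2021"))  # the double dagger
print(chr(0x2021))                    # the character itself
```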
One encoding form of Unicode, known as UTF-8 (Unicode Transformation
Format), uses variable-length encoding: one byte (7 bits)
for ASCII and 2 to 6 bytes (up to 31 bits) for the other character
sets. Thus, ASCII is a subset of UTF-8, and the ASCII characters
can be used unencoded on a UTF-8 Web page. The META tag with
the charset attribute in the header of a UTF-8-encoded page
is the following:
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=UTF-8">.
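The variable-length behavior of UTF-8 is easy to observe in Python. This is a minimal sketch; the sample characters and the helper name utf8_length are chosen here for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch in ["A", "é", "Ű", "ш", "‡", "中"]:
    print(f"{ch!r}: {utf8_length(ch)} byte(s)")

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("Web page".encode("utf-8") == "Web page".encode("ascii"))
```

ASCII characters take one byte each, the accented Latin and Cyrillic letters in this sample take two, and the double dagger and the Chinese character take three.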
More information about the Unicode standard can be found at
http://www.unicode.org/unicode/standard/standard.html.
The charts of all Unicode characters, in .pdf format, can
be accessed from the page http://www.unicode.org/charts/.
How to generate and read Unicode text
Microsoft Word v. 8 provides Unicode support, allowing
the user to save a file as "Unicode Text," but I've found
this feature to be quite buggy. This version of Word uses
not UTF-8, but 16-bit encoding, i.e., even ASCII characters
are encoded with two bytes; the second byte appears as an
empty square on the screen. Chris Pratley, Program Manager
of Microsoft Office (chrispr@MICROSOFT.com),
informed me that the version of Word that comes with Office
2000 has solved this and other problems.
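The two-byte behavior described above can be sketched in Python. The example assumes that the "Unicode Text" format corresponds to little-endian UTF-16 (an assumption on our part, not confirmed by the source); the helper name as_utf16_bytes is illustrative.

```python
def as_utf16_bytes(text: str) -> bytes:
    """Encode text with two bytes per character (UTF-16, little-endian)."""
    return text.encode("utf-16-le")


word = "Web"
print(as_utf16_bytes(word))   # each ASCII letter is followed by a zero byte
print(word.encode("utf-8"))   # the same text in UTF-8: one byte per letter
```

A program that reads such a file one byte at a time sees a zero byte after every ASCII letter, which would explain the empty squares on the screen.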
You can also generate a UTF-8-encoded Web
page in Microsoft FrontPage 2000 by selecting File - Properties
- Language - Save Document As - Multilingual (UTF-8) or by
manually changing the charset META tag in the HTML view of
the page. Selecting UTF-8 under Tools - Page Options - Default
Font - Multilingual UTF-8 will not do the trick: the page
defaults back to Windows-1252 or the character set of the
first character typed.
Regarding other software for generating UTF-8 code, Otto Stolz
(Otto.Stolz@uni-konstanz.de)
informed me as follows:
"You could edit your HTML source with UniEdit and store it
in UTF-8. See http://www.lang.duke.edu/uniintro.htm.
You may also wish to try Tango Creator, the Unicode-capable
HTML editor from Alis. See http://www.alis.com/internet_products/creator/creator.html."
Both Netscape Navigator and Microsoft Internet Explorer v.
4.0 and higher are fully UTF-8-compatible.
If everything else fails
If the compatibility problems between the Web author and the intended
reader cannot be solved via coding, the last-resort solution
is the graphic format. Of course, any character can be converted
into, and displayed as, graphics, in which case no character
encoding is required and the reader does not need to have any
special font installed in his or her system. The graphic formats
handled by Web browsers are JPEG and GIF. The relative merits
of these two formats are beyond the scope of this article, but
it's good to remember that graphics always take up more disk
space and load more slowly than the respective text files. They
are also more difficult to modify if the text has to be changed
for any reason at a later time. Therefore, for fast-loading,
sharp, editable characters in any language, text is preferable.
Use graphics for compatibility where loading speed is of little
or no concern.