Unicode
The Internet and the World Wide Web are international,
reaching people who use many different languages. At
present, much of what is on the Web is in English, but
sites in other languages are rapidly increasing in number.
Single-byte Character Sets
In the past, character sets were defined so that each
character was represented by a single-byte code, allowing
each set to contain up to 256 characters. The character set
most commonly used in browsers is ISO-8859-1, also named
Latin-1, which includes most of the characters used in
Western European languages.
Browsers can be configured to use a specific character set.
If you and everyone in your audience configure your browsers
the same way, everyone can read your text. If some readers'
browsers are set up for other character sets, at least part
of your text will be unreadable.
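
To illustrate why mismatched settings garble text, here is a
minimal Python sketch. It decodes one byte value under three
single-byte character sets. The byte 0xE9 and the choice of
ISO-8859-5 (Cyrillic) and ISO-8859-7 (Greek) as the "other"
sets are assumptions made for this example only.

```python
import unicodedata

# One byte, as it might be stored in a page written in Latin-1.
raw = bytes([0xE9])

# The same byte read under three single-byte character sets.
as_latin1 = raw.decode("iso-8859-1")    # Western European
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic
as_greek = raw.decode("iso-8859-7")     # Greek

# Print the character names so the output stays plain ASCII.
for label, ch in [("Latin-1", as_latin1),
                  ("ISO-8859-5", as_cyrillic),
                  ("ISO-8859-7", as_greek)]:
    print(label, hex(ord(ch)), unicodedata.name(ch))
```

The byte itself never changes; only the reader's assumption
about the character set does, which is why the author and the
audience must agree on it.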
In addition, languages such as Chinese, Japanese, and
Korean have many more than 256 different characters.
Publishing pages in these languages requires either putting
the text in graphics or using a multi-byte character set.
Unicode, The Multi-byte Character Set
Simply increasing the character code to two bytes (16 bits)
makes it possible to specify 65,536 characters instead of
just 256. Unicode is a two-byte character set that includes
over 60 different kinds of characters, ranging from the
Latin-1 characters to Hebrew, Thai, Japanese Katakana, and
Tibetan. It also includes many sets of symbols such as
geometric shapes, currency symbols, and diacritical marks.
Other multi-byte character sets exist. The Universal
Character Set (UCS) uses four bytes, allowing codes for up
to two billion different characters. Unicode is actually a
subset of UCS.
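
The sizes quoted above follow directly from the number of
bits in each code. The short Python sketch below works
through the arithmetic. It assumes that the "two billion"
figure corresponds to 31 usable bits of a four-byte code, and
it uses Python's "utf-16-be" codec as a stand-in for a
two-byte encoding; both are assumptions of this example.

```python
# Number of distinct codes available at each width.
single_byte_codes = 2 ** 8   # one byte
two_byte_codes = 2 ** 16     # two bytes
ucs_codes = 2 ** 31          # assumption: 31 usable bits of four bytes

# A Katakana letter (U+30AB) does not fit in a single-byte set
# such as Latin-1, but it does fit in one two-byte code.
katakana_ka = "\u30ab"
encoded = katakana_ka.encode("utf-16-be")

try:
    katakana_ka.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(single_byte_codes, two_byte_codes, ucs_codes)
print(len(encoded), fits_in_latin1)
```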
XML Is Designed For Unicode
If you need characters beyond those provided by the
single-byte sets, XML allows you to use the power of
Unicode. You can do this in two ways:
- Create your XML in a single-byte character set like
  Latin-1 and then convert your file to Unicode using a
  program like the Java Development Kit's native2ascii.
  This approach works well when you have large amounts of
  text to convert.
- Use Unicode character reference numbers to specify
  particular characters. This method is useful when you
  want only a limited number of such characters within
  your text.
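
Both approaches can be sketched in Python. The first part
re-encodes Latin-1 text as UTF-8; this is a stand-in for a
conversion tool and is not what native2ascii itself does
(that tool writes \uXXXX escapes). The second part parses a
small XML document whose only non-ASCII content is given as
character references. The element names and the sample text
are hypothetical; U+0F40 is used as an example Tibetan
letter.

```python
import xml.etree.ElementTree as ET

# Approach 1: convert text stored as Latin-1 bytes to Unicode (UTF-8).
latin1_bytes = "<name>caf\u00e9</name>".encode("iso-8859-1")
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")

# Approach 2: keep the file in a single-byte character set and use
# numeric character references, hexadecimal (&#x...;) or decimal (&#...;).
doc = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
       b'<word>&#x0F40; &#3904;</word>')
word = ET.fromstring(doc)

# Both references name the same code point, 0x0F40 == 3904.
print(len(latin1_bytes), len(utf8_bytes))
print(ascii(word.text))
```

The second form keeps the file itself readable in any
Latin-1 editor, at the cost of spelling out each extra
character by number.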
Of course, just because you specify a particular Tibetan
character in your XML file does not mean that the person
viewing the page will have a Tibetan font installed on
their computer.
Resources