Last Modified: 1/29/08
  Computer Training
Introduction to XML

Unicode

The Internet and the World Wide Web are international, reaching people who use many different languages. At present, much of what is on the Web is in English, but sites in other languages are rapidly increasing in number.

Single-byte Character Sets

In the past, character sets were defined so that each character was represented by a single-byte code, allowing each set to contain up to 256 characters. The character set most commonly used in browsers is ISO-8859-1, also called Latin-1, which includes most of the characters used in Western European languages.
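As a rough illustration of the one-byte-per-character idea, the sketch below uses Python's built-in "latin-1" codec. The sample strings are made up for this example and are not part of the course material.

```python
# A minimal sketch: in Latin-1 every character is stored as exactly one byte.
text = "caf\u00e9"                    # "café"; the last letter is U+00E9
encoded = text.encode("latin-1")      # one byte per character

print(len(text), "characters ->", len(encoded), "bytes")
print(encoded)

# A character outside the 256 Latin-1 codes cannot be stored at all.
try:
    "\u30ab".encode("latin-1")        # Katakana letter KA
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print("Katakana fits in Latin-1:", fits_in_latin1)
```

The failed conversion at the end is the limitation that multi-byte character sets are meant to remove.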

Browsers can be configured to use a specific character set. If you and everyone in your audience have your browsers configured for the same character set, everyone can read your text. If some browsers are set up for a different character set, at least part of your text will be unreadable.
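The effect of a mismatched setting can be imitated in a few lines. This is only a sketch: the sample sentence is invented, and the Greek set ISO-8859-7 stands in for "some other character set" that a reader's browser might be configured to use.

```python
# The author saves the page as Latin-1 bytes.
original = "na\u00efve caf\u00e9"             # "naïve café"
page_bytes = original.encode("iso-8859-1")

# Reader A interprets the bytes with the same character set: text is intact.
reader_a = page_bytes.decode("iso-8859-1")

# Reader B interprets the very same bytes as ISO-8859-7 (Greek):
# the accented letters come out as different characters.
reader_b = page_bytes.decode("iso-8859-7")

print("Reader A sees:", reader_a)
print("Reader B sees:", reader_b)
print("Identical to the original?", reader_a == original, reader_b == original)
```

The bytes never change; only the interpretation does, which is why only the characters outside plain ASCII are damaged.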

In addition, languages such as Chinese, Japanese, and Korean have many more than 256 different characters. Publishing pages in these languages requires either putting the text in graphics or using a multi-byte character set.

Unicode, The Multi-byte Character Set

Simply increasing the character code to two bytes (16 bits) makes it possible to specify 65,536 possible characters, instead of just 256. Unicode was designed as a two-byte character set covering dozens of scripts, ranging from the Latin-1 characters to Hebrew, Thai, Japanese Katakana, and Tibetan. It also includes many sets of symbols such as geometric shapes, currency symbols, and diacritical marks. Later versions of the standard extend beyond 16 bits, so some characters now need more than two bytes.
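To make the two-byte idea concrete, the sketch below encodes one character from several of the scripts named above as 16-bit codes. The particular characters are picked for illustration only, and Python's "utf-16-be" codec is used here as a stand-in for a plain two-byte encoding.

```python
# One sample character per script (chosen for illustration only).
samples = {
    "Latin-1":  "\u00e9",   # e with acute accent
    "Hebrew":   "\u05d0",   # letter ALEF
    "Thai":     "\u0e01",   # letter KO KAI
    "Katakana": "\u30ab",   # letter KA
    "Tibetan":  "\u0f40",   # letter KA
}

sizes = {}
for script, ch in samples.items():
    two_byte_code = ch.encode("utf-16-be")     # big-endian 16-bit code
    sizes[script] = len(two_byte_code)
    print(f"{script:9} U+{ord(ch):04X} -> bytes {two_byte_code.hex()}")

print("Distinct codes available in 16 bits:", 2 ** 16)
```

Each of these characters occupies exactly two bytes, and the bytes are simply the character's Unicode number.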

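Other multi-byte character sets exist. The Universal Character Set (UCS), defined in ISO/IEC 10646, uses four bytes in its UCS-4 form, allowing codes for up to two billion different characters. Unicode characters carry the same code numbers in UCS, so Unicode is in effect a subset of UCS.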
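The relationship between the two-byte and four-byte forms can be checked with a short sketch. It assumes Python's "utf-32-be" codec as a reasonable stand-in for the four-byte UCS form; the sample characters are arbitrary.

```python
# Compare the two-byte and four-byte codes of a few characters.
rows = []
for ch in ["A", "\u00e9", "\u0f40"]:
    two = ch.encode("utf-16-be")      # 16-bit code
    four = ch.encode("utf-32-be")     # 32-bit code
    rows.append((ch, two, four))
    print(f"U+{ord(ch):04X}  two-byte: {two.hex()}  four-byte: {four.hex()}")

# The four-byte code is the same number, padded with leading zero bytes.
same_number = all(
    int.from_bytes(two, "big") == int.from_bytes(four, "big")
    for _, two, four in rows
)
print("Same code number in both forms:", same_number)
```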

XML Is Designed For Unicode

If you need characters beyond those provided by the single-byte sets, XML lets you use the power of Unicode. You can do this in two ways:

  1. Create your XML in a single-byte character set like Latin-1 and then convert the file to Unicode using a program like the Java Development Kit's native2ascii. This approach works well when you have large amounts of text to convert.
  2. Use Unicode character reference numbers to specify particular characters. A character reference has the form &#nnnn; (decimal) or &#xhhhh; (hexadecimal). This method is useful when you need only a limited number of such characters within your text.
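Both approaches can be sketched in Python. Note that this does not use native2ascii: the conversion step is done with Python's own codecs instead, and the element names and sample text are hypothetical. The parser is the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Way 1: text written in Latin-1 is converted to a Unicode encoding (UTF-8).
latin1_bytes = "<note>caf\u00e9</note>".encode("iso-8859-1")
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")
note = ET.fromstring(utf8_bytes)      # XML without a declaration is read as UTF-8
print("Converted text:", note.text)

# Way 2: the file itself stays plain ASCII; a numeric character reference
# names the Unicode character. U+0F40 is the Tibetan letter KA.
hex_form = ET.fromstring("<word>&#x0F40;</word>")
dec_form = ET.fromstring("<word>&#3904;</word>")   # 3904 is 0x0F40 in decimal

print(f"Hex reference gives     U+{ord(hex_form.text):04X}")
print(f"Decimal reference gives U+{ord(dec_form.text):04X}")
```

The second way keeps the source file readable in any single-byte editor, at the cost of typing one reference per character.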

Of course, just because you specify a particular Tibetan character in your XML file does not mean that the person viewing the page will have a Tibetan font installed on their computer.

©1999 UW Technology