Character sets, encodings and references

To improve interoperability and internationalization on the web, an HTML document should specify the character set it uses.

In the early days, the main character set on the web was ASCII. ASCII includes the English uppercase and lowercase letters, digits and some special characters.

For a medium as widely used as the WWW, that arrangement soon became inadequate, so new standards evolved.

The ISO (International Organization for Standardization) defines character sets that are used around the world. For most Indo-European languages, the names of the ISO-standardized character sets start with ISO-8859, followed by a part number (ISO-8859-#).

Some common ISO-8859 character sets:

ISO 8859-1 Western Europe
ISO 8859-2 Central and Eastern Europe
ISO 8859-3 Western and Southern Europe (Turkish, Maltese plus Esperanto)
ISO 8859-4 Western Europe and Baltic languages (Lithuanian, Estonian, Latvian and Sami)
ISO 8859-5 Cyrillic alphabet
ISO 8859-6 Arabic
ISO 8859-7 Greek
ISO 8859-8 Hebrew
ISO 8859-9 Western Europe with amended Turkish character set
ISO 8859-10 Western Europe with rationalized character set for Nordic languages, including complete Icelandic set
ISO 8859-11 Thai

To bring the different character sets together in a single standard, the Unicode Consortium developed the Unicode Standard.

The Unicode Standard defines several encodings known as UTF (Unicode Transformation Format). There are currently three of them, UTF-8, UTF-16 and UTF-32, with the first two being used most of the time.

UTF-8 uses one to four 8-bit bytes per character and is the most widely used encoding on the Web. It can represent every Unicode character, and text in Western languages stays compact because the first 128 characters of Unicode correspond to ASCII and take a single byte each. The first 256 characters of Unicode correspond to the ISO-8859-1 standard.
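As a rough illustration of the variable width of UTF-8, the following Python sketch encodes a few sample characters and counts the resulting bytes. The sample characters and variable names are chosen here for illustration only.

```python
# Byte length of a few sample characters when encoded as UTF-8.
samples = ["A", "é", "€", "😀"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(utf8_lengths)

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_compatible = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(ascii_compatible)

# The first 256 Unicode code points have the same numbers as the
# ISO-8859-1 byte values (Python calls this codec "latin-1").
latin1_matches = all(chr(i).encode("latin-1") == bytes([i]) for i in range(256))
print(latin1_matches)
```

Note that the ISO-8859-1 correspondence concerns character numbers: in UTF-8 itself, only the first 128 characters are stored as a single byte.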

UTF-16 uses one or two 16-bit words (code units) per character. Its advantage becomes obvious when the text consists mostly of characters outside ASCII, e.g. in Asian languages. Most of those characters fit into a single 16-bit unit (2 bytes) in UTF-16, while UTF-8 needs 3 bytes for them and thus consumes more memory. Less common characters take 4 bytes in both encodings.

The format called UTF-32 is rarely used. Although it can represent every character, its fixed-width design (4 bytes or 32 bits per character) means it usually consumes far more memory than necessary.
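The size trade-offs between the three formats can be checked with a small Python sketch. The helper name `encoded_sizes` and the sample characters are assumptions made for this example; the little-endian codec variants are used only so that no byte order mark is added to the count.

```python
def encoded_sizes(text):
    """Return how many bytes `text` occupies in each UTF encoding."""
    # The "-le" variants do not prepend a byte order mark (BOM),
    # so the numbers reflect the characters alone.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# A Latin letter, an accented letter, a CJK character and an emoji.
for sample in ["A", "é", "日", "😀"]:
    print(sample, encoded_sizes(sample))
```

In this sketch the CJK character is smaller in UTF-16 than in UTF-8, the ASCII letter is smaller in UTF-8, and UTF-32 always takes 4 bytes.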

To declare the character set of a particular web page, we need to name its character encoding. An encoding is a method that maps characters to a sequence of bytes, so that the receiver can convert the bytes back into the same sequence of characters.
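Why the declaration matters can be seen in a minimal Python sketch: the same bytes yield different characters depending on which encoding the reader assumes. The variable names are illustrative only.

```python
# Encode one character as UTF-8: the result is the two bytes 0xC3 0xA9.
raw = "é".encode("utf-8")

# Decoding with the encoding that was actually used restores the character.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 turns each byte into its own
# character, which gives garbled text.
as_latin1 = raw.decode("iso-8859-1")

print(raw, as_utf8, as_latin1)
```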

A common way to "tell" a server and a browser which character set should be used during a document's transmission and interpretation is to declare it in the document head, inside a <meta> element like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Servers and browsers have other means to override or "repair" the character set declared by the site developer, e.g. when the client's settings differ from, or do not allow, the encoding proposed by the website.
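One such means is the HTTP Content-Type header sent by the server, which browsers generally prefer over the <meta> declaration. As a minimal sketch of how a client might read the charset from such a header value, the following Python code uses the standard library's `email.message.Message` parser. The function name `charset_from_content_type` and the UTF-8 fallback are assumptions made for this example, not a description of what any particular browser does.

```python
from email.message import Message

def charset_from_content_type(value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    `default` is a hypothetical fallback used when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset in lower case,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html; charset=ISO-8859-2"))
print(charset_from_content_type("text/html"))
```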

To make sure that characters which might be mishandled by the encoding or disabled by the configuration are still displayed correctly, it is possible to use SGML-based character references directly in the document.

The references are expressed as numbers (decimal or hexadecimal) or as named entities.

Examples of some character references:

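char.  entity    name                  dec.      hex.
"      &quot;    quotation mark        &#34;     &#x22;
&      &amp;     ampersand             &#38;     &#x26;
'      &apos;    apostrophe            &#39;     &#x27;
<      &lt;      less-than             &#60;     &#x3C;
>      &gt;      greater-than          &#62;     &#x3E;
‘      &lsquo;   open single quote     &#8216;   &#x2018;
’      &rsquo;   close single quote    &#8217;   &#x2019;
®      &reg;     registered trademark  &#174;    &#xAE;
™      &trade;   trademark             &#8482;   &#x2122;
•      &bull;    bullet                &#8226;   &#x2022;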
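Character references can also be produced and resolved programmatically. The following Python sketch uses the standard `html` module for this; the sample strings and variable names are illustrative only.

```python
import html

# Replace the characters that have special meaning in HTML
# with their named entity references.
escaped = html.escape('<a href="x">Tom & Jerry</a>')
print(escaped)

# Named, decimal and hexadecimal references all resolve to the same character.
resolved = html.unescape("&bull; &#8226; &#x2022;")
print(resolved)

# Build the numeric references for a character from its code point.
bullet = "•"
decimal_ref = f"&#{ord(bullet)};"
hex_ref = f"&#x{ord(bullet):X};"
print(decimal_ref, hex_ref)
```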
