The Unicode rules in XML

The Unicode Consortium develops the Unicode Standard, which defines the characters and the UTF (Unicode Transformation Format) encodings used to represent them. Unicode supports most widely used scripts, such as Latin, Greek, Han, Arabic, Hebrew, and Devanagari.

In Unicode, each character is represented by a number called a code point. Code point charts are available at http://www.unicode.org/charts/. Depending on how the code points are encoded as bytes, Unicode has the following encoding methods:
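To make the idea of a code point concrete, here is a minimal Python sketch that looks up the code points of a few characters. The sample characters and the variable names (sample_chars, code_points) are chosen only for illustration.

```python
# Map a few sample characters to their Unicode code points.
# Escapes are used so the source does not depend on the file's own encoding.
sample_chars = ["A", "\u00e9", "\u0905", "\u4e2d"]  # A, e with acute, Devanagari A, Han "middle"

code_points = {ch: ord(ch) for ch in sample_chars}

for ch, cp in code_points.items():
    # Code points are conventionally written as U+ followed by hex digits.
    print(f"{ch!r}: decimal {cp}, U+{cp:04X}")

# chr() is the inverse of ord(): it turns a code point back into a character.
round_trip = [chr(cp) for cp in code_points.values()]
```

Note that a code point is only a number; how that number is stored as bytes depends on the encoding method.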

UCS-2

UCS-2 represents each character in 2 bytes. For example, A has code point 65 in Unicode and is represented as 0x0041. The disadvantage is that it supports only a limited number of characters: 2 bytes can address at most 65,536 code points.
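The fixed two-byte scheme described above can be sketched in Python as follows. The helper encode_two_bytes is hypothetical (it is not a standard library function) and assumes big-endian byte order; it only illustrates why a 2-byte unit runs out of room.

```python
def encode_two_bytes(ch: str) -> bytes:
    """Encode one character as a single 2-byte big-endian unit.

    Hypothetical helper for illustration only; byte order is assumed big-endian.
    """
    cp = ord(ch)
    if cp > 0xFFFF:
        # Two bytes can hold only 65,536 distinct values (0x0000 to 0xFFFF).
        raise ValueError(f"U+{cp:04X} does not fit in 2 bytes")
    return cp.to_bytes(2, "big")


a_bytes = encode_two_bytes("A")
print(a_bytes.hex())  # 0041

# A character beyond U+FFFF cannot be represented in this scheme.
try:
    encode_two_bytes("\U0001F600")
    fits = True
except ValueError:
    fits = False
print(fits)  # False
```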

UTF-8

UTF-8 is the standard character encoding on the web. It is a variable-length encoding that uses 1 to 4 bytes (8 to 32 bits) per character. UTF-8 is the default encoding for HTML5 and CSS, is widely used with JavaScript, PHP, and SQL databases, and is what an XML parser assumes when a document carries no encoding declaration or byte order mark.
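The variable length is easy to observe with Python's built-in str.encode. This is a small sketch; the sample characters and the names utf8_samples and utf8_lengths are chosen for illustration.

```python
# Characters from different code point ranges need different byte counts in UTF-8.
utf8_samples = {
    "A": "Latin capital A, U+0041",
    "\u00e9": "e with acute, U+00E9",
    "\u20ac": "euro sign, U+20AC",
    "\U0001F600": "grinning face, U+1F600",
}

utf8_lengths = {}
for ch, label in utf8_samples.items():
    encoded = ch.encode("utf-8")
    utf8_lengths[ch] = len(encoded)
    print(f"{label}: {len(encoded)} byte(s), hex {encoded.hex()}")
```

ASCII characters such as A stay at a single byte, which is one reason UTF-8 is compact for markup-heavy formats like XML.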